This vignette shows how to use projpred
to perform variable selection in the
context of Generalized Linear Multilevel Models (GLMMs). For a general overview
of the package we refer the reader to the quickstart vignette. The method used
here is described in the latest preprint (by Catalina et al. (2020), arxiv
link).
Load the necessary packages. If the sampling takes more than 30 seconds and
multiple cores are available, uncomment the line setting mc.cores
to set the
number of cores used (this is commented out as the sampling in the example is
fast and to avoid possible problems when building the vignette along the package
installation in special environments such as computing clusters).
library(projpred)
library(rstanarm)
library(tidyr)
library(dplyr)
library(ggplot2)
library(bayesplot)
theme_set(theme_classic())
#options(mc.cores = 4)
For the first Gaussian example we borrow the popular orthodont
dataset from
the nlme
package. This data include observations from a growth curve on an
orthodontic measurement. It contains 108 observations with 4 variables. The
variables contained in this dataset are:
data("Orthodont", package = "nlme")
We first build the following complete rstanarm
model including all variables:
fit <- stan_glmer(distance ~ age * Sex + (age | Subject),
chains = 2, data = Orthodont, seed = 1)
Since we have repeated measurements over subjects at multiple time points, the
effect of age
is allowed to vary over subjects. To run projpred
on this
model we can first run a simple variable selection with the full dataset. This
approach may be overconfident in some cases and therefore in general cases it is
recommended to run a cross validated variable selection instead.
ref <- get_refmodel(fit)
vs <- varsel(ref)
Because our model includes group effects, the only available search method is forward search, which is a bit slower than L_1 search but tends to find smaller models. Beware that the underlying projection fitting method could return some warnings in cases where the optimization may not have converged properly. This is more common in the case of generalized models (non-Gaussian families). We print the list of the variables ordered by relevance:
solution_terms(vs) # selection order of the variables
We plot some statistics computed on the training data, such as the sum of log
predictive densities (ELPD) and root mean squared error (RMSE) as the function
of number of variables added. By default, the statistics are shown on absolute
scale, but with deltas = TRUE
the plot shows results relative to the full
model.
# plot predictive performance on training data
plot(vs, stats = c('elpd', 'rmse'))
From this plot, it is clearly visible that the first two terms are probaly
enough to achieve predictive performance comparable to the reference model. We
perform the projection for a submodel of desired size using the function
project
. The projection can also be coerced to a matrix with draws of the
selected variables and sigma. The draws can be visualized with, for example, the
mcmc_areas
function in the bayesplot
package. Below we compare how the
projection affects the three most relevant variables.
# Visualise the projected three most relevant variables
proj <- project(vs, nterms = 2, ns = 500)
mcmc_areas(as.matrix(proj), pars = solution_terms(vs)[1:2])
We make predictions with the projected submodels. For point estimates we can use
method proj_linpred
. Test inputs can be provided using the keyword newdata
.
If also the test targets ynew
are provided, then the function evaluates the
log predictive density at these points. For instance, the following computes the
mean of the predictive distribution and evaluates the log density at the
training points using the 5 most relevant variables.
pred <- proj_linpred(vs, newdata = Orthodont, nterms = 5, integrated = TRUE)
Visualize the predictions
ggplot() +
geom_point(aes(x = pred$pred, y = Orthodont$distance)) +
geom_abline(slope = 1, color = "red") +
labs(x = "prediction", y = "y")
We also obtain draws from the projected predictive distribution. Here's an example prediction for the first data point using 5 terms (the observed value is marked by the red line).
subset <- Orthodont %>% as_tibble() %>% dplyr::sample_n(1)
y_subset <- subset %>% dplyr::select(distance) %>% as.data.frame()
y1_rep <- proj_predict(vs, newdata = subset, nterms = 5, seed = 7560)
qplot(as.vector(y1_rep), bins = 25) +
geom_vline(xintercept = as.numeric(y_subset), color = "red") +
xlab("y1_rep")
As previously, we first load the data,
data_pois <- read.table("data_pois.csv", header = TRUE)
These data correspond to a phylogenetic dataset where the phenotype is expressed as counts. This kind of data is relevant in evolutionary biology when data of many species are analyzed at the same time. The model we fit here is borrowed from Modern Phylogenetic Comparative Methods and the application in Evolutionary Biology (de Villemeruil & Nakagawa, 2014). The necessary data can also be downloaded from the corresponding website (http://www.mpcm-evolution.com/). The specific formula is taken from the Estimating Phylogenetic Multilevel Models with brms vignette of the brms package (https://cran.r-project.org/package=brms/vignettes/brms_phylogenetics.html).
fit <- stan_glmer(
phen_pois ~ cofactor + (1 | phylo) + (1 | obs), data = data_pois,
family = poisson("log"), chains = 2, iter = 2000,
control = list(adapt_delta = 0.95)
)
As we did in the previous example, we can perform variable selection and look at the projection.
vs <- varsel(fit)
Because our model includes group effects, the only available search method is forward search, which is a bit slower than L_1 search but tends to find smaller models. Beware that the underlying projection fitting method could return some warnings in cases where the optimization may not have converged properly. This is more common in the case of generalized models (non-Gaussian families). We print the list of the variables ordered by relevance:
solution_terms(vs) # selection order of the variables
We plot some statistics computed on the training data, such as the sum of log
predictive densities (ELPD) and root mean squared error (RMSE) as the function
of number of variables added. By default, the statistics are shown on absolute
scale, but with deltas = TRUE
the plot shows results relative to the full
model.
# plot predictive performance on training data
plot(vs, stats = c('elpd', 'rmse'))
We perform the projection for a submodel of desired size using the function
project
. The projection can also be coerced to a matrix with draws of the
selected variables and sigma. The draws can be visualized with, for example, the
mcmc_areas
function in the bayesplot
package. Below we compare how the
projection affects the most relevant variables.
# Visualise the projected two most relevant variables
proj <- project(vs, nterms = 2, ndraws = 10)
mcmc_areas(as.matrix(proj), pars = solution_terms(vs)[1:2])
As we did in the Gaussian example, we make predictions with the projected submodels. The following computes the mean of the predictive distribution and evaluates the log density at the training points using the previously projected model.
pred <- proj_linpred(proj, newdata = data_pois, integrated = TRUE)
We can also visualize the corresponding projected predictions.
xaxis <- seq(-3, 3, length.out = 1000)
y_mu <- rowMeans(vs$refmodel$mu)
ggplot() +
geom_point(aes(x = pred$pred, y = y_mu)) +
geom_line(aes(x=xaxis, y = exp(xaxis)), color = "red") +
labs(x = "prediction", y = "y")
For both of the previous cases, running variable selection with the full data was actually enough because they were pretty simple. In this section we work on a slightly more difficult data set.
We can load the data from the popular lme4
package:
data("VerbAgg", package = "lme4")
The whole dataset consists of 7584 questionnaire answers from different
subjects. Given that the full data could take a long time to fit, we will
subsample 50 individuals (for which sampling still takes a bit). For this, we
use the great tidyverse
environment.
## subsample 50 participants
VerbAgg_subsample <- VerbAgg %>%
tidyr::as_tibble() %>%
dplyr::filter(id %in% sample(id, 50)) %>%
dplyr::mutate(r2num = as.integer(r2) - 1) # binomial family needs numeric target
For this simple model we will add some group effects that we know are not quite relevant.
## simple bernoulli model
formula_va <- r2num ~ btype + situ + mode + (btype + situ + mode | id)
fit_va <- stan_glmer(
formula = formula_va,
data = VerbAgg_subsample,
family = binomial("logit"),
seed = 1234,
chains = 2
)
As we did before, we can run the standard variable selection with
vs_va <- varsel(fit_va)
solution_terms(vs_va)
plot(vs_va, stats = c("elpd", "acc"))
Even though the ordering of the variables makes sense, the performance of the projected models does not quite match the reference model. There may be different reasons that can explain this behaviour:
varsel
with the default ndraws_pred
could be not enough. Increasing
ndraws_pred
helps but also increases the computational cost. Re fitting the
reference model ensuring a narrower posterior (usually employing a stronger
sparsifying prior) would have a similar effect. This is usually the case when
the difference in predictive performance is not very large.ndraws_pred
should also help.varsel
computes the expected predictive performance of the
reference model from the in-sample mean, it may sometimes result in
overconfident results. Running cv_varsel
with LOO cross-validation computes
the predictive performance by taking PSIS-LOO weights into account, usually
reporting a more realistic ELPD.By default, cv_varsel
will run LOO cross-validation, and will try to run a
full variable selection for every data point. Given the large computational cost
of running more draws, we'll first try running cv_varsel
without validating
the search. We can omit this behaviour by fitting the model once and computing
the LOO elpd posterior from it by passing the argument validate_search =
FALSE
. Note that in general it is recommended to run the full validation
search.
cv_vs_va <- cv_varsel(fit_va, validate_search = FALSE)
plot(cv_vs_va, stats = c("elpd", "acc"))
Now we see that the projected models behave as expected and can proceed with the usual analysis.
As in previous examples, we can show the mean prediction with proj_linpred
.
The following computes the mean of the predictive distribution and evaluates the
log density at the training points using the 6 most relevant variables.
pred <- proj_linpred(cv_vs_va, newdata = VerbAgg_subsample,
nterms = 6, integrated = TRUE, ndraws = 10)
We can also visualize the corresponding projected predictions.
xaxis <- seq(-6, 6, length.out = 1000)
yaxis <- cv_vs_va$family$linkinv(xaxis)
y_mu <- rowMeans(cv_vs_va$refmodel$mu)
ggplot() +
geom_point(aes(x = pred$pred, y = y_mu)) +
geom_line(aes(x = xaxis, y = yaxis), color = "red") +
labs(x = "prediction", y = "y")