This vignette shows how to use `projpred`

to perform variable selection in the
context of Generalized Linear Multilevel Models (GLMMs). For a general overview
of the package we refer the reader to the quickstart vignette. The method used
here is described in the latest preprint (by Catalina et al. (2020), arxiv
link).

Load the necessary packages. If the sampling takes more than 30 seconds and
multiple cores are available, uncomment the line setting `mc.cores`

to set the
number of cores used (this is commented out as the sampling in the example is
fast and to avoid possible problems when building the vignette along the package
installation in special environments such as computing clusters).

```
library(projpred)
library(rstanarm)
library(tidyr)
library(dplyr)
library(ggplot2)
library(bayesplot)
theme_set(theme_classic())
#options(mc.cores = 4)
```

For the first Gaussian example we borrow the popular `orthodont`

dataset from
the `nlme`

package. This data include observations from a growth curve on an
orthodontic measurement. It contains 108 observations with 4 variables. The
variables contained in this dataset are:

- distance, numeric distances from the pituitary to the pterygomaxillary fissure (mm). These are measured on x-ray images of the skull.
- age, the age of the subject (years).
- Subject, an ordered factor indicating the subject on which the measurement is taken.
- Sex, indicator of sex in the subject.

```
data("Orthodont", package = "nlme")
```

We first build the following complete `rstanarm`

model including all variables:

```
fit <- stan_glmer(distance ~ age * Sex + (age | Subject),
chains = 2, data = Orthodont, seed = 1)
```

Since we have repeated measurements over subjects at multiple time points, the
effect of `age`

is allowed to vary over subjects. To run `projpred`

on this
model we can first run a simple variable selection with the full dataset. This
approach may be overconfident in some cases and therefore in general cases it is
recommended to run a cross validated variable selection instead.

```
ref <- get_refmodel(fit)
vs <- varsel(ref)
```

Because our model includes group effects, the only available search method is forward search, which is a bit slower than L_1 search but tends to find smaller models. Beware that the underlying projection fitting method could return some warnings in cases where the optimization may not have converged properly. This is more common in the case of generalized models (non-Gaussian families). We print the list of the variables ordered by relevance:

```
solution_terms(vs) # selection order of the variables
```

We plot some statistics computed on the training data, such as the sum of log
predictive densities (ELPD) and root mean squared error (RMSE) as the function
of number of variables added. By default, the statistics are shown on absolute
scale, but with `deltas = TRUE`

the plot shows results relative to the full
model.

```
# plot predictive performance on training data
plot(vs, stats = c('elpd', 'rmse'))
```

From this plot, it is clearly visible that the first two terms are probaly
enough to achieve predictive performance comparable to the reference model. We
perform the projection for a submodel of desired size using the function
`project`

. The projection can also be coerced to a matrix with draws of the
selected variables and sigma. The draws can be visualized with, for example, the
`mcmc_areas`

function in the `bayesplot`

package. Below we compare how the
projection affects the three most relevant variables.

```
# Visualise the projected three most relevant variables
proj <- project(vs, nterms = 2, ns = 500)
mcmc_areas(as.matrix(proj), pars = solution_terms(vs)[1:2])
```

We make predictions with the projected submodels. For point estimates we can use
method `proj_linpred`

. Test inputs can be provided using the keyword `newdata`

.
If also the test targets `ynew`

are provided, then the function evaluates the
log predictive density at these points. For instance, the following computes the
mean of the predictive distribution and evaluates the log density at the
training points using the 5 most relevant variables.

```
pred <- proj_linpred(vs, newdata = Orthodont, nterms = 5, integrated = TRUE)
```

Visualize the predictions

```
ggplot() +
geom_point(aes(x = pred$pred, y = Orthodont$distance)) +
geom_abline(slope = 1, color = "red") +
labs(x = "prediction", y = "y")
```

We also obtain draws from the projected predictive distribution. Here's an example prediction for the first data point using 5 terms (the observed value is marked by the red line).

```
subset <- Orthodont %>% as_tibble() %>% dplyr::sample_n(1)
y_subset <- subset %>% dplyr::select(distance) %>% as.data.frame()
y1_rep <- proj_predict(vs, newdata = subset, nterms = 5, seed = 7560)
qplot(as.vector(y1_rep), bins = 25) +
geom_vline(xintercept = as.numeric(y_subset), color = "red") +
xlab("y1_rep")
```

As previously, we first load the data,

```
data_pois <- read.table("data_pois.csv", header = TRUE)
```

These data correspond to a phylogenetic dataset where the phenotype is expressed
as counts. This kind of data is relevant in evolutionary biology when data of
many species are analyzed at the same time. The model we fit here is borrowed
from *Modern Phylogenetic Comparative Methods and the application in
Evolutionary Biology* (de Villemeruil & Nakagawa, 2014). The necessary data can
also be downloaded from the corresponding website
(http://www.mpcm-evolution.com/). The specific formula is taken from the
*Estimating Phylogenetic Multilevel Models with brms* vignette of the brms
package
(https://cran.r-project.org/package=brms/vignettes/brms_phylogenetics.html).

```
fit <- stan_glmer(
phen_pois ~ cofactor + (1 | phylo) + (1 | obs), data = data_pois,
family = poisson("log"), chains = 2, iter = 2000,
control = list(adapt_delta = 0.95)
)
```

As we did in the previous example, we can perform variable selection and look at the projection.

```
vs <- varsel(fit)
```

Because our model includes group effects, the only available search method is forward search, which is a bit slower than L_1 search but tends to find smaller models. Beware that the underlying projection fitting method could return some warnings in cases where the optimization may not have converged properly. This is more common in the case of generalized models (non-Gaussian families). We print the list of the variables ordered by relevance:

```
solution_terms(vs) # selection order of the variables
```

We plot some statistics computed on the training data, such as the sum of log
predictive densities (ELPD) and root mean squared error (RMSE) as the function
of number of variables added. By default, the statistics are shown on absolute
scale, but with `deltas = TRUE`

the plot shows results relative to the full
model.

```
# plot predictive performance on training data
plot(vs, stats = c('elpd', 'rmse'))
```

We perform the projection for a submodel of desired size using the function
`project`

. The projection can also be coerced to a matrix with draws of the
selected variables and sigma. The draws can be visualized with, for example, the
`mcmc_areas`

function in the `bayesplot`

package. Below we compare how the
projection affects the most relevant variables.

```
# Visualise the projected two most relevant variables
proj <- project(vs, nterms = 2, ndraws = 10)
mcmc_areas(as.matrix(proj), pars = solution_terms(vs)[1:2])
```

As we did in the Gaussian example, we make predictions with the projected submodels. The following computes the mean of the predictive distribution and evaluates the log density at the training points using the previously projected model.

```
pred <- proj_linpred(proj, newdata = data_pois, integrated = TRUE)
```

We can also visualize the corresponding projected predictions.

```
xaxis <- seq(-3, 3, length.out = 1000)
y_mu <- rowMeans(vs$refmodel$mu)
ggplot() +
geom_point(aes(x = pred$pred, y = y_mu)) +
geom_line(aes(x=xaxis, y = exp(xaxis)), color = "red") +
labs(x = "prediction", y = "y")
```

For both of the previous cases, running variable selection with the full data was actually enough because they were pretty simple. In this section we work on a slightly more difficult data set.

We can load the data from the popular `lme4`

package:

```
data("VerbAgg", package = "lme4")
```

The whole dataset consists of 7584 questionnaire answers from different
subjects. Given that the full data could take a long time to fit, we will
subsample 50 individuals (for which sampling still takes a bit). For this, we
use the great `tidyverse`

environment.

```
## subsample 50 participants
VerbAgg_subsample <- VerbAgg %>%
tidyr::as_tibble() %>%
dplyr::filter(id %in% sample(id, 50)) %>%
dplyr::mutate(r2num = as.integer(r2) - 1) # binomial family needs numeric target
```

For this simple model we will add some group effects that we know are not quite relevant.

```
## simple bernoulli model
formula_va <- r2num ~ btype + situ + mode + (btype + situ + mode | id)
fit_va <- stan_glmer(
formula = formula_va,
data = VerbAgg_subsample,
family = binomial("logit"),
seed = 1234,
chains = 2
)
```

As we did before, we can run the standard variable selection with

```
vs_va <- varsel(fit_va)
```

```
solution_terms(vs_va)
```

```
plot(vs_va, stats = c("elpd", "acc"))
```

Even though the ordering of the variables makes sense, the performance of the projected models does not quite match the reference model. There may be different reasons that can explain this behaviour:

- The reference model posterior may not be very narrow and then running
`varsel`

with the default`ndraws_pred`

could be not enough. Increasing`ndraws_pred`

helps but also increases the computational cost. Re fitting the reference model ensuring a narrower posterior (usually employing a stronger sparsifying prior) would have a similar effect. This is usually the case when the difference in predictive performance is not very large. - For non-Gaussian models the source of the inaccuracy may come from the fact
that we are running an approximate projection. In this case, increasing the
number of draws for the prediction
`ndraws_pred`

should also help. - Given that
`varsel`

computes the expected predictive performance of the reference model from the in-sample mean, it may sometimes result in overconfident results. Running`cv_varsel`

with LOO cross-validation computes the predictive performance by taking PSIS-LOO weights into account, usually reporting a more realistic ELPD.

By default, `cv_varsel`

will run LOO cross-validation, and will try to run a
full variable selection for every data point. Given the large computational cost
of running more draws, we'll first try running `cv_varsel`

without validating
the search. We can omit this behaviour by fitting the model once and computing
the LOO elpd posterior from it by passing the argument ```
validate_search =
FALSE
```

. Note that in general it is recommended to run the full validation
search.

```
cv_vs_va <- cv_varsel(fit_va, validate_search = FALSE)
```

```
plot(cv_vs_va, stats = c("elpd", "acc"))
```

Now we see that the projected models behave as expected and can proceed with the usual analysis.

As in previous examples, we can show the mean prediction with `proj_linpred`

.
The following computes the mean of the predictive distribution and evaluates the
log density at the training points using the 6 most relevant variables.

```
pred <- proj_linpred(cv_vs_va, newdata = VerbAgg_subsample,
nterms = 6, integrated = TRUE, ndraws = 10)
```

We can also visualize the corresponding projected predictions.

```
xaxis <- seq(-6, 6, length.out = 1000)
yaxis <- cv_vs_va$family$linkinv(xaxis)
y_mu <- rowMeans(cv_vs_va$refmodel$mu)
ggplot() +
geom_point(aes(x = pred$pred, y = y_mu)) +
geom_line(aes(x = xaxis, y = yaxis), color = "red") +
labs(x = "prediction", y = "y")
```