Pironen and Vehtari (*Statistical Computing* 27:711-735; 2017) present a very comprehensive overview of methods for model selection in a Bayesian context. I encourage you to read the whole thing, but this notebook is going to introduce only the best performing of the methods they review: projection predictive feature selection. Let’s unpack what that means, starting with “feature selection.”

Feature selection is just another name for “variable selection.” What’s different here from what we saw with horseshoe priors is that rather than simply looking at the posterior estimates and saying to ourselves “These variables (features) seem important and thoee don’t”, projection predictive feature selection provides a statistical criterion to help us identify the important variables (features).

What about the “projection prediction” part? Well, that’s a bit more complicated. You’ll need to read Pironen and Vehtari to get all of the details, but here it is in a nutshell.

We start with a full model that (we think) includes all of the covariates that could be relevant in predicting the response. Now imagine that we examine reduced models that don’t include all of the covariates, * but* instead of fitting those reduced models to the data and comparing their performance in predicting the data we fit them to data

While it may seem strange to pick covariates based on how well models perform on * predicted* rather than

`projpred`

that makes it easy to use the approach.As usual, we have to start by generating the data.

```
library(tidyverse)
library(reshape2)
library(ggplot2)
library(cowplot)
library(mvtnorm)
library(corrplot)
rm(list = ls())
```

```
## intetcept
##
beta0 <- 1.0
## regression coefficients
##
beta <- c(1.0, -1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
## pattern of correlation matrix, all non-zero entries are set to saem
## correlation, covariance matrix caldulated from individual variances and a
## single association parameter governing the non-zero correlation coefficients
##
## Note: Not just any pattern will work here. The correlation matrix and
## covariance matrix generated from this pattern must be positive definite.
## If you change this pattern, you may get an error when you try to generate
## data with a non-zero association parameter.
##
Rho <- matrix(nrow = 9, ncol = , byrow = TRUE,
data = c(1,0,1,0,1,0,1,0,1,
0,1,0,1,0,1,0,1,0,
1,0,1,0,1,0,1,0,1,
0,1,0,1,0,1,0,1,0,
1,0,1,0,1,0,1,0,1,
0,1,0,1,0,1,0,1,0,
1,0,1,0,1,0,1,0,1,
0,1,0,1,0,1,0,1,0,
1,0,1,0,1,0,1,0,1
))
## vector of standard deviations for covariates
##
sigma <- rep(1, 9)
## construct a covariance matrix from the pattern, standard deviations, and
## one parameter in [-1,1] that governs the magnitude of non-zero correlation
## coefficients
##
## Rho - the pattern of associations
## sigma - the vector of standard deviations
## rho - the association parameter
##
construct_Sigma <- function(Rho, sigma, rho) {
## get the correlation matris
##
Rho <- Rho*rho
for (i in 1:ncol(Rho)) {
Rho[i,i] <- 1.0
}
## notice the use of matrix multiplication
##
Sigma <- diag(sigma) %*% Rho %*% diag(sigma)
return(Sigma)
}
## set the random number seed manually so that every run of the code will
## produce the same numbers
##
set.seed(1234)
n_samp <- 100
cov_str <- rmvnorm(n_samp,
mean = rep(0, nrow(Rho)),
sigma = construct_Sigma(Rho, sigma, 0.8))
resid <- rep(2.0, n_samp)
y_str <- rnorm(nrow(cov_str), mean = beta0 + cov_str %*% beta, sd = resid)
dat_1 <- data.frame(y_str, cov_str, rep("Strong", length(y_str)))
cov_str <- rmvnorm(n_samp,
mean = rep(0, nrow(Rho)),
sigma = construct_Sigma(Rho, sigma, 0.8))
y_str <- rnorm(nrow(cov_str), mean = beta0 + cov_str %*% beta, sd = resid)
dat_2 <- data.frame(y_str, cov_str, rep("Strong", length(y_str)))
column_names <- c("y", paste("x", seq(1, length(beta)), sep = ""), "Scenario")
colnames(dat_1) <- column_names
colnames(dat_2) <- column_names
## saving results in scale allows me to use them later for prediction with
## new data
##
scale_1 <- lapply(dat_1[, 1:10], scale)
scale_2 <- lapply(dat_2[, 1:10], scale)
## when assigning the same scaling to a data frame, the scaling attributes
## are lost
##
dat_1[, 1:10] <- lapply(dat_1[, 1:10], scale)
dat_2[, 1:10] <- lapply(dat_2[, 1:10], scale)
```

Since we’ll be running the projection predictive analysis on both data sets, I’ve written a small function, `projection_prediction()`

that will run everything for us and collect the results. Notice that we start by fitting the full model with horseshoe priors.

```
library(rstanarm)
library(projpred)
library(bayesplot)
options(mc.cores = parallel::detectCores())
projection_prediction <- function(dat) {
n <- nrow(dat)
D <- ncol(dat[,2:10])
p0 <- 3
tau0 <- p0/(D - p0) * 1/sqrt(n)
prior_coeff <- hs(global_scale = tau0, slab_scale = 1)
fit <- stan_glm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9, data = dat,
prior = prior_coeff, refresh = 0)
vs <- varsel(fit, method = "forward")
cvs <- cv_varsel(fit, method = "forward", verbose = FALSE)
proj_size <- suggest_size(cvs)
proj <- project(vs, nv = proj_size, ns = 2000)
mcmc_intervals(as.matrix(proj),
pars = c("(Intercept)", names(vs$vind[1:proj_size]), "sigma"))
pred <- proj_linpred(vs, xnew = dat[, 2:10], ynew = dat$y,
nv = proj_size, integrated = TRUE)
for.plot <- data.frame(Observed = dat$y,
Predicted = pred$pred)
p <- ggplot(for.plot, aes(x = Observed, y = Predicted)) +
geom_point() +
geom_smooth(method = "lm") +
geom_abline(slope = 1, intercept = 0) +
labs(x = "Observed", y = "Predicted")
print(p)
return(list(proj = proj, vs = vs, cvs = cvs, proj_size = proj_size))
}
summarize_posterior <- function(x, credible = 0.95, digits = 3) {
lo_p <- (1.0 - credible)/2.0
hi_p <- credible + lo_p
ci <- quantile(x, c(lo_p, hi_p))
results <- sprintf("% 5.3f (% 5.3f, % 5.3f)\n", mean(x), ci[1], ci[2])
cat(results, sep = "")
}
summarize_results <- function(x, credible = 0.95, digits = 3) {
vars <- colnames(as.matrix(x))
for (i in 1:length(vars)) {
label <- sprintf("%11s: ", vars[i])
cat(label, sep = "")
summarize_posterior(as.matrix(x)[,i])
}
}
proj_1 <- projection_prediction(dat_1)
```

`proj_2 <- projection_prediction(dat_2)`

`summarize_results(proj_1$proj)`

```
(Intercept): 0.003 (-0.157, 0.154)
x3: 0.346 ( 0.075, 0.595)
x1: 0.252 ( 0.010, 0.521)
sigma: 0.833 ( 0.727, 0.973)
```

`summarize_results(proj_2$proj)`

```
(Intercept): -0.001 (-0.148, 0.149)
x3: 0.578 ( 0.421, 0.726)
x6: -0.401 (-0.547, -0.256)
sigma: 0.759 ( 0.655, 0.884)
```

Here we have results that are quite different from one another. Not only is the coefficient on `x3`

quite different in the two data sets,^{1} but the variables identified as important are different: `x3`

and `x1`

in analysis of the first data set and `x3`

and `x6`

in the second data set. That isn’t quite as disturbing as it might first appear. Look back at the results from our regression with horseshoe priors on all of the data. You’ll see that `x1`

and `x3`

had a greater magnitude in analysis of the first data set than any other covariates. Similarly, `x3`

and `x6`

were the most important in analysis of the second data set.

In the code I ran above I used `suggest_size()`

to pick the number of covariates to use. If you read the documentation for `suggest_size()`

^{2} you’ll notice this sentence in the Description:

It is recommended that the user studies the results via

`varsel_plot`

and/or`varsel_stats`

and makes the final decision based on what is most appropriate for the given problem.

Let’s take a look at the results from `varsel_plot()`

first.

`print(varsel_plot(proj_1$cvs, stats = c("elpd", "rmse"), deltas = TRUE))`