Overview

Piironen and Vehtari (Statistics and Computing 27:711-735; 2017) present a very comprehensive overview of methods for model selection in a Bayesian context. I encourage you to read the whole thing, but this notebook is going to introduce only the best performing of the methods they review: projection predictive feature selection. Let’s unpack what that means, starting with “feature selection.”

Feature selection is just another name for “variable selection.” What’s different here from what we saw with horseshoe priors is that rather than simply looking at the posterior estimates and saying to ourselves “These variables (features) seem important and those don’t,” projection predictive feature selection provides a statistical criterion to help us identify the important variables (features).

What about the “projection predictive” part? Well, that’s a bit more complicated. You’ll need to read Piironen and Vehtari to get all of the details, but here it is in a nutshell.

We start with a full model that (we think) includes all of the covariates that could be relevant in predicting the response. Now imagine that we examine reduced models that don’t include all of the covariates. Instead of fitting those reduced models to the data and comparing how well they predict the data, we fit them to data predicted from the full model, i.e., its posterior predictions, and find the reduced model that does the “best job” of matching those posterior predictions. By “best job” we mean that we pick the model with the smallest number of covariates we can while keeping the predictive performance of the reduced model about the same as that of the full model.
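
Here’s a toy sketch of that idea (my illustration, not the projpred machinery, which projects each posterior draw separately): for a Gaussian linear model, projecting the full model onto a submodel amounts to regressing the full model’s fitted values, rather than the observed y, on the submodel’s covariates.

set.seed(42)
x1 <- rnorm(50); x2 <- rnorm(50); x3 <- rnorm(50)
y <- 1 + x1 - x2 + rnorm(50)
## a stand-in for the full model and its predictions of the data
fit_full <- lm(y ~ x1 + x2 + x3)
mu_full <- fitted(fit_full)
## the "projected" submodel is fit to the full model's predictions,
## not to the observed y
fit_sub <- lm(mu_full ~ x1 + x2)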

While it may seem strange to pick covariates based on how well models perform on predicted rather than observed data, Piironen and Vehtari show that this approach works very well when compared with other alternatives that have been suggested. And they’ve put together a very nice package, projpred, that makes it easy to use the approach.

Setting up the data

As usual, we have to start by generating the data.

library(tidyverse)
library(reshape2)
library(ggplot2)
library(cowplot)
library(mvtnorm)
library(corrplot)

rm(list = ls())
## intercept
##
beta0 <- 1.0
## regression coefficients
##
beta <- c(1.0, -1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
## pattern of the correlation matrix: all non-zero entries are set to the
## same correlation; the covariance matrix is calculated from individual
## variances and a single association parameter governing the non-zero
## correlation coefficients
##
## Note: Not just any pattern will work here. The correlation matrix and
## covariance matrix generated from this pattern must be positive definite.
## If you change this pattern, you may get an error when you try to generate
## data with a non-zero association parameter.
##
Rho <- matrix(nrow = 9, ncol = 9, byrow = TRUE, 
              data = c(1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1
                       ))
## vector of standard deviations for covariates
##
sigma <- rep(1, 9)

## construct a covariance matrix from the pattern, standard deviations, and
## one parameter in [-1,1] that governs the magnitude of non-zero correlation
## coefficients
##
## Rho - the pattern of associations
## sigma - the vector of standard deviations
## rho - the association parameter
##
construct_Sigma <- function(Rho, sigma, rho) {
  ## get the correlation matrix
  ##
  Rho <- Rho*rho
  for (i in 1:ncol(Rho)) {
    Rho[i,i] <- 1.0
  }
  ## notice the use of matrix multiplication
  ##
  Sigma <- diag(sigma) %*% Rho %*% diag(sigma)
  return(Sigma)
}
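
As a quick sanity check (my addition), with all standard deviations equal to 1 the covariance matrix equals the correlation matrix, so every non-zero off-diagonal entry should equal the association parameter:

## non-zero off-diagonal entries should equal rho (here 0.5) and the
## diagonal should equal sigma^2 (here 1)
construct_Sigma(Rho, sigma, 0.5)[1:3, 1:3]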

## set the random number seed manually so that every run of the code will 
## produce the same numbers
##
set.seed(1234)

n_samp <- 100
cov_str <- rmvnorm(n_samp,
                   mean = rep(0, nrow(Rho)),
                   sigma = construct_Sigma(Rho, sigma, 0.8))

## residual standard deviation (the same for every observation)
##
resid <- rep(2.0, n_samp)

y_str <- rnorm(nrow(cov_str), mean = beta0 + cov_str %*% beta, sd = resid)
dat_1 <- data.frame(y_str, cov_str, rep("Strong", length(y_str)))

## generate a second, independent data set under the same "Strong" scenario
##
cov_str <- rmvnorm(n_samp,
                   mean = rep(0, nrow(Rho)),
                   sigma = construct_Sigma(Rho, sigma, 0.8))
y_str <- rnorm(nrow(cov_str), mean = beta0 + cov_str %*% beta, sd = resid)
dat_2 <- data.frame(y_str, cov_str, rep("Strong", length(y_str)))

column_names <- c("y", paste("x", seq(1, length(beta)), sep = ""), "Scenario")
colnames(dat_1) <- column_names
colnames(dat_2) <- column_names

## saving the results of scale() keeps the centering and scaling attributes,
## which lets me use them later for prediction with new data
##
scale_1 <- lapply(dat_1[, 1:10], scale)
scale_2 <- lapply(dat_2[, 1:10], scale)

## when the scaled values are assigned back into the data frame, the
## scaling attributes are lost
##
dat_1[, 1:10] <- lapply(dat_1[, 1:10], scale)
dat_2[, 1:10] <- lapply(dat_2[, 1:10], scale)
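
For example (another small addition), the saved scale() results retain the centering and scaling attributes that are lost from the data frame columns:

## centering and scaling used for x1 in the first data set; these can be
## applied to new observations before prediction
attr(scale_1$x1, "scaled:center")
attr(scale_1$x1, "scaled:scale")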

Trying projection

Since we’ll be running the projection predictive analysis on both data sets, I’ve written a small function, projection_prediction(), that will run everything for us and collect the results. Notice that we start by fitting the full model with horseshoe priors.

library(rstanarm)
library(projpred)
library(bayesplot)

options(mc.cores = parallel::detectCores())

projection_prediction <- function(dat) {
  n <- nrow(dat)
  D <- ncol(dat[,2:10])
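  ## prior guess at the number of relevant covariates (p0) and the
  ## corresponding global scale for the horseshoe prior, following
  ## Piironen and Vehtari's recommendation tau0 = p0/(D - p0) * 1/sqrt(n)
  ##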
  p0 <- 3
  tau0 <- p0/(D - p0) * 1/sqrt(n)
  prior_coeff <- hs(global_scale = tau0, slab_scale = 1)
  fit <- stan_glm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9, data = dat, 
                  prior = prior_coeff, refresh = 0)

  ## forward search through the covariates, first without and then with
  ## cross-validation (cv_varsel() is slower, but its estimates of
  ## predictive performance are less optimistic)
  ##
  vs <- varsel(fit, method = "forward")
  cvs <- cv_varsel(fit, method = "forward", verbose = FALSE)

  ## project the full posterior onto the submodel of the suggested size;
  ## print() is needed for the plot to display from inside a function
  ##
  proj_size <- suggest_size(cvs)
  proj <- project(vs, nv = proj_size, ns = 2000)
  print(mcmc_intervals(as.matrix(proj),
                       pars = c("(Intercept)", names(vs$vind[1:proj_size]),
                                "sigma")))

  pred <- proj_linpred(vs, xnew = dat[, 2:10], ynew = dat$y, 
                       nv = proj_size, integrated = TRUE)
  for.plot <- data.frame(Observed = dat$y,
                         Predicted = pred$pred)
  p <- ggplot(for.plot, aes(x = Observed, y = Predicted)) + 
    geom_point() + 
    geom_smooth(method = "lm") +
    geom_abline(slope = 1, intercept = 0) +
    labs(x = "Observed", y = "Predicted")
  print(p)

  return(list(proj = proj, vs = vs, cvs = cvs, proj_size = proj_size))
}

summarize_posterior <- function(x, credible = 0.95, digits = 3) {
  lo_p <- (1.0 - credible)/2.0
  hi_p <- credible + lo_p
  ci <- quantile(x, c(lo_p, hi_p))
  results <- sprintf("% 5.3f (% 5.3f, % 5.3f)\n", mean(x), ci[1], ci[2])
  cat(results, sep = "")
}

summarize_results <- function(x, credible = 0.95, digits = 3) {
  vars <- colnames(as.matrix(x))
  for (i in 1:length(vars)) {
    label <- sprintf("%11s: ", vars[i])
    cat(label, sep = "")
    summarize_posterior(as.matrix(x)[,i])
  }
}
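
As a quick check (my addition) that summarize_posterior() behaves as expected, draws from a standard normal should give a mean near 0 and a 95% interval near (-1.96, 1.96):

summarize_posterior(rnorm(10000))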

proj_1 <- projection_prediction(dat_1)

proj_2 <- projection_prediction(dat_2)

summarize_results(proj_1$proj)
(Intercept):  0.003 (-0.157,  0.154)
         x3:  0.346 ( 0.075,  0.595)
         x1:  0.252 ( 0.010,  0.521)
      sigma:  0.833 ( 0.727,  0.973)
summarize_results(proj_2$proj)
(Intercept): -0.001 (-0.148,  0.149)
         x3:  0.578 ( 0.421,  0.726)
         x6: -0.401 (-0.547, -0.256)
      sigma:  0.759 ( 0.655,  0.884)

Here we have results that are quite different from one another. Not only is the coefficient on x3 quite different in the two data sets,1 but the variables identified as important differ too: x3 and x1 in the analysis of the first data set, and x3 and x6 in the second. That isn’t quite as disturbing as it might first appear. Look back at the results from our regression with horseshoe priors on all of the data. You’ll see that the coefficients for x1 and x3 had greater magnitudes than those for any other covariates in the analysis of the first data set. Similarly, x3 and x6 were the most important in the analysis of the second data set.

Picking the number of covariates

In the code above I used suggest_size() to pick the number of covariates. If you read the documentation for suggest_size()2 you’ll notice this sentence in the Description:

It is recommended that the user studies the results via varsel_plot and/or varsel_stats and makes the final decision based on what is most appropriate for the given problem.

Let’s take a look at the results from varsel_plot() first.

print(varsel_plot(proj_1$cvs, stats = c("elpd", "rmse"), deltas = TRUE))
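
If you’d rather see numbers than a plot, the documentation’s other suggestion, varsel_stats(), tabulates the same statistics by model size (assuming the same projpred version used above; newer releases reorganized these functions):

varsel_stats(proj_1$cvs, stats = c("elpd", "rmse"), deltas = TRUE)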