It won’t come as a surprise to anyone who knows me that I have to try a Bayesian approach to variable selection. It also turns out that there’s an easy way to do it, since “horseshoe priors” are built into rstanarm. What’s a horeshoe prior you ask? Well you can see a few of the details in the next section, or you can just skip to the section where we use it in rstanarm if you find the math more confusing than helpful.
If you really want to know and understand horseshoe priors, you’ll need to read a paper by Juho Pironen and Aki Vehtari in the Electronic Journal of Statistics,1 but here’s a brief outline.
The likelihood in standard linear regression model looks like this
\[ \eqalign{ y_i &\sim& \mbox{N}(\mu_i, \sigma^2) \\ \mu_i &=& \beta_0 + \sum_j \beta_jx_{ij} } \]
To complete the Bayesian model, we need to specify priors for \(\sigma^2\) and the \(\beta\)s. The default choices in rstanarm are:2
\[ \eqalign{ \sigma &\sim& \mbox{Exp}(1) \\ \beta_0 &\sim& \mbox{N}(0, 10) \\ \beta_i &\sim& \mbox{N}(0, 2.5) \quad \mbox{for } i > 0 } \]
These are the choices we used here. The basic horsehoe prior affects only the last of these. Specifically,
\[ \eqalign{ \beta_i &\sim& \mbox{N}(0, \tau^2\lambda_i^2) \\ \lambda_i &\sim& \mbox{Cauchy}^+(0, 1) \quad , } \]
where \(\mbox{Cauchy}^+\) refers to a half-Cauchy distribution on the positive real line. A Cauchy distribution has very fat tails, so the tendency for any \(\beta_i\) is to be either close to zero, because \(\lambda_i\) is small, or well away from zero, because \(\lambda_i\) is large. \(\tau\) sets the total amount of influence the covariates have on the response, rather like \(\lambda\) in the Lasso.
As Pironen and Vehtari explain, however, there hasn’t been consensus on how to set or estimate \(\tau\). They introduce a new version of the horseshoe prior, the regularized horseshoe prior that looks like this
\[ \eqalign{ \beta_i &\sim& \mbox{N}(0, \tau^2\tilde\lambda_i^2) \\ \tilde\lambda_i &=& \frac{c^2\lambda_i^2}{c^2 + \tau^2\lambda_i^2} \\ \lambda_i &\sim& \mbox{Cauchy}^+(0, 1) \quad . } \]
Notice that if \(\tau^2\lambda_i^2 \ll c^2\), meaning that \(\beta_i\) is close to 0, then we have something very close to the original horseshoe prior. If on the other hand, \(\tau^2\lambda_i^2 \gg c^2\) then the prior on \(\beta_i\) is close to \(\mbox{N}(0, c^2)\). Now we just3 need a prior on \(c\). Read Pironen and Vehtari if you want all of the gory details. We’ll just use hs() as a prior in our model, setting the global_scale parameter according to their advice.
Once again we have to start by generating the data.
library(tidyverse)
library(reshape2)
library(ggplot2)
library(cowplot)
library(mvtnorm)
library(corrplot)
rm(list = ls())## intetcept
##
beta0 <- 1.0
## regression coefficients
##
beta <- c(1.0, -1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
## pattern of correlation matrix, all non-zero entries are set to saem
## correlation, covariance matrix caldulated from individual variances and a 
## single association parameter governing the non-zero correlation coefficients
##
## Note: Not just any pattern will work here. The correlation matrix and
## covariance matrix generated from this pattern must be positive definite.
## If you change this pattern, you may get an error when you try to generate
## data with a non-zero association parameter.
##
Rho <- matrix(nrow = 9, ncol = , byrow = TRUE, 
              data = c(1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1,
                       0,1,0,1,0,1,0,1,0,
                       1,0,1,0,1,0,1,0,1
                       ))
## vector of standard deviations for covariates
##
sigma <- rep(1, 9)
## construct a covariance matrix from the pattern, standard deviations, and
## one parameter in [-1,1] that governs the magnitude of non-zero correlation
## coefficients
##
## Rho - the pattern of associations
## sigma - the vector of standard deviations
## rho - the association parameter
##
construct_Sigma <- function(Rho, sigma, rho) {
  ## get the correlation matris
  ##
  Rho <- Rho*rho
  for (i in 1:ncol(Rho)) {
    Rho[i,i] <- 1.0
  }
  ## notice the use of matrix multiplication
  ##
  Sigma <- diag(sigma) %*% Rho %*% diag(sigma)
  return(Sigma)
}
## set the random number seed manually so that every run of the code will 
## produce the same numbers
##
set.seed(1234)
n_samp <- 100
cov_str <- rmvnorm(n_samp,
                   mean = rep(0, nrow(Rho)),
                   sigma = construct_Sigma(Rho, sigma, 0.8))
resid <- rep(2.0, n_samp)
y_str <- rnorm(nrow(cov_str), mean = beta0 + cov_str %*% beta, sd = resid)
dat_1 <- data.frame(y_str, cov_str, rep("Strong", length(y_str)))
cov_str <- rmvnorm(n_samp,
                   mean = rep(0, nrow(Rho)),
                   sigma = construct_Sigma(Rho, sigma, 0.8))
y_str <- rnorm(nrow(cov_str), mean = beta0 + cov_str %*% beta, sd = resid)
dat_2 <- data.frame(y_str, cov_str, rep("Strong", length(y_str)))
column_names <- c("y", paste("x", seq(1, length(beta)), sep = ""), "Scenario")
colnames(dat_1) <- column_names
colnames(dat_2) <- column_names
## saving results in scale allows me to use them later for prediction with
## new data
##
scale_1 <- lapply(dat_1[, 1:10], scale)
scale_2 <- lapply(dat_2[, 1:10], scale)
## when assigning the same scaling to a data frame, the scaling attributes
## are lost
##
dat_1[, 1:10] <- lapply(dat_1[, 1:10], scale)
dat_2[, 1:10] <- lapply(dat_2[, 1:10], scale)Now that we have the data, let’s try the horshoe prior.
library(rstanarm)
options(mc.cores = parallel::detectCores())
set.seed(1234)
## n is the number of observations
## D is the number of covariates
## p0 is the expected number of important covariates
##
n <- nrow(dat_1)
D <- ncol(dat_1[,2:10])
p0 <- 3
tau0 <- p0/(D - p0) * 1/sqrt(n)
prior_coeff <- hs(global_scale = tau0, slab_scale = 1)
fit_1 <- stan_glm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9,
                  data = dat_1,
                  prior = prior_coeff,
                  refresh = 0,
                  adapt_delta = 0.999)
## Note: I use the same prior scale here because both data sets have the same number of observations
## and the same expected number of important covariates
##
fit_2 <- stan_glm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9,
                  data = dat_2,
                  prior = prior_coeff,
                  refresh = 0,
                  adapt_delta = 0.999)
predict_1 <- posterior_predict(fit_1)
predict_2 <- posterior_predict(fit_2)
for.plot <- data.frame(Observed = dat_1$y, Predicted = apply(predict_1, 2, mean))
p <- ggplot(for.plot, aes(x = Observed, y = Predicted)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  geom_smooth(method = "lm") +
  ggtitle("Data set 1")
print(p)
for.plot <- data.frame(Observed = dat_2$y, Predicted = apply(predict_2, 2, mean))
p <- ggplot(for.plot, aes(x = Observed, y = Predicted)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  geom_smooth(method = "lm") +
  ggtitle("Data set 2")
print(p)
cat("Results from data set 1\n")Results from data set 1summary(fit_1, digits = 3)
Model Info:
 function:     stan_glm
 family:       gaussian [identity]
 formula:      y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9
 algorithm:    sampling
 priors:       see help('prior_summary')
 sample:       4000 (posterior sample size)
 observations: 100
 predictors:   10
Estimates:
                mean     sd       2.5%     25%      50%      75%      97.5% 
(Intercept)      0.002    0.081   -0.159   -0.052    0.003    0.054    0.166
x1               0.202    0.153   -0.022    0.067    0.197    0.312    0.508
x2              -0.051    0.086   -0.267   -0.092   -0.023    0.002    0.073
x3               0.303    0.159    0.000    0.194    0.314    0.417    0.595
x4              -0.039    0.078   -0.234   -0.073   -0.017    0.003    0.086
x5               0.017    0.077   -0.131   -0.015    0.004    0.043    0.204
x6              -0.043    0.081   -0.248   -0.078   -0.018    0.003    0.083
x7              -0.006    0.075   -0.187   -0.027   -0.001    0.022    0.148
x8              -0.027    0.079   -0.223   -0.058   -0.009    0.008    0.119
x9               0.117    0.155   -0.066    0.002    0.060    0.202    0.514
sigma            0.808    0.060    0.700    0.766    0.804    0.845    0.937
mean_PPD         0.000    0.113   -0.229   -0.075    0.001    0.075    0.219
log-posterior -177.435    4.780 -187.876 -180.399 -177.095 -174.074 -169.109
Diagnostics:
              mcse  Rhat  n_eff
(Intercept)   0.001 1.000 5339 
x1            0.003 1.000 2125 
x2            0.002 1.000 2634 
x3            0.003 1.000 2334 
x4            0.001 1.000 3196 
x5            0.001 1.000 4265 
x6            0.001 1.000 2960 
x7            0.001 0.999 4140 
x8            0.001 1.000 3831 
x9            0.004 1.001 1807 
sigma         0.001 1.000 3766 
mean_PPD      0.002 1.000 4337 
log-posterior 0.137 1.001 1211 
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).cat("\n\nResults from data set 2\n")
Results from data set 2summary(fit_2, digits = 3)
Model Info:
 function:     stan_glm
 family:       gaussian [identity]
 formula:      y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9
 algorithm:    sampling
 priors:       see help('prior_summary')
 sample:       4000 (posterior sample size)
 observations: 100
 predictors:   10
Estimates:
                mean     sd       2.5%     25%      50%      75%      97.5% 
(Intercept)      0.000    0.074   -0.145   -0.050    0.001    0.050    0.147
x1               0.118    0.120   -0.037    0.016    0.091    0.198    0.392
x2              -0.172    0.164   -0.511   -0.298   -0.142   -0.022    0.046
x3               0.467    0.135    0.186    0.386    0.474    0.560    0.709
x4               0.005    0.078   -0.163   -0.025    0.001    0.033    0.180
x5               0.018    0.073   -0.119   -0.014    0.005    0.045    0.197
x6              -0.259    0.172   -0.570   -0.390   -0.272   -0.116    0.019
x7               0.009    0.070   -0.131   -0.018    0.002    0.035    0.171
x8               0.010    0.076   -0.139   -0.019    0.002    0.037    0.192
x9              -0.011    0.072   -0.190   -0.035   -0.002    0.019    0.131
sigma            0.740    0.056    0.641    0.700    0.735    0.777    0.860
mean_PPD         0.000    0.108   -0.211   -0.073   -0.001    0.072    0.214
log-posterior -169.089    4.782 -179.092 -172.021 -168.731 -165.702 -160.714
Diagnostics:
              mcse  Rhat  n_eff
(Intercept)   0.001 1.000 4646 
x1            0.002 1.000 2371 
x2            0.004 1.002 1576 
x3            0.002 0.999 3194 
x4            0.001 0.999 4414 
x5            0.001 1.000 4767 
x6            0.004 1.002 1531 
x7            0.001 1.000 4473 
x8            0.001 0.999 4565 
x9            0.001 1.000 4384 
sigma         0.001 0.999 3974 
mean_PPD      0.002 1.000 4471 
log-posterior 0.141 1.002 1158 
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).There are several things I like about using regularized horeshoe priors in rstanarm rather than the Lasso.
I get an assessment of how reliable estimates of the regression coefficients are in addition to a point estimate of what they are. In this case you can see that the 95% credible intervals overlap 0 for every coefficient in analysis of the first data set, while only the credible intervals for x3 don’t overlap 0 in analysis of the second data set. Notice, however, that some of the intervals are very close to not overlapping 0, i.e., the intervals for x1, x3, and x9 in the analysis of data set 1 and the intervals for x1, x2, and x6 in the analysis of data set 2.
I can also plot the estimates and their uncertainty very easily. Plotting the estimates and their uncertainty makes is much easier to pick out the covariates that seem to have an association with the response variable.4 Note: The outer intervals in these plots correspond to 90% credible intervals, not 95% credible intervals. You could change that by specifying prob_outer = 0.95 in the call to plot().
p <- plot(fit_1) + ggtitle("Estimates from data set 1")
print(p)p <- plot(fit_2) + ggtitle("Estimates from data set 2")
print(p)stan_glmer(). There is a package to fit the Lasso to a generalized mixed model with lmer(), but I haven’t tried it, and it isn’t built into lmer().If you look back at earlier installments in this series, you’ll see that the point estimates we got here aren’t too different from what we’ve seen before, e.g., 0.185 for x1 in data set 1 here vs. 0.239 in data set 1 from the Lasso. Using this Bayesian approach, however, we can see that even though x3 isn’t “significant” in the analysis of the first data set and is in the second,5 we don’t have good evidence that the estimates are different, because the posterior distributions are broadly overlapping as evidenced by the broadly overlapping credible intervals. In other words, if we pay appropriate attention to the uncertainty of our estimates, they’re pretty stable (at least across the two sample data sets we’ve been exploring).
What about out of sample predictions?
new_data <- data.frame(x1 = 4.0, x2 = 4.0, x3 = 4.0,
                       x4 = 4.0, x5 = 4.0, x6 = 4.0,
                       x7 = 4.0, x8 = 4.0, x9 = 4.0)
## re-scale the new data by subtracting mean and dividing by standard deviation
##
new_data_1 <- (new_data - lapply(scale_1[2:10], attr, "scaled:center")) /
               lapply(scale_1[2:10], attr, "scaled:scale")
new_data_2 <- (new_data - lapply(scale_2[2:10], attr, "scaled:center")) /
               lapply(scale_2[2:10], attr, "scaled:scale")
  
predict_1 <- posterior_predict(fit_1, new_data_1)
predict_2 <- posterior_predict(fit_2, new_data_2)
summarize_posterior <- function(x, credible = 0.95, digits = 3) {
  lo_p <- (1.0 - credible)/2.0
  hi_p <- credible + lo_p
  ci <- quantile(x, c(lo_p, hi_p))
  cat(round(mean(x), 3), " (", round(ci[1], 3), ",", round(ci[2], 3), ")\n", sep = "")
}
cat("Data set 1\n")Data set 1summarize_posterior(predict_1)2.019 (0.12,3.976)cat("  True answer: ", beta0 + as.matrix(new_data_1) %*% beta, "\n", 
    sep = "")  True answer: 5.583683cat("\nData set 2\n")
Data set 2summarize_posterior(predict_2)0.963 (-0.733,2.68)cat("  True answer: ", beta0 + as.matrix(new_data_2) %*% beta, "\n", 
    sep = "")  True answer: 5.711435Those are pretty close to the Lasso predictions, but we have the advantage that they include an indication of how reliable the estimates are, and you can see that we don’t have good evidence that the predictions are different from one another even though the point estimates look fairly different. That’s the good news. The bad news is that neither of the prediction intervals include the true value.(Again, the true values differ because the scaling differs between the data sets.) If you’re getting the message that out of sample extrapolation is tricky, you’re getting the right message. Just imagine how much trickier it would be if the true relationship were non-linear rather than linear.
If you’re still paying attention (a) I congratulate you and (b) I owe you an explanation of how I set prior_coeff in the call to stan_glm(). The explanation is simple. I followed the advice in the help page for “Prior distributions and options” in the rstanarm package, i.e., help("priors"). I set it to “the ratio of the expected number of non-zero coefficients divided by the square root of the number of observations.” In fact, I cheated even a little more than that. I simply modified the code at https://mc-stan.org/projpred/articles/quickstart.html.6 I haven’t played around with other values, but I encourage you to try some. For example, instead of 3 in the code above try 1 and 8 to see how different your results are.
mtcars$mpg10 <- mtcars$mpg/10
fit <- stan_glm(mpg10 ~ wt + cyl + am, data = mtcars, QR = FALSE, refresh = 0)
prior_summary(fit)Priors for model 'fit' 
------
Intercept (after predictors centered)
 ~ normal(location = 0, scale = 10)
     **adjusted scale = 6.03
Coefficients
 ~ normal(location = [0,0,0], scale = [2.5,2.5,2.5])
     **adjusted scale = [1.54,0.84,1.51]
Auxiliary (sigma)
 ~ exponential(rate = 1)
     **adjusted scale = 0.60 (adjusted rate = 1/adjusted scale)
------
See help('prior_summary.stanreg') for more detailsPironen, J., and A. Vehtari. 2017. Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics 11(2):5018-5051. doi: 10.1214/17-EJS1337SI↩
Note: You can check this easily using prior_summary(). See above (in HTML) or here (in an R notebook) to see how, using the builtin mtcars data set as an example.↩
If you’ve been following along this far, you probably don’t regard this last step as qualifying for the word “just”, but it is the last step - really.↩
If you’re uncomfortable with Bayesian inference, it’s worth noting that the covariates identified in the two analyses are pretty similar. In fact, I singled out the same covariates as important in the analysis of data set 2 here as the Lasso identified, and all of the covariates I identified here in the analysis of data set 1 were also identified in analysis using the Lasso (the Lasso identified three more).↩
In the sense that the 95% credible interval for x3 overlaps 0 in analysis of the first data set and doesn’t in the second.↩
If you visit https://mc-stan.org/projpred/articles/quickstart.html, you’ll see that it describes a package called prodprej. We’ll explore that in the next installment.↩