---
title: "Using *outercv* with Bayesian shrinkage models"
author: "Athina Spiliopoulou, Myles Lewis"
output:
html_document:
toc: true
toc_float:
collapsed: false
toc_depth: 3
number_sections: false
fig_width: 6
vignette: >
%\VignetteIndexEntry{Using outercv with Bayesian shrinkage models}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
warning = FALSE
)
library(nestedcv)
library(pROC)
```
The two examples below implement Bayesian linear and logistic regression models
using the horseshoe prior over parameters to encourage a sparse model. Models
are fitted using the `hsstan` R package, which performs full Bayesian inference
through a `Stan` implementation. In Bayesian inference model meta-parameters
such as the amount of shrinkage are also given prior distributions and are
thus directly learned from the data through sampling. This bypasses the need to
cross-validate results over a grid of values for the meta-parameters, as would
be required to find the optimal lambda in a lasso or elastic net model.
However, Bayesian inference is computationally intensive. In high-dimensional
settings, with e.g. more than 10,000 biomarkers, pre-filtering of inputs based
on univariate measures of association with the outcome may be beneficial. If
pre-filtering of inputs is used then a cross-validation procedure is needed to
ensure that the data points used for pre-filtering and model fitting differ
from the data points used to quantify model performance. The `outercv()`
function is used to perform univariate pre-filtering and cross-validate model
performance in this setting.
#### Parallelisation
CAUTION should be used when setting the number of cores available for
parallelisation. The default setting in `hsstan` is to use 4 cores to
parallelise Markov chains in the Bayesian inference procedure. This can be
switched off either by setting `options(mc.cores = 1)`.
Argument `cv.cores` in `outercv()` controls parallelisation over the outer CV
folds. On unix/mac setting `cv.cores` to >1 will induce nested parallelisation
which will generate an error, unless parallelisation of the chains is disabled
by setting `options(mc.cores = 1)`.
Nested parallelisation is feasible if `cv.cores` is >1 and
`multicore_fork = FALSE` is set as this uses cluster based parallelisation
instead of `mclapply`. Beware that large numbers of processes may be spawned:
the code will use the product of `cv.cores` and `mc.cores`. If we are performing
10-fold cross-validation with 4 chains and set `cv.cores = 10` then 40 processes
will be invoked simultaneously.
#### Linear regression with a Bayesian shrinkage model (continuous outcomes)
We use cross-validation and apply univariate filtering of predictors and
model fitting in one part of the data (training fold), followed by evaluation
of model performance on the left-out data (testing fold), repeated in each fold.
Only one cross-validation split is needed (function `outercv()`) as the Bayesian
model does not require cross-validation for meta-parameters.
Note, in the examples below length of sampling and number of chains is curtailed
for speed. We recommend 4 chains, `warmup` 1000 and `iter` 2000 in practice.
```{r eval = FALSE}
library(hsstan)
# number of cores for parallelising hsstan sampling
# store original options in oldopt
# at the end reset options to old configuration
oldopt <- options(mc.cores = 2)
# load iris dataset and simulate a continuous outcome
data(iris)
dt <- iris[, 1:4]
colnames(dt) <- c("marker1", "marker2", "marker3", "marker4")
dt <- as.data.frame(apply(dt, 2, scale))
dt$outcome.cont <- -3 + 0.5 * dt$marker1 + 2 * dt$marker2 + rnorm(nrow(dt), 0, 2)
# unpenalised covariates: always retain in the prediction model
uvars <- "marker1"
# penalised covariates: coefficients are drawn from hierarchical shrinkage prior
pvars <- c("marker2", "marker3", "marker4") # penalised covariates
# run cross-validation with univariate filter and hsstan
res.cv.hsstan <- outercv(y = dt$outcome.cont, x = dt[, c(uvars, pvars)],
model = "model.hsstan",
filterFUN = lm_filter,
filter_options = list(force_vars = uvars,
nfilter = 2,
p_cutoff = NULL,
rsq_cutoff = 0.9),
chains = 2, # chains parallelised via options(mc.cores)
n_outer_folds = 3, cv.cores = 1, # CV folds not parallelised
unpenalized = uvars, warmup = 100, iter = 200)
```
We can then view prediction performance based on the testing folds and examine
the Bayesian model using the `hsstan` package.
```{r eval = FALSE}
summary(res.cv.hsstan)
#> Single cross-validation to measure performance
#> Outer loop: 3-fold CV
#> No inner loop
#> 150 observations, 4 predictors
#>
#> Model: model.hsstan
#> Filter: lm_filter
#> n.filter
#> Fold 1 3
#> Fold 2 3
#> Fold 3 3
#>
#> Final fit: mean sd 2.5% 97.5% n_eff Rhat
#> (Intercept) -3.17 0.14 -3.40 -2.89 221 0.99
#> marker1 0.37 0.32 -0.36 0.90 209 1.00
#> marker2 2.07 0.22 1.70 2.47 196 1.00
#> marker3 0.11 0.31 -0.43 0.92 104 1.00
#>
#> Result:
#> RMSE Rsquared MAE
#> 2.1226 0.4772 1.6971
sampler.stats(res.cv.hsstan$final_fit)
#> accept.stat stepsize divergences treedepth gradients warmup sample
#> chain:1 0.9843 0.0241 0 8 14844 0.25 0.12
#> chain:2 0.9722 0.0316 0 8 11356 0.12 0.09
#> all 0.9783 0.0279 0 8 26200 0.37 0.21
print(projsel(res.cv.hsstan$final_fit), digits = 4) # adding marker2
#> Model KL ELPD
#> Intercept only 0.32508 -374.99055
#> Initial submodel 0.32031 -374.72416
#> marker2 0.00151 -325.82289
#> marker3 0.00000 -325.88824
#> var kl rel.kl.null rel.kl elpd delta.elpd
#> 1 Intercept only 0.325077 0.00000 NA -375.0 -49.10230
#> 2 Initial submodel 0.320306 0.01468 0.0000 -374.7 -48.83591
#> 3 marker2 0.001506 0.99537 0.9953 -325.8 0.06535
#> 4 marker3 0.000000 1.00000 1.0000 -325.9 0.00000
```
Here adding `marker2` improves the model fit: substantial decrease of
KL-divergence from the full model to the submodel. Adding `marker3` does not
improve the model fit: no decrease of KL-divergence from the full model to the
submodel.
#### Logistic regression with a Bayesian shrinkage model (binary outcomes)
We use cross-validation and apply univariate filtering of predictors and
model fitting in one part of the data (training fold), followed by evaluation
of model performance on the left-out data (testing fold), repeated in each fold.
Only one cross-validation split is needed (function `outercv`) as the Bayesian
model does not require cross-validation for meta-parameters.
```{r eval = FALSE}
# sigmoid function
sigmoid <- function(x) {1 / (1 + exp(-x))}
# load iris dataset and create a binary outcome
set.seed(267)
data(iris)
dt <- iris[, 1:4]
colnames(dt) <- c("marker1", "marker2", "marker3", "marker4")
dt <- as.data.frame(apply(dt, 2, scale))
rownames(dt) <- paste0("sample", c(1:nrow(dt)))
dt$outcome.bin <- sigmoid(0.5 * dt$marker1 + 2 * dt$marker2) > runif(nrow(dt))
dt$outcome.bin <- factor(dt$outcome.bin)
# unpenalised covariates: always retain in the prediction model
uvars <- "marker1"
# penalised covariates: coefficients are drawn from hierarchical shrinkage prior
pvars <- c("marker2", "marker3", "marker4") # penalised covariates
# run cross-validation with univariate filter and hsstan
res.cv.hsstan <- outercv(y = dt$outcome.bin,
x = as.matrix(dt[, c(uvars, pvars)]),
model = "model.hsstan",
filterFUN = ttest_filter,
filter_options = list(force_vars = uvars,
nfilter = 2,
p_cutoff = NULL,
rsq_cutoff = 0.9),
chains = 2, # parallelise over chains
n_outer_folds = 3, cv.cores = 1,
unpenalized = uvars, warmup = 100, iter = 200)
```
We view prediction performance based on testing folds and examine the model.
```{r eval = FALSE}
summary(res.cv.hsstan)
#> Single cross-validation to measure performance
#> Outer loop: 3-fold CV
#> No inner loop
#> 150 observations, 4 predictors
#> FALSE TRUE
#> 78 72
#>
#> Model: model.hsstan
#> Filter: ttest_filter
#> n.filter
#> Fold 1 3
#> Fold 2 3
#> Fold 3 3
#>
#> Final fit: mean sd 2.5% 97.5% n_eff Rhat
#> (Intercept) -0.12 0.23 -0.56 0.27 200 1
#> marker1 0.50 0.30 -0.10 1.16 208 1
#> marker2 1.91 0.36 1.26 2.66 263 1
#> marker3 0.01 0.24 -0.55 0.69 171 1
#>
#> Result:
#> Reference
#> Predicted FALSE TRUE
#> FALSE 56 23
#> TRUE 22 49
#>
#> AUC Accuracy Balanced accuracy
#> 0.8284 0.7000 0.6993
# examine the Bayesian model
print(projsel(res.cv.hsstan$final_fit), digits = 4) # adding marker2
#> Model KL ELPD
#> Intercept only 0.20643 -104.26957
#> Initial submodel 0.20118 -104.17411
#> marker2 0.00060 -73.84232
#> marker3 0.00000 -73.93210
#> var kl rel.kl.null rel.kl elpd delta.elpd
#> 1 Intercept only 2.064e-01 0.00000 NA -104.27 -30.33748
#> 2 Initial submodel 2.012e-01 0.02543 0.000 -104.17 -30.24201
#> 3 marker2 5.964e-04 0.99711 0.997 -73.84 0.08977
#> 4 marker3 9.133e-18 1.00000 1.000 -73.93 0.00000
options(oldopt) # reset options
```
Here adding `marker2` improves the model fit: substantial decrease of
KL-divergence from the full model to the submodel. Adding `marker3` does not
improve the model fit: no decrease of KL-divergence from the full model to the
submodel.
# Note
At time of writing, there appears to be a bug in `rstan` (used by `hsstan`)
leading to it ignoring the pass-thru argument `cores` and instead spawning
multiple processes as specified by the `chain` argument. This behaviour can be
limited by setting `options(mc.cores = ..)`.
# Troubleshooting
A key problem with parallelisation in R is that errors, warnings and user input
have to be suppressed during multicore processing. If a `nestedcv` call is not
working, we recommend that you try it with `cv.cores = 1` first to check it
starts up without error messages.
# Citation
If you use this package, please cite as:
Lewis MJ, Spiliopoulou A, Goldmann K, Pitzalis C, McKeigue P, Barnes MR (2023).
nestedcv: an R package for fast implementation of nested cross-validation with
embedded feature selection designed for transcriptomics and high dimensional
data. *Bioinformatics Advances*. https://doi.org/10.1093/bioadv/vbad048
# References
Carpenter, B., et al. Stan: A Probabilistic Programming Language. *Journal of Statistical Software* 2017;76(1):1-32.
Piironen, J. and Vehtari, A. Sparsity information and regularization in the horseshoe and other shrinkage priors. *Electronic Journal of Statistics* 2017;11(2):5018-5051, 5034.