# Leave-one-out cross-validation and error correction

#### 2021-04-06

Cross-validation is often used in machine learning to judge how well a model fits. Instead of fitting the model to the entire data set, we fit it to one part of the data and then test it on the remaining part. This gives an idea of how well the model will generalize to independent data.

## Leave-one-out predictions using Gaussian processes

Leave-one-out prediction fits a model to all of the data except a single point, and then makes a prediction at that point, which can be compared to the actual value. This may seem very expensive to do, but for a Gaussian process model it is actually an inexpensive computation, as long as the parameters estimated from the full model are reused. Reusing the parameters will bias the predictions toward better results than if they were re-estimated for each fit.

Normally each prediction point requires solving a matrix equation. To predict the output $$y$$ at a point $$\mathbf{x}$$, given input data in matrix $$X_2$$ and output $$\mathbf{y_2}$$, we use the equation $\hat{y} = \hat{\mu} + R(\mathbf{x},~X_2) R(X_2)^{-1}( \mathbf{y_2} - \hat{\mu}\mathbf{1_n})$ For leave-one-out predictions, the matrix $$X_2$$ contains all the design points except the one we are predicting at, and thus is different for each prediction. However, we already have the correlation matrix $$R$$ for the full data set from estimating the parameters, and there is a shortcut to find the inverse of a matrix with a single row and column left out.
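As a concrete illustration, here is a minimal sketch of this prediction equation in R. The correlation function, the data, and the plug-in mean estimate `mu_hat` are all made up for illustration, not from any fitted model.

```r
# Sketch of the GP predictive mean; the correlation function, data, and
# plug-in estimate mu_hat are made-up illustrations, not fitted values.
corr <- function(x, y) exp(-30 * sum((x - y)^2))
n <- 20
X2 <- matrix(runif(n * 2), ncol = 2)              # design matrix
y2 <- sin(2 * pi * X2[, 1]) + X2[, 2]             # observed outputs
R2 <- outer(1:n, 1:n, Vectorize(function(i, j) corr(X2[i, ], X2[j, ])))
mu_hat <- mean(y2)                                # crude stand-in estimate of mu
x_new <- c(0.5, 0.5)                              # prediction point
r_vec <- apply(X2, 1, function(xi) corr(x_new, xi))   # R(x, X2)
y_hat <- drop(mu_hat + r_vec %*% solve(R2, y2 - mu_hat))
```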

There is a significant speed-up from using a multiplication instead of a matrix solve. The code chunk below shows that a matrix solve with a 200-row square matrix is roughly fifty times slower (comparing median times) than a matrix multiplication.

```r
n <- 200
m1 <- matrix(runif(n * n), ncol = n)
b1 <- runif(n)
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  microbenchmark::microbenchmark(solve(m1, b1), m1 %*% b1)
}
```

```
## Unit: microseconds
##           expr      min       lq       mean   median       uq      max neval cld
##  solve(m1, b1) 2170.372 2184.184 2304.73951 2211.632 2286.168 8277.518   100   b
##      m1 %*% b1   37.667   38.164   48.88542   39.753   42.956  722.822   100  a
```

### Getting the inverse of a submatrix

Suppose we have a matrix $$K$$ and know its inverse $$K^{-1}$$. Suppose that $$K$$ has the block structure $K = \begin{bmatrix} A & B \\ C & D \end{bmatrix}$ Now we want to find $$A^{-1}$$ using $$K^{-1}$$, instead of computing the inverse from scratch. We can write $$K^{-1}$$ with the same block structure $K^{-1} = \begin{bmatrix} E & F \\ G & H \end{bmatrix}$

Now we use the fact that $$K K^{-1} = I$$ $\begin{bmatrix} I & 0 \\ 0 & I \end{bmatrix} = \begin{bmatrix} A & B \\ C & D \end{bmatrix}\begin{bmatrix} E & F \\ G & H \end{bmatrix}$

This gives the four equations $AE + BG = I$, $AF + BH = 0$, $CE + DG = 0$, and $CF + DH = I$.

Solving the first equation for $$A$$, we have $$AE = I - BG$$, so $A = (I - BG)E^{-1}$ and therefore $A^{-1} = E (I - BG) ^{-1}$
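A quick numerical check of this identity, sketched with an arbitrary random matrix (the block sizes and the added diagonal are arbitrary choices, made just to keep the matrix well conditioned):

```r
# Check A^{-1} = E (I - BG)^{-1} on a random block-partitioned matrix
set.seed(1)
n <- 6; p <- 4                                  # K is n x n, A is the top-left p x p block
K <- matrix(rnorm(n * n), ncol = n) + diag(n)   # random, almost surely invertible
Kinv <- solve(K)
A <- K[1:p, 1:p];    B <- K[1:p, (p + 1):n]
E <- Kinv[1:p, 1:p]; G <- Kinv[(p + 1):n, 1:p]
max(abs(solve(A) - E %*% solve(diag(p) - B %*% G)))  # should be near machine epsilon
```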

### Leave-one-out covariance matrix inverse for Gaussian processes

For Gaussian processes we can consider the block structure of the covariance (or correlation) matrix when a single row and its corresponding column are removed. Let the first $$n-1$$ rows and columns hold the covariances among the points in the design matrix $$X$$, while the last row and column hold the covariances between the point $$\mathbf{x}$$ and $$X$$, along with the variance of $$\mathbf{x}$$ itself. Then we can write

$K = \begin{bmatrix} C(X,X) & C(X,\mathbf{x}) \\ C(\mathbf{x},X) & C(\mathbf{x},\mathbf{x}) \end{bmatrix}$

Using the notation from the previous subsection we have $$A = C(X,X)$$ and $$B=C(X,\mathbf{x})$$, and $$E$$ and $$G$$ will be submatrices of the full $$K^{-1}$$. $$B$$ is a column vector, so I’ll write it as a vector $$\mathbf{b}$$, and $$G$$ is a row vector, so I’ll write it as a vector $$\mathbf{g}^T$$. So we have $C(X,X)^{-1} = E(I-\mathbf{b}\mathbf{g}^T)^{-1}$ To calculate $A^{-1} = E (I - \mathbf{b}\mathbf{g}^T) ^{-1}$ we still have to invert $$I-\mathbf{b}\mathbf{g}^T$$, which is a large matrix. However, this can be done efficiently using the Sherman-Morrison formula, since $$\mathbf{b}\mathbf{g}^T$$ is a rank-one matrix. Applying the formula with the rank-one update $$-\mathbf{b}\mathbf{g}^T$$ to the identity matrix gives $(I - \mathbf{b}\mathbf{g}^T)^{-1} = I^{-1} + \frac{I^{-1}\mathbf{b}\mathbf{g}^TI^{-1}}{1-\mathbf{g}^TI^{-1}\mathbf{b}} = I + \frac{\mathbf{b}\mathbf{g}^T}{1-\mathbf{g}^T\mathbf{b}}$ Thus we have a shortcut for $$A^{-1}$$ that uses only multiplication $A^{-1} = E \left(I + \frac{\mathbf{b}\mathbf{g}^T}{1-\mathbf{g}^T\mathbf{b}}\right)$
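This version of the formula is easy to sanity-check numerically. A small sketch with arbitrary random vectors (the scaling of `g` is only there to keep $$\mathbf{g}^T\mathbf{b}$$ away from 1, where the matrix becomes singular):

```r
# Verify (I - b g^T)^{-1} = I + b g^T / (1 - g^T b) for random vectors
set.seed(2)
m <- 5
b <- runif(m)
g <- runif(m) / m                     # scaled so g'b stays well below 1
lhs <- solve(diag(m) - b %*% t(g))    # direct inverse
rhs <- diag(m) + (b %*% t(g)) / (1 - sum(g * b))
max(abs(lhs - rhs))                   # should be near machine epsilon
```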

To speed up the computation of $$A^{-1}$$ further, it should be calculated as

$A^{-1} = E + \frac{(E\mathbf{b})\mathbf{g}^T}{1-\mathbf{g}^T\mathbf{b}}$ This avoids the matrix-matrix product, since $$E\mathbf{b}$$ is a matrix-vector product and $$(E\mathbf{b})\mathbf{g}^T$$ is an outer product. The code below demonstrates that we get a speedup of almost twenty by using this shortcut.

```r
set.seed(0)
# Gaussian (squared exponential) correlation function
corr <- function(x, y) { exp(sum(-30 * (x - y)^2)) }
n <- 200
d <- 2
X <- matrix(runif(n * d), ncol = d)
# Correlation matrix of the full design and its inverse
R <- outer(1:n, 1:n, Vectorize(function(i, j) { corr(X[i, ], X[j, ]) }))
Rinv <- solve(R)
# Leave out the last point: direct inverse vs. the rank-one shortcut
A <- R[-n, -n]
Ainv <- solve(A)
E <- Rinv[-n, -n]
b <- R[n, -n]
g <- Rinv[n, -n]
Ainv_shortcut <- E + E %*% b %*% g / (1 - sum(g * b))
summary(c(Ainv - Ainv_shortcut))
```

```
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
## -4.542e-03 -2.300e-08  0.000e+00  2.000e-09  2.300e-08  3.052e-03
```

```r
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  microbenchmark::microbenchmark(solve(A), E + E %*% b %*% g / (1 - sum(g * b)))
}
```
```
## Unit: microseconds
##                                expr      min        lq      mean    median        uq      max neval cld
##                            solve(A) 5958.883 6156.8425 6334.9344 6255.1155 6429.0730 8683.355   100   b
##  E + E %*% b %*% g/(1 - sum(g * b))  273.644  281.7155  338.3906  302.5305  347.6985 2408.890   100  a
```

In terms of the correlation matrices already calculated, this is the following, where $$M_{-i}$$ is the matrix $$M$$ with the $$i$$th row and column removed, $$M_{-i,i}$$ is the $$i$$th column of $$M$$ with the $$i$$th entry removed (a column vector), and $$M_{i,-i}$$ is the $$i$$th row of $$M$$ with the $$i$$th entry removed (a row vector).

$R(X_{-i})^{-1} = (R(X)^{-1})_{-i} + \frac{\left((R(X)^{-1})_{-i}\,R(X)_{-i,i}\right) (R(X)^{-1})_{i,-i} }{1 - (R(X)^{-1})_{i,-i}\,R(X)_{-i,i}}$
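Wrapped up as a function of an arbitrary index $$i$$ (a hypothetical helper, not from any package), this could look like the sketch below, which reuses the `R` and `Rinv` objects from the demonstration above.

```r
# Hypothetical helper: inverse of R with row/column i removed, given the
# full correlation matrix R and its inverse Rinv, via the rank-one shortcut.
loo_inverse <- function(R, Rinv, i) {
  E <- Rinv[-i, -i]
  b <- R[-i, i]      # correlations between the remaining points and point i
  g <- Rinv[i, -i]   # ith row of the inverse with the ith entry removed
  E + (E %*% b) %*% g / (1 - sum(g * b))
}
max(abs(loo_inverse(R, Rinv, 3) - solve(R[-3, -3])))  # check for, e.g., i = 3
```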

### Leave-one-out prediction

Recall that the predicted mean at a new point is $\hat{y} = \hat{\mu} + R(\mathbf{x},~X_2) R(X_2)^{-1}( \mathbf{y_2} - \hat{\mu}\mathbf{1_n})$

For the leave-one-out prediction at design point $$\mathbf{x_i}$$, this becomes $\hat{y}_{-i} = \hat{\mu} + R(\mathbf{x_i},~X_{-i}) R(X_{-i})^{-1}( \mathbf{y_{-i}} - \hat{\mu}\mathbf{1_{n-1}})$ where $$R(X_{-i})^{-1}$$ is computed with the shortcut above instead of a full matrix solve.
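Putting the pieces together, all $$n$$ leave-one-out predictions can be computed with the shortcut. The sketch below reuses `X`, `R`, `Rinv`, and `n` from the demonstration above; the outputs `y` and the plug-in mean estimate `mu_hat` are made up for illustration.

```r
# All n leave-one-out predictions via the rank-one shortcut; y and mu_hat
# are made-up stand-ins, everything else reuses objects defined above.
y <- sin(2 * pi * X[, 1]) + X[, 2]   # hypothetical outputs
mu_hat <- mean(y)                    # crude stand-in for the estimated mean
loo_pred <- sapply(1:n, function(i) {
  E <- Rinv[-i, -i]
  b <- R[-i, i]
  g <- Rinv[i, -i]
  Ainv <- E + (E %*% b) %*% g / (1 - sum(g * b))  # R(X_{-i})^{-1} via the shortcut
  r_i <- R[i, -i]                                 # R(x_i, X_{-i})
  drop(mu_hat + r_i %*% (Ainv %*% (y[-i] - mu_hat)))
})
```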