---
title: "Introduction to the comparer R package"
author: "Collin Erickson"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Introduction to the comparer R package}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
references:
- id: gneiting
title: Strictly proper scoring rules, prediction, and estimation
author:
- family: Gneiting
given: Tilmann
- family: Raftery
given: Adrian E.
container-title: Journal of the American Statistical Association
publisher: Taylor \& Francis
page: 359-378
type: article-journal
issued:
year: 2007
---
```{r setup, echo=FALSE}
set.seed(0)
```
When coding, especially for data science, there are multiple
ways to solve each problem.
When presented with two options, you want to pick the one
that is faster and/or more accurate.
Comparing different code chunks on the same task can be tedious.
It often requires creating data, writing a for loop
(or using `sapply`), then comparing.
The comparer package makes this comparison quick and simple:
* The same data can be given in to each model.
* Various metrics can be used to judge results,
including using the predicted errors from the code.
* The results are displayed in a table that allows you
to quickly judge the results.
This document introduces the main function of the `comparer` package, `mbc`.
## Motivation from `microbenchmark`
The R package `microbenchmark` provides the fantastic eponymous function.
It makes it simple to run different segments of code and see which is faster.
Borrowing an example from http://adv-r.had.co.nz/Performance.html,
the following shows how it gives a summary of how fast each ran.
```{r}
if (requireNamespace("microbenchmark", quietly = TRUE)) {
x <- runif(100)
microbenchmark::microbenchmark(sqrt(x), x ^ .5)
} else {
"microbenchmark not available on your computer"
}
```
However it gives no summary of the output.
For this example it is fine since the output is deterministic,
but when working with randomness or model predictions we want
to have some sort of summary or evaluation metric to see which
has better accuracy, or to just see how the outputs differ.
## `mbc` to the rescue
The function `mbc` in the `comparer` package was created to solve this
problem, where a comparison of the output is desired in addition to
the run time.
For example, we may wish to see how the sample size affects
an estimate of the mean of a random sample.
The following shows the results of finding the mean of 10 and 100
samples from a normal distribution.
```{r}
library(comparer)
mbc(mean(rnorm(10)), mean(rnorm(100)))
```
By default it only runs 5 trials, but this can be changed with the
`times` parameter.
The first part of the output gives the run times.
For 5 or fewer, it shows all the values in sorted order,
for more than 5 it shows summary statistics.
Unfortunately, the timing is only accurate up to 0.01 seconds,
so these all show as 0.
The second section of the output gives the summary of the output.
This also will show summary stats for more than 5 trials,
but for this small sample size it shows all the values in sorted order
with the mean and standard deviation given.
The first column shows the name of each,
and the second column shows which output statistic is given.
Since there is only one output for this code it is called "1".
Setting `times` changes the number of trials run.
Below the same example as above is run but for 100 trials.
```{r}
mbc(mean(rnorm(10)), mean(rnorm(100)), times=100)
```
We see that the mean of both is around zero,
but that the larger sample size (`mean(rnorm(100))`) has a tighter
distribution and a standard deviation a third as large as the other,
which is about what we expect for a sample that is 10 times larger
(it should be $\sqrt{10} \approx 3.16$ times smaller on average).
In this example each function had its own input,
but many times we want to compare the functions
on the same input for better comparison.
## Shared input
Input can be passed in to the `input` argument as a list,
and then the code will be evaluated in an environment with that data.
In this example we compare the functions `mean` and `median` on random
data from an exponential distribution.
The mean should be about 1, while the median should be about
$\ln(2)=0.693$.
```{r}
mbc(mean(x), median(x), input=list(x=rexp(30)))
```
In this case each evaluation is identical since the input is not random.
The data passed to `input` is kept as is, so there is no randomness
from the data.
If we want randomness in the data, we can use `inputi`,
which evaluates its argument as an expression,
meaning that each time it will be different.
Below is the same code as above except `inputi` is used with
`x` set in brackets instead of a list.
We see there is randomness and we can get an idea of the distribution
of the median and mean.
```{r}
mbc(mean(x), median(x), inputi={x=rexp(30)})
```
When the code chunks to evaluate are simple functions of a single variable,
this can be simplified.
Look how simple it is to run a test on these!
```{r}
mbc(mean, median, inputi=rexp(30))
```
## Comparing to expected values
The previous comparisons showed a summary of the outputs,
but many times we want to compare output values to true values,
then calculate a summary statistic, such as an average error.
The argument `target` specifies the values the code chunks should
give, then summary statistics can be calculated by specifying `metrics`,
which defaults to calculating the rmse.
For example, suppose we have data from a linear function,
and want to see how accurate the model is when the output
values are corrupted with noise.
Below we compare two linear models: the first with an intercept term,
and the second without.
The model with the intercept term should be much better
since the data has an intercept of $-0.6$.
We see that the output is different in a few ways now.
The `Stat` column tells what the row is showing.
These all say `rmse`, meaning they are giving the root mean squared error
of the predicted values compared to the true `y`.
There's also a new section at the bottom title `Compare`.
This compares the `rmse` values from the two methods,
and does a t-test to see if the difference is significant.
However, since there is no randomness, it fails to perform the t-test.
```{r}
n <- 20
x <- seq(0, 1, length.out = n)
y <- 1.8 * x - .6
ynoise <- y + rnorm(n, 0, .2)
```
```{r}
mbc(predict(lm(ynoise ~ x), data.frame(x)),
predict(lm(ynoise ~ x - 1), data.frame(x)),
target = y)
```
To add randomness we can simply define `ynoise` in the `inputi` argument,
as shown below.
Now there is randomness in the data, so a paired t-test can be computed.
It is paired since the same `ynoise` is given to each model.
We see that even with only a sample size of 5,
the p-value is highly significant.
```{r}
mbc(predict(lm(ynoise ~ x), data.frame(x)),
predict(lm(ynoise ~ x - 1), data.frame(x)),
inputi={ynoise <- y + rnorm(n, 0, .2)},
target = y)
```
## Simplifying with `evaluator`
Many times the code chunks we want to compare only differ by a small amount,
such as a single argument.
In the example above, the only difference is the formula in the `lm` command.
With `mbc`, the `evaluator` can be set to make these cases easier.
The argument for `evaluator` should be an expression including `.`,
which will be replaced with the code chunks provided.
The example below rewrites the above comparison using `evaluator`.
```{r}
mbc(ynoise ~ x,
ynoise ~ x - 1,
evaluator=predict(lm(.), data.frame(x)),
inputi={ynoise <- y + rnorm(n, 0, .2)},
target = y)
```
## K-Fold Cross Validation
K-fold cross validation can also be done using `mbc`
using the `kfold` parameter.
K-fold cross validation involves splitting $N$ data points into
$k$ groups.
`kfold` should specify what this $N$ is, since it depends on the data.
By default it will set the number of folds, $k$, to be `times`.
Then each replicate will be evaluating a single fold.
Note that this will not do $k$ folds $times$ times.
To make $k$ different from $times$, pass in $kfold$ as a vector
whose second element in the number of folds.
For example, suppose you have 100 data points,
want to do 5 folds,
and repeat this process twice (i.e. evaluate 10 folds).
Then you should pass in $kfold=c(100,5)$ and $times=10$.
The first five trials would then the the five separate folds.
The sixth through tenth trials would be a new partition of
the data into five folds.
Then to use this folds you must use `ki`
as part of an expression in the code chunk or `inputi`.
The following shows how to use k-fold cross validation
fitting a linear model to the `cars` dataset.
Setting `kfold=c(nrow(cars), 5)` tells it that you want to use
5 folds on the `cars` data set. It has 50 rows, so in each trial
`ki` is a subset of `1:50` of 40 elements.
Setting `times=30` means that we are repeating the five folds
six times.
The code chunk fits the model, makes predictions on the hold-out data,
and calculates the RMSE.
```{r kfold_cars_ex}
mbc({mod <- lm(dist ~ speed, data=cars[ki,])
p <- predict(mod,cars[-ki,])
sqrt(mean((p - cars$dist[-ki])^2))
},
kfold=c(nrow(cars), 5),
times=30)
```
The following example simplifies this a little.
Setting `targetin` tells it what the input to `predict` should be
and setting `target="dist"` tells it that the target is the
`dist` element from `targetin`.
You cannot set `target=cars$dit[-ki]` since `target`
cannot be evaluated as an expression.
```{r kfold_cars_ex2}
mbc(lm(dist ~ speed, data=cars[ki,]),
targetin=cars[-ki,], target="dist",
kfold=c(nrow(cars), 5),
times=30)
```
## Metrics
In the previous example, the output shows that the "Stat" is "rmse",
meaning that it calculated the root-mean-square error from the.
predictions and target values.
The metric, or statistic, calculated can be changed using
the `metric` argument, which defaults to `rmse`.
Three of the other options for `metric` are `t`, `mis90`, and `sr27`.
These three all compare target values ($y$) to
predicted values ($\hat{y}$) and
predicted errors ($s$).
These only work for models that give predicted errors,
such as Gaussian process models.
### `metric=t`
Using these the target value, predicted value, and predicted error,
we can calculate a t-score.
$$ t = \frac{\hat{y} - y}{s} $$
The output then shows the distribution of these t-scores
by showing the six number summary.
```{r kfold_cars_metric_t}
mbc(lm(dist ~ speed, data=cars[ki,]),
targetin=cars[-ki,], target="dist",
kfold=c(nrow(cars), 5),
times=30,
metric='t')
```
### `metric=mis90`
The t-score metric is not very informative because you can get the
same t-scores by having a large error and large predicted error
as having a small error and small predicted error.
`mis90` is the mean interval score for 90\% coverage intervals
as described by @gneiting [Equation 43].
$$ 3.28s + 20 \left( \hat{y} - y - 1.64s \right)^+ + 20 \left( y - \hat{y} - 1.64s \right)^+ $$
where $()^+$ denotes the positive part of what is in the parentheses.
Smaller values are better.
This metric penalizes having large predicted errors and having
actual errors different from the predicted errors,
so it is very good for judging the accuracy of a prediction interval.
### `metric=sr27`
The scoring rule in Equation 27 @gneiting is another proper scoring rule.
$$ -\left( \frac{\hat{y} - y}{s} \right)^2 - \log s^2 $$
For this metric, larger values are better.
A problem with this metric is that if $s=0$, which can happen from
numerical issues, then it will go to infinity,
which does not happen with the mean interval score.
## Running time-consuming experiments with `ffexp`
The other main function of the package is `ffexp`, an abbreviation for
full-factorial experiment.
It will run a function using all possible combinations of input parameters
given.
It is useful for running experiments that take a long time to complete.
The first arguments given to `ffexp$new` should give the possible values for
each input parameter.
In the example below, `a` can be 1, 2, or 3,
and `b` can "a", "b", or "c".
Then `eval_func` should be given that can operate on these parameters.
For example, using `eval_func = paste` will paste together the value
of `a` with the value of `b`.
```{r}
f1 <- ffexp$new(
a=1:3,
b=c("a","b","c"),
eval_func=paste
)
```
After creating the `ffexp` object, we can call
`f1$run_all` to run `eval_func` on every combination of `a` and `b`.
```{R}
f1$run_all()
```
Now to see the results in a clean format, look at `f1$outcleandf`.
```{r}
f1$outcleandf
```
## References