Robust TOST Procedures

Aaron R. Caldwell


In this package there are currently 2 functions that provide robust alternatives to the t_TOST function.

Wilcoxon TOST

The Wilcoxon group of tests (includes Mann-Whitney U-test) provide a non-parametric test of differences between groups, or within samples, based on ranks. This provides a test of location shift, which is a fancy way of saying differences in the center of the distribution (i.e., in parametric tests the location is mean). With TOST, there are two separate tests of directional location shift to determine if the location shift is within (equivalence) or outside (minimal effect). The exact calculations can be explored via the documentation of the wilcox.test function.

In the TOSTER package, we accomplish this with the wilcox_TOST function. Overall, this function operates extremely similar to the t_TOST function. However, the standardized mean difference (SMD) is not calculated. Instead the rank-biserial correlation is calculated for all types of comparisons (e.g., two sample, one sample, and paired samples). Also, there is no plotting capability at this time for the output of this function.

As an example, we can use the sleep data to make a non-parametric comparison of equivalence.


test1 = wilcox_TOST(formula = extra ~ group,
                      data = sleep,
                      paired = FALSE,
                      low_eqbound = -.5,
                      high_eqbound = .5)

## Wilcoxon rank sum test with continuity correction
## Hypothesis Tested: Equivalence
## Equivalence Bounds (raw):-0.500 & 0.500
## Alpha Level:0.05
## The equivalence test was non-significant W = 20.000, p = 8.94e-01
## The null hypothesis test was non-significant W = 25.500, p = 6.93e-02
## NHST: don't reject null significance hypothesis that the effect is equal to zero 
## TOST: don't reject null equivalence hypothesis
## TOST Results 
##            statistic    p.value
## NHST            25.5 0.06932758
## TOST Lower      34.0 0.89385308
## TOST Upper      20.0 0.01287404
## Effect Sizes 
##                            estimate conf.level
## Median of Differences     -1.346388 -3.3999651 -0.09995341        0.9
## rank-biserial correlation -0.490000 -0.7492521 -0.10053222        0.9

Rank-Biserial Correlation

The standardized effect size reported for the wilcox_TOST procedure is the rank-biserial correlation. This is a fairly intuitive measure of effect size which has the same interpretation of the common language effect size (Kerby 2014). However, instead of assuming normality and equal variances, the rank-biserial correlation calculates the number of favorable (positive) and unfavorable (negative) pairs based on their respective ranks.

Overall, the correlation is calculated as the proportion of favorable pairs minus the unfavorable pairs.

\[ r_{biserial} = f_{pairs} - u_{pairs} \]

Confidence Intervals

The Fisher approximation is used to calculate the confidence intervals.

For paired samples the standard error is calculated as the following:

\[ SE_r = \sqrt{ \frac {(2 \cdot nd^3 + 3 \cdot nd^2 + nd) / 6} {(nd^2 + nd) / 2} } \]

wherein, nd represents the total count not equal to zero.

For independent samples, the standard error is calculated as the following:

\[ SE_r = \sqrt{\frac {(n1 + n2 + 1)} { (3 \cdot n1 \cdot n2)}} \]

The confidence intervals can then be calculated by transforming the estimate.

\[ r_z = atanh(r_{biserial}) \]

Then the confidence interval can be calculated and back transformed.

\[ r_{CI} = tanh(r_z \pm Z_{(1 - \alpha / 2)} \cdot SE_r) \]

We hope a bootstrapped version of this confidence interval will be added in future updates.

Boostrap TOST

The boostrap is a simulation based technique, derived from resampling with replacement, designed for statistical estimation and inference. Bootsrapping techniques are very useful because they are considered somewhat robust to the violations of assumptions for a simple t-test. Therefore we added a bootsrap option, boot_t_TOST to the package to provide another robust alternative to the t_TOST function.

In this function we provide a percentile bootstrap solution outlined by Efron and Tibshirani (1993) (see chapter 16, page 220). The bootstrapped p-values are derived from the “studentized” version of a test of mean differences outlined by Efron and Tibshirani (1993). Overall, the results should be similar to the results of t_TOST. However, for paired samples, the Cohen’s d(rm) effect size cannot be calculated at this time.

Two Sample Algorithm

  1. Form B bootstrap data sets from x* and y* wherein x* is sampled with replacement from \(\tilde x_1,\tilde x_2, ... \tilde x_n\) and y* is sampled with replacement from \(\tilde y_1,\tilde y_2, ... \tilde y_n\)

  2. t is then evaluated on each sample, but the mean of each sample (y or x) and the overall average (z) are subtracted from each

\[ t(z^{*b}) = \frac {(\bar x^*-\bar x - \bar z) - (\bar y^*-\bar y - \bar z)}{\sqrt {sd_y^*/n_y + sd_x^*/n_x}} \]

  1. An approximate p-value can then be calculated as the number of bootstrapped results greater than the observed t-statistic from the sample.

\[ p_{boot} = \frac {\#t(z^{*b}) \ge t_{sample}}{B} \]

The same process is completed for the one sample case but with the one sample solution for the equation outlined by \(t(z^{*b})\). The paired sample case in this bootstrap procedure is equivalent to the one sample solution because the test is based on the difference scores.


Again, we can use the sleep data to see the bootstrapped results. Notice that the plots show how the resampling via boostrapping indicates the instability of Hedges’ d(z).


test1 = boot_t_TOST(formula = extra ~ group,
                      data = sleep,
                      paired = TRUE,
                      low_eqbound = -.5,
                      high_eqbound = .5,
                    R = 999)

## Bootstrapped Paired t-test
## Hypothesis Tested: Equivalence
## Equivalence Bounds (raw):-0.500 & 0.500
## Alpha Level:0.05
## The equivalence test was non-significant, t(9) = -2.777, p = 1e+00
## The null hypothesis test was significant, t(9) = -4.062, p = 0e+00
## NHST: reject null significance hypothesis that the effect is equal to zero 
## TOST: don't reject null equivalence hypothesis
## TOST Results 
##                    t        SE df p.value
## t-test     -4.062128 0.3889587  9       0
## TOST Lower -2.776644 0.3889587  9       1
## TOST Upper -5.347611 0.3889587  9       0
## Effect Sizes 
##               estimate        SE conf.level
## Raw          -1.580000 0.3600943 -2.250000 -1.0600000        0.9
## Hedges' g(z) -1.230152 0.7235322 -2.815123 -0.9120299        0.9
## Note: percentile boostrap method utilized.


Efron, Bradley, and Robert J. Tibshirani. 1993. An Introduction to the Bootstrap. Monographs on Statistics and Applied Probability 57. Boca Raton, Florida, USA: Chapman & Hall/CRC.
Kerby, Dave S. 2014. “The Simple Difference Formula: An Approach to Teaching Nonparametric Correlation.” Comprehensive Psychology 3 (January): 11.IT.3.1.