# Background

Metabolomic studies involve the measurement of spectra from biological samples collected by researchers from their study cohort and the analysis of the resulting data. Often the goal of such studies is to identify metabolites which exhibit different behaviour in samples belonging to different groups in the study cohort. In this way they may act as an identifier of the particular feature being studied. Studies must utilise an adequate sample size such that the study achieves an appropriate statistical power to validate its conclusions. Clearly this sample size must be determined before commencing the study. In many biological fields an optimal sample size is determined by performing a pilot study, however, this is not always possible for metabolomic studies due to factors such as experiment cost, availability of resources, and availability of subjects. MetSizeR (Nyamundanda et al. 2013) offers a tool to estimate the sample size needed for an experiment to achieve a desired statistical power, without the use of pilot data or historical data.

MetSizeR operates on the idea of an analysis informed approach, meaning that the sample size should be estimated for an experiment based on the method of analysis the researcher plans to use. There are currently two analysis methods supported by MetSizeR; probabilistic principal components analysis (PPCA), originally developed by Tipping and Bishop (1999), and probabilistic principal components and covariates analysis (PPCCA), developed by Nyamundanda et al. (2010). The application operates by using simulated data in the place of pilot data to perform its estimation. The data are simulated based on the analysis method the researcher plans to use, and the sample size which yields the desired false discovery rate (FDR) for the study is returned to the user.

Full details of the MetSizeR algorithm can be found in Nyamundanda et al. (2013).

A Shiny application is used as a graphic user interface (GUI) for the application, to encourage use in the field without the need for experience using R.

# Package Functionality

The MetSizeR package provides a R Shiny application which allows users to estimate the sample size required for their study to achieve a desired statistical power. The tool is built to estimate the sample size required for both targeted and untargeted metabolomic experiments. Sample size estimation is performed without the need for pilot data, however, if pilot data are available then these can be uploaded to the application to aid in the estimations. Several inputs are required by the user; it is expected that these inputs should be informed by the user’s own knowledge of their field. Additionally, where pilot data are present it is expected that these data have been sourced and treated appropriately in the context of the user’s research.

# Installation Instructions

To install the MetSizeR package, the user should have the R software environment installed on their machine. To download, visit the website for The R Project for Statistical Computing and follow the instructions to download and install.

When in R, the following command will allow the user to install the MetSizeR package:

install.packages("MetSizeR")

Package dependencies will also be installed if they are not already installed on the user’s computer. These consist of the following:

• dplyr
• ggplot2
• MetabolAnalyze
• Rfast
• shiny
• shinythemes
• stats
• tools
• utils
• vroom

Alternatively, MetSizeR can be downloaded and installed manually by the user if they so prefer by visiting its entry on The Comprehensive R Archive Network (CRAN).

Once installed, the MetSizeR package can be loaded into the user’s current R session by the following command:

library(MetSizeR)

The MetSizeR R package is designed to perform all functionality through the associated R Shiny application. The Shiny application can be launched by running the following command:

MetSizeR()

The application will then launch, either in the user’s external browser, a pop-up window, or in the viewer pane of the user’s R IDE. This depends on how the user has configured their own settings.

Once the application has launched, the initial landing page is an About page which contains information about the package, its functionality, and references for the methods used within. Navigation through the application occurs through the navigation bar at the top of the page, with options About, Sample Size Estimation, and Vary Proportion of Significant Spectral Bins leading to the about page, the page on which to perform sample size estimation for a single experiment, and the page on which to estimate the sample size for varying proportions of significant bins or metabolites respectively.

# Sample Size Estimation

The core functionality of MetSizeR is the ability to estimate the sample size required for a metabolomic experiment to achieve a desired statistical power. MetSizeR can perform this estimation for targeted or untargeted experiments, both for cases when pilot data are available or unavailable. Sample size estimation takes place based on the intended method of analysis in the experiment; at present the tool is available for PPCA or PPCCA. To navigate to the sample size estimation tool, select Sample Size Estimation on the navigation bar at the top of the page.

## Sample Size Estimation in the Absence of Pilot Data

MetSizeR can estimate the sample size required for a study to achieve a desired statistical power even when pilot data are unavailable. In this case, the tool can be used for both targeted and untargeted analysis. The user must specify several parameters for the algorithm to use, given as follows. These are specified by selecting or inputting options on the sidebar located on the left side of the page.

### Inputs

• The intended type of analysis, either targeted or untargeted, which is selected by clicking the relevant button at the top of the input section.
• Whether pilot data are available. The default is unavailable, indicated by the blank Are experimental pilot data available? checkbox.
• The number of spectral bins (untargeted analysis) or the number of metabolites (targeted analysis). This is the number of variables, either bins or metabolites, which will be generated by the algorithm for each sample. This number should be typed in the box available, however, up and down arrows on the right side of the box can be use to adjust the number entered. For targeted analysis, metabolite numbers from 20 to 1000 are accepted, and for untargeted analysis, bin numbers from 50 to 3000 are accepted. Please note that the algorithm make take several minutes to run for larger numbers of bins.
• The proportion of significant bins (untargeted analysis) or the proportion of significant metabolites (targeted analysis). This is the proportion of the bins or metabolites present which the user expects should be statistically significant, and is generally expected to be less than 0.5. In the case where this proportion is unknown, there is a feature of the application which allows simultaneous estimations to be made for several proportions at once, which will be outlined later in this handbook. The proportion can be typed into the box available, with arrows on the right side of the box allowing adjustments.
• The model to be used in analysis, either PPCA or PPCCA, which must be selected by clicking the relevant button.
• If PPCCA is the method of choice, two other options will appear, asking the user to specify the number of numeric and categorical covariates. If any categorical covariates are expected then the total number of levels present for all categorical covariates must also be input. These numbers can be typed into the relevant boxes, or adjusted using the arrows present on the right side of the boxes. Inputs containing zero to five numeric and zero to five categorical covariates are accepted. The maximum number of levels accepted is 50, with a minimum of two levels required.
• The target FDR, which is the FDR the researcher desires for their experiment. This is equivalent to $$1-power$$ of the study. This can be entered into the relevant box, either by typing or adjusting the value using the arrows.
• The sample size per group. The algorithm is based on an experimental set-up whereby samples originate from two groups to be compared. The minimum sample size to test for each group must be specified by the user. A minimum value of three samples per group is required. Please note that the ratio of samples per group will be fixed at the ratio given.

Once the input values have been entered to the user’s specifications, the Estimate Optimal Sample Size button at the bottom of the sidebar must be clicked to start the MetSizeR process. Note that the process may take several minutes for larger numbers of bins. The user will know the algorithm is running as a notification will appear on the bottom right of the screen.

When results are ready they will appear on the right side of the screen. A plot of FDR versus sample size will appear, showing the 90th, 50th, and 10th percentiles of the FDR values calculated for each sample size tested. The estimated optimal sample size will be indicated by a blue vertical line, and also in text in the legend of the plot. The target FDR is given by a black dotted line on the plot.

Below the plot, results will be displayed in text. The estimated optimal sample size will be printed, along with the per-group breakdown of the sample sizes, in the same ratio as input by the researcher. There is an option to download the plot, by clicking the Download Plot button. The plot will then download to the location of choice on the user’s computer. There is also an option to view the exact values from the points on the plot by clicking the Show values from plot? checkbox. This opens a table showing the values of the 90th, 50th, and 10th percentiles of the FDR values calculated alongside the relevant sample size. The user can download these data as a CSV file by clicking the Download Plot Data as .csv button, which will download the data to the location of the user’s choosing.

If the user wishes to change any of the inputs and re-run the algorithm, they can simply change the relevant inputs and click Estimate Optimal Sample Size once more, and the results will update. The user will know the algorithm is running as a notification will appear on the bottom right of the screen.

## Sample Size Estimation Using Pilot Data

If pilot data are available then MetSizeR can use this to aid its estimation of the optimal sample size required for the study to achieve a desired level of power. To estimate the sample size in this manner, navigate to the Sample Size Estimation page on the navigation bar at the top of the page. This page contains a sidebar on the left side which allows the user to input specifications of their planned study as well as upload pilot data. The following specifications are required from the user:

### Inputs

• The intended type of analysis, either targeted or untargeted, which is selected by clicking the relevant button at the top of the input section.
• Whether pilot data are available. To upload pilot data, select the Are experimental pilot data available? checkbox. This will open several options to the user. Firstly, the user must specify if their data file contains a header as the first row, if so click the Does data contain a header? checkbox. The option to upload data as a CSV file is then given. To select the data file, click Browse, navigate to the file’s location and select the file. A blue confirmation bar should appear under the file upload section, saying Upload complete, and the name of the file should appear in the box beside the Browse button.
• Data file specifications: Note that the uploaded data must be correctly formatted. The data should be in CSV format and the file should not exceed 5MB in size. If covariate data are included, this must be given in the first few columns before any spectral data. Categorical covariates should be included before numeric covariates, if applicable. For example, if a data set has one categorical covariate, two numeric covariates, and 150 spectral bins, then column one should be the categorical covariate, columns two and three the numeric covariates, and column four to 153 should contain the spectral data.
• The proportion of significant bins (untargeted analysis) or the proportion of significant metabolites (targeted analysis). This is the proportion of the bins or metabolites present which the user expects should be statistically significant, and is generally expected to be less than 0.5. The proportion can be typed into the box available, with arrows on the right side of the box allowing adjustments.
• The model to be used in analysis, either PPCA or PPCCA, which must be selected by clicking the relevant button.
• If PPCCA is the method of choice, two other options will appear, asking the user to specify the number of numeric and categorical covariates. These numbers can be typed into the relevant boxes, or adjusted using the arrows present on the right hand side of the box. Inputs containing zero to five numeric and zero to five categorical covariates are accepted. The maximum number of levels accepted is 50, with a minimum of two levels required. Note that when pilot data are present there is no option to specify the number of levels present, as this will be gleaned from the pilot data uploaded.
• The target FDR, which is the FDR the researcher desires for their experiment. This is equivalent to $$1-power$$ of the study. This can be entered into the relevant box, either by typing or adjusting the value using the arrows.
• The sample size per group. The algorithm is based on an experimental set-up whereby samples originate from two groups to be compared. The minimum sample size to test for each group must be specified by the user. A minimum value of three samples per group is required. Please note that the ratio of samples per group will be locked in’ at the ratio given.

Note that the number of spectral bins (untargeted analysis) or metabolites (targeted analysis) does not need to be specified by the user when pilot data are present as this is read from the uploaded file. When the input values are correctly specified, the Estimate Optimal Sample Size button at the bottom of the sidebar must be clicked to start the MetSizeR process. Note that the process may take several minutes for larger data. The user will know the algorithm is running as a notification will appear on the bottom right of the screen.

When results are ready they will appear on the right side of the screen. A plot of FDR versus sample size will appear, showing the 90th, 50th, and 10th percentiles of the FDR values calculated for each sample size tested. The estimated optimal sample size will be indicated by a blue vertical line, and also in text in the legend of the plot. The target FDR is given by a black dotted line on the plot.

Below the plot, results will be displayed in text. The estimated optimal sample size will be printed, along with the per-group breakdown of the sample sizes, in the same ratio as input by the researcher. There is an option to download the plot, by clicking the Download Plot button. The plot will then download to the location of choice on the user’s computer. There is also an option to view the exact values from the points on the plot by clicking the Show values from plot? checkbox. This opens a table showing the values of the 90th, 50th, and 10th percentiles of the FDR values calculated alongside the relevant sample size. The user can download these data as a CSV file by clicking the Download Plot Data as .csv button, which will download the data to the location of the user’s choosing.

If the user wishes to change any of the inputs and re-run the algorithm, they can simply change the relevant inputs and click Estimate Optimal Sample Size once more, and the results will update. The user will know the algorithm is running as a notification will appear on the bottom right of the screen.

## Varying the Proportion of Significant Metabolites/Bins

If the user wishes to test different proportions of significant bins to find the optimal sample size, they can navigate to the Vary Proportion of Significant Spectral Bins tab on the navigation bar at the top of the page. Here, up to four different significance proportions can be tested for the same experimental design.

First the user should specify whether they intend to perform targeted or untargeted analysis by selecting the applicable option at the top of the sidebar on the left of the page. There are then boxes on the sidebar where the user can enter up to four proportions. One proportion should be entered in each box. If less than four proportions are required, zero should be entered in the remaining boxes. The values can be input by either typing the proportion as a decimal, or using the arrows on the right of the boxes to change the values.

The user must then also choose the number of spectral bins (untargeted analysis) or metabolites (targeted analysis) to test by typing the desired number into the relevant box or using the arrows. They must then select the desired model to use, either PPCA or PPCCA, by selecting the button for their chosen model. If PPCCA is selected, the number of numeric and categorical covariates must be specified, with the number of levels of any categorical variables entered also. The target FDR must then be specified, by typing in the desired value or using the arrows to select. Once the inputs are entered to the user’s specifications, the Run MetSizeR for Varied Proportions button should be clicked to start the process. Note that this process can take some time, especially for larger numbers of bins. A notification will appear on the bottom right of the screen, indicating that the algorithm is running. This process will usually be longer than the single sample size calculation performed on the Sample Size Estimation` tab, as there are more calculations to be performed.

When results are ready, one plot for each of the specified proportions will appear on the right side of the screen, along with a statement of their respective proportions. A download button is available for each plot which, when clicked, will allow the user to download the plot as a PNG file to the location of their choosing.

# References

Nyamundanda, G., Brennan, L., & Gormley, I. C. (2010). Probabilistic principal component analysis for metabolomic data. BMC Bioinformatics, 11(1), 571. https://doi.org/10.1186/1471-2105-11-571

Nyamundanda, G., Gormley, I. C., Fan, Y., Gallagher, W. M., & Brennan, L. (2013). MetSizeR: Selecting the optimal sample size for metabolomic studies using an analysis based approach. BMC Bioinformatics, 14(1), 338. https://doi.org/10.1186/1471-2105-14-338

Tipping, M. E., & Bishop, C. M. (1999). Probabilistic Principal Component Analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3), 611–622. https://doi.org/10.1111/1467-9868.00196