## Overview

SIMMS enables intergration of molecular profiles with functional networks such as protein-protein interaction networks. This document shows how to use SIMMS from a simple use case of integrating mRNA abundance data with pathways derived protein-protein networks, through to more sophisticated use cases of integrating multiple molecular profiles such as mRNA abundance, DNA copy-number and DNA-methylation data. Note, SIMMS requires patient outcome data as dependent variable. In current implementation, it works with survival data Survival time, Survival status. Please setup the dataset metadata directories before using SIMMS.

There are a couple of directories that one needs to setup and ensure the contents are in correct format. These directories contain:

• Networks database (optional)
• Molecular profiles and clinical data

The structure and format of these inputs is described below.

## Networks database

SIMMS rely on functionally or computationally derived networks in order to meaningfully integrate molecular profiles. By default, SIMMS package has a built in database of pathway derived protein-protein interaction networks extracted from (a dated version) Reactome, BioCarta and NCI-PID. This database can be accessed through various SIMMS functions using the parameter setting networks.database = "default". If you would like to create your own networks database for subsequent use with SIMMS, please note that the networks database is organised in two files:

pathway_based_sub_networks.txt
pathway_based_networks_flattened.txt

File pathway_based_sub_networks.txt contains all the subnetworks. For instance a hypothetical subnetwork of five genes ERBB2, EGFR, MKI67, ESR1, PGR with four interactions (PGR-ESR1, EGFR-ERBB2, EGFR-PGR, MKI67-ESR1), and another hypothetical subnetwork of three genes PDK1, AKT3 and PIK3CA with two interactions (PDK1-PIK3CA, AKT3-PIK3CA) using Entrez Gene IDs would look like (tab-delimited):

#ID=0001    NAME=Module.1
ID=0001 5241    2099
ID=0001 1956    2064
ID=0001 1956    5241
ID=0001 4288    2099
#ID=0002    NAME=Module.2
ID=0002 5163    5290
ID=0002 10000   5290

Please note that the only Entrez IDs are supported in this file, which should match the Entrez IDs format in the molecular data files (except additional _at suffix in the molecular data files).

File pathway_based_networks_flattened.txt is a non-redundant, without self-interactions, quick lookup table of all the pairwise interactions present in the file pathway_based_sub_networks.txt. This should not contain the subnetwork name and ID annotations. For example, the contents of this file for the above-mentioned subnetworks would look like (tab-delimited):

GeneID1 GeneID2
5241    2099
1956    2064
1956    5241
4288    2099
5163    5290
10000   5290    

Please note that only Entrez IDs are supported in this file, which should match the Entrez IDs format in the molecular data files (except additional _at suffix in the molecular data files). If you have a gene list without known interactions, and would like to try Node-only model (model 2), you can create a subnetwork with random interactions in pathway_based_sub_networks.txt as long as all the features (genes) are present and connected. Same goes for pathway_based_networks_flattened.txt.

The two database files are placed inside a directory you wish to call your database. For example: MyNetworkDB, and this directory should be placed under:

SIMMS/inst/programdata/networkdb/

You would need to build and install SIMMS package again and your networks database should be available to use through relevant SIMMS’ functions with parameter setting networks.database = "MyNetworkDB"

## Molecular profiles and clinical data

The main input to SIMMS is molecular and clinical data. Currently tested datatypes are mRNA abundance, DNA copy-number and DNA methylation datasets. mRNA abundance and DNA methylation profiles are expected to be in continuous scale, while DNA copy-number profiles are expected to be gene copy-number calls (-2,-1,0,1,2) or (-1,0,1) representing deletions (-ve), neutral (0), and gains/amplifications (+ve) calls. Given that SIMMS support both continuous and categorical/ordinal profiles, any datatype can be used as input. The table below shows the supported/unsupported datatypes and respective examples:

Data type Example data Example profile
{ x ∈ ℝ : x ≥ 0 } 0, 2, 1.37, 7.04, 9.68 log2 mRNA or miRNA abundance
{ x ∈ ℝ : x ≥ 0 } 0.1, 0.38, 0.78, 0.22, 0.98 DNA methylation beta values
{ x ∈ ℤ } -2, -1, 0, 1, 2 copy-number calls. Reference group = 0 (Neutral/Diploid)
{ x ∈ ℤ : x ≥ 0 } 0, 1, 2, 3 mutation data. Reference group = 0 (Wildtype)
{ x ∈ ℤ : x ≠ 0 } -2, -1, 1, 2 Unsupported due to missing reference group 0
{ x ∉ ℝ } WT, Mutant, Gain, Deleted Unsupported due to alphabets

Molecular profiles are strictly tab-delimited, and are expected to be in feature (gene) by sample (patient) matrices. Genes should be represented using Entrez IDs followed by _at suffix. For mRNA abundance data, if you have pre-processed your data with BrainArray Entrez CDFs Entrez CDF, your data should be SIMMS compatible by default. Otherwise, please map your dataset features e.g genes to Entrez IDs followed by _at suffix (e.g 5290_at).

Clinical profiles are strictly tab-delimited, sample (patient) by annotation matrices. The annotation columns must contain survival columns Survival time, Survival status, Survival time unit. The row names should match the column names in the molecular profiles’ matrices. These columns in clinical annotation file could be called anything as long as these are correctly identified through the metadata file described below.

Both molecular and clinical annotations are read by SIMMS through a tab-delimited metadata file called datasets.txt. An example of the contents of this file are shown below:

dataset mRNA    cna methylation annotations survstat    survtime    survtime.unit
Breastdata1 mRNA_abundance.txt  CNA.txt DNA_methylation.txt patient_annotation.txt  e.os    t.os    t.os.unit
Breastdata2 mRNA_abundance.txt  CNA.txt DNA_methylation.txt patient_annotation.txt  e.os    t.os    t.os.unit

Here, each row contain an entry for a dataset e.g Breastdata1 and Breastdata2, each having its own directory on the filesystem. mRNA, cna and methylation columns contain the file names of each of the these datatypes. annotations column contains the names of the annotation files for each dataset. All the datatypes and annotation files for a given dataset must be inside the dataset directory (e.g Breastdata1/). survstat, survtime, and survtime.unit columns contain the column names of Survival status, Survival time and Survival time unit, respectively. These column names are expected to match the column names in the annotations files. annotation data file must also have a column Tumour with possible values of Yes|No. All analyses will be limited to those samples with Yes in the Tumour column. Further subsetting of molecular and annotation datasets is enable using parameter subset in SIMMS’ functions ?derive.network.features & ?prepare.training.validation.datasets. For convenience sake, an example test dataset testdata containing metadata file datasets.txt, mRNA abundance profiles and respective clinical data is bundled with SIMMS package, and can be found under:

SIMMS/inst/programdata/testdata/

There is no need to drop your datasets inside the package, or need to rebuild the package. You can just point to your datasets through SIMMS package keeping them anywhere on your filesystem.

## Example R-script

Lets start with a simple case. We have two mRNA abundance datasets; Breastdata1, Breastdata2. We would like to identify prognostic biomarkers using Breastdata1 and validate on Breastdata2. Assuming these two datasets are setup correctly (as described in the previous section), a typical workflow would be:

options("warn" = -1);

library("SIMMS");

# path of the data directory containing Breastdata1/ and Breastdata2/ subdirectories
data.directory <- SIMMS::get.program.defaults(networks.database = "test")[["test.data.dir"]];

# path of the directory where results will be stored
output.directory <- tempdir();

# molecular profiles to be used
data.types <- c("mRNA");

# feature selection datasets
# name of the dataset directory containing mRNA abundance and annotation profiles of training dataset/s
feature.selection.datasets <- c("Breastdata1");

# model training datasets, ideally same as feature selection datasets
# name of the dataset directory containing mRNA abundance and annotation profiles of training dataset/s
training.datasets <- feature.selection.datasets;

# validation datasets
# name of the dataset directory containing mRNA abundance and annotation profiles of validation dataset/s
validation.datasets <- c("Breastdata2");

# all the possible P value thresholds that one may consider applying to feature selection process.
# its the P value of univariate (genewise) Cox model statistics
feature.selection.p.thresholds <- c(0.5);

# one of the P values above, to be used for subsequent analysis. Not a vector for performance reasons
feature.selection.p.threshold <- 0.5;

# names of the learning algorithms to be used for the final multivarite model
learning.algorithms <- c("backward", "forward", "glm", "randomforest");

# top features to be used for model selection (Backwards elimination, Forward selection, GLM, Random survival forest)
# you can try a number of different model selection runs by specifying a vector of top n features
top.n.features <- c(5);

# truncate survival
truncate.survival <- 10;

# calculate per feature univariate coefficients in training sets
derive.network.features(
data.directory = data.directory,
output.directory = output.directory,
data.types = data.types,
feature.selection.datasets = feature.selection.datasets,
feature.selection.p.thresholds = feature.selection.p.thresholds,
networks.database = "test", # or "default" for Reactome/BioCarta/NCI-PID
truncate.survival = truncate.survival
);

# calculate per-subnetwork scores in both training and validation sets
prepare.training.validation.datasets(
data.directory = data.directory,
output.directory = output.directory,
data.types = data.types,
p.threshold = feature.selection.p.threshold,
feature.selection.datasets = feature.selection.datasets,
datasets = c(training.datasets, validation.datasets),
networks.database = "test", # or "default" for Reactome/BioCarta/NCI-PID
truncate.survival = truncate.survival
);

# iterate over varying top n features, identify and validate survival models
for (top.n in top.n.features) {

# create classifier assessing univariate prognostic power of subnetwork modules (Train and Validate)
ret <- create.classifier.univariate(
data.directory = data.directory,
output.directory = output.directory,
feature.selection.datasets = feature.selection.datasets,
feature.selection.p.threshold = feature.selection.p.threshold,
training.datasets = training.datasets,
validation.datasets = validation.datasets,
top.n.features = top.n
);

# create a multivariate classifier (Train and Validate)
ret <- create.classifier.multivariate(
data.directory = data.directory,
output.directory = output.directory,
feature.selection.datasets = feature.selection.datasets,
feature.selection.p.threshold = feature.selection.p.threshold,
training.datasets = training.datasets,
validation.datasets = validation.datasets,
learning.algorithms = learning.algorithms,
top.n.features = top.n
);

# perform Kaplan-Meier analysis and generate plots
create.survivalplots(
data.directory = data.directory,
output.directory = output.directory,
training.datasets = training.datasets,
validation.datasets = validation.datasets,
top.n.features = top.n,
learning.algorithms = learning.algorithms,
truncate.survival = truncate.survival,
survtime.cutoffs = c(5),
main.title = FALSE,
KM.plotting.fun = "create.KM.plot",
resolution = 100
);
}

Please note that a number of parameters such as output.directory, training.datasets and validations.datasets are repeatedly passed in various methods. This may look mindless, however, this is because none of the methods return objects that are passed over to the next method/s. The rationale behind this was to keep the memory footprint to bare-minimum by avoiding large R objects passed around. This particular feature also facilitates restarting from any step after deliberate or accidental closure of R environment as everything can be read again from the filesystem. The only compromise is the data footprint on the filesystem.

The above R-script will generate two sub-directories under the output.directory path:

output/ and graphs/

## Output

SIMMS creates two output directories; output/ and graphs/. The contents of these directories are described below:

### output/

coxph_nodes__$training.datasets__datatype_$data.types.txt coxph_edges_coef__$training.datasets__datatype_$data.types.txt coxph_edges_P__$training.datasets__datatype_$data.types.txt

Contains per feature (nodes) univariate Cox proportional hazards model results, and per interaction (gene-gene edge) Cox proportional hazards model results

top_subnets_score__TRAINING_$training.datasets__model_1,2,3__PV_$feature.selection.p.thresholds.txt

Contains per subnetwork scores and is used for subsequent feature selection. Models 1, 2 and 3 refers to Node+Interaction, Node-only and Interaction-only models. Nodel-only (Model-2) is generally the most interpretable and robust model

riskscores_uv__all_training,all_validation__TRAINING_$training.datasets__model_1,2,3__top_$top.n.txt

Contains per sample risk scores for each subnetwork, along with Survival time and Survival status. Sample by risk score matrix. First two columns contain survival data

riskgroups_uv__all_training,all_validation__TRAINING_$training.datasets__model_1,2,3__top_$top.n.txt

Contains per sample risk groups for each subnetwork, along with Survival time and Survival status. Sample by risk group matrix. First two columns contain survival data. Risk groups are median-dichotomised (training set) risk scores

riskscores__all_training,all_validation__TRAINING_$training.datasets__backward,forward,glm,randomforest__model_1,2,3__top_$top.n.txt

Contains per sample risk scores estimated by the multivariate Cox proportional hazards model, along with Survival time and Survival status

riskgroups__all_training,all_validation__TRAINING_$training.datasets__backward,forward,glm,randomforest__model_1,2,3__top_$top.n.txt

Contains per sample risk group (along with Survival time and Survival status) generated by median dichotomising risk scores (training set) estimated through the multivariate Cox proportional hazards model

coxph__all_training,all_validation__TRAINING_$training.datasets__backward,forward,glm,randomforest__model_1,2,3__top_$top.n.txt

Contains univariate Cox proportional hazards model results with risk group (derived from multivariate Cox model) as the explanatory variable

beta__TRAINING_$training.datasets__backward,forward__model_1,2,3__top_$top.n.txt

Contains the fitted coefficients (betas) of the final model following backward elimination and forward selection (separately)

beta__TRAINING_$training.datasets__glm__model_1,2,3__top_$top.n.txt

Contains the fitted coefficients (betas) of the final model following GLMs (LASSO, Ridge or Elastic Nets) driven feature selection

vimp__TRAINING_$training.datasets__randomforest__model_1,2,3__top_$top.n.txt

Contains the variable importance of random forest.

OOB_error__TRAINING_$training.datasets__randomforest__model_1,2,3__top_$top.n.txt

Contains OOB error rate against number of trees to help identify stablisation point for ‘rf.ntree’ parameter

### graphs/

KM_uv__all_training,all_validation__TRAINING_$training.datasets__model_1,2,3__SubnetworkName__truncated_$truncate.survival.png

Kaplan-Meier analysis of risk groups generated for each subnetwork through univariate Cox proportional hazards model. These will be generated if parameter plot.univariate.data was set to TRUE in create.survivalplots() as default value is set to FALSE.

KM__all_training,all_validation__TRAINING_$traning.datasets__backward,forward,glm,randomforest__model_1,2,3__top_$top.n__truncated_\$truncate.survival.png

Kaplan-Meier analysis of risk groups generated through multivariate Cox proportional hazards model