Analyzing Proteomics UPS1 Spike-in Experiments (Example Ramus 2016 Dataset)

Wolfgang Raffelsberger

2024-07-26

Introduction

This vignette complements the more basic vignette ‘Getting started with wrProteo’ also from this package (wrProteo) and shows in more detail how UPS1_spike-in_ experiments may be analyzed, using this package (wrProteo).

Furthermore, wrMisc, wrGraph and RColorBrewer from CRAN as well as the Bioconductor package limma (for it’s moderated statistical testing) will be used internally.

So, to get started on a fresh session of R, you might have to install the following packages:

## This is R code, you can run this to redo all analysis presented here.
install.packages("wrMisc")
## These packages are used for the graphics
install.packages("wrGraph")
install.packages("RColorBrewer")
if(!requireNamespace("knitr", quietly=TRUE)) install.packages("knitr")

## Installation of limma from Bioconductor
if(!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager")
BiocManager::install("limma")

## now all dependecies are installed...
install.packages("wrProteo")

## You cat also see all vignettes for this package by typing :
browseVignettes("wrProteo")    #  ... and the select the html output

As you will see in the interactive window from browseVignettes(), this package has 2 vignettes, a more general introductory vignette (mentioned above) and this UPS1 dedicated vignette.

Now let’s load the packages needed :

## Let's assume this is a fresh R-session
library(knitr)
library(wrMisc)
library(wrGraph)
library(wrProteo)

# Version number for wrProteo :
packageVersion("wrProteo")
#> [1] '1.12.0'

Experimental Setup For Benchmark Tests

The main aim of the experimental setup using heterologous spike-in experiments is to provide a framework to test identification and quantitation procedures in proteomics. The overall idea is based on providing samples where the amount of a few proteins vary in a very controlled way, ie it is exactely known in advance which proteins vary how much. The easiest way to obtain samples consists in taking of advantage that proteins from different species vary many times enough so that mass spectrometry experiments can distinguish the species origin.

The exeriment reanalyzed here used a base (‘matrix’) of yeast protein extract constant in all samples. Then, varying amounts of a commerical collection of 48 purified human proteins were added in different well documented amounts to the constant yeast protein extract. For this purpose the UPS1 preparation, commerically available from Sigma-Aldrich, is frequently used.

In terms of ROC curves (see also ROC on Wikipedia) the spike-in proteins are expected to show up as true positives (TP). In contrast, since all yeast proteins were added in the same quantity to the same samples, they should be observed as constant, ie as true negatives (TN) when looking for proteins changing abundance.

The specific dataset used here (seen also next section Ramus Data Set) is not so recent and better performing mass spectrometers have gotten availabale in the meantime. Thus, for addressing scientific questions concerning comparison and choice of quantification software it is suggetsed to perform similar comparisons on more recent datasets. The main aim of this vignette is to show the possibilities of how such comparisons can be performed using wrProteo.

The Ramus Data-Set

The data used in this vignette was published with the article : Ramus et al 2016 “Benchmarking quantitative label-free LC-MS data processing workflows using a complex spiked proteomic standard dataset” in J Proteomics 2016 Jan 30;132:51-62.

This dataset is available on PRIDE as PXD001819 (and on ProteomeXchange).

Briefly, this experiment aims to evaluate and compare various quantification appoaches of the heterologous spike-in UPS1 (available from Sigma-Aldrich) in yeast protein extracts as constant matrix. 9 different concentrations of the heterologous spike-in (UPS1) were run in triplicates. The proteins were initially digested by Trypsin and then analyzed by LC-MS/MS in DDA mode.

As described in more detail in the reference, this dataset was generated using a LTQ-Orbitrap, in the meantime more powerful and precises mass-spectrometers have become avialable. Thus, scientific questions about the comparison and choice of quantification software may be better addressed using more recent datasets.

Meta-Data Describing The Experiment (sdrf)

The project Proteomics Sample Metadata Format aims to provide a framework of providing a uniform format for documenting experimental meta-data (sdrf-format). The meta-data for experiments already integrated can be directly read/accessed from wrProteo.

Either you download the meta-data as file ‘sdrf.tsv’ from Pride/PXD001819, or you may read file ‘PXD001819.sdrf.tsv’ directly from github/bigbio.

## Read meta-data from  github.com/bigbio/proteomics-metadata-standard/
pxd001819meta <- readSdrf("PXD001819")
#> readSdrf : Successfully read 24 annotation columns for 27 samples

## The concentration of the UPS1 spike-in proteins in the samples
if(length(pxd001819meta) >0) {
  UPSconc <- sort(unique(as.numeric(wrMisc::trimRedundText(pxd001819meta$characteristics.spiked.compound.))))  # trim to get to 'essential' info
} else {
  UPSconc <- c(50, 125, 250, 500, 2500, 5000, 12500, 25000, 50000)       # in case access to github failed
}

The import function used furtheron in this vignette can directly download this metadata if the PXD-accession-number is provided.

Key Elements And Additional Functions

## A few elements and functions we'll need lateron
methNa <- c("ProteomeDiscoverer","MaxQuant","Proline")
names(methNa) <- c("PD","MQ","PL")

## The accession numbers for the UPS1 proteins
UPS1 <- data.frame(ac=c("P00915", "P00918", "P01031", "P69905", "P68871", "P41159", "P02768", "P62988",
  "P04040", "P00167", "P01133", "P02144", "P15559", "P62937", "Q06830", "P63165",
  "P00709", "P06732", "P12081", "P61626", "Q15843", "P02753", "P16083", "P63279",
  "P01008", "P61769", "P55957", "O76070", "P08263", "P01344", "P01127", "P10599",
  "P99999", "P06396", "P09211", "P01112", "P01579", "P02787", "O00762", "P51965",
  "P08758", "P02741", "P05413", "P10145", "P02788", "P10636-8", "P00441", "P01375"),
  species=rep("Homo sapiens", 48),
  name=NA)
## additional functions
replSpecType <- function(x, annCol="SpecType", replBy=cbind(old=c("mainSpe","species2"), new=c("Yeast","UPS1")), silent=TRUE) {
  ## rename $annot[,"SpecType"] to more specific names
  chCol <- annCol[1] %in% colnames(x$annot)
  if(chCol) { chCol <- which(colnames(x$annot)==annCol[1])
    chIt <- replBy[,1] %in% unique(x$annot[,chCol])    # check items to replace if present
    if(any(chIt)) for(i in which(chIt)) {useLi <- which(x$annot[,chCol] %in% replBy[i,1]); cat("useLi",head(useLi),"\n"); x$annot[useLi,chCol] <- replBy[i,2]}
  } else if(!silent) message(" replSpecType: 'annCol' not found in x$annot !")
  x }

plotConcHist <- function(mat, ref, refColumn=3:4, matCluNa="cluNo", lev=NULL, ylab=NULL, tit=NULL) {
  ## plot histogram like counts of UPS1 concentrations
  if(is.null(tit)) tit <- "Frequency of UPS1 Concentrations Appearing in Cluster"
  gr <- unique(mat[,matCluNa])
  ref <- ref[,refColumn]
  if(length(lev) <2) lev <- sort(unique(as.numeric(as.matrix(ref))))
  if(length(ylab) !=1) ylab <- "Frequency"
  tbl <- table(factor( as.numeric(ref[which(rownames(ref) %in% rownames(mat)),]), levels=lev))
  graphics::barplot(tbl, las=1, beside=TRUE, main=paste(tit,gr), col=grDevices::gray(0.8), ylab=ylab)
}

plotMultRegrPar <- function(dat, methInd, tit=NULL, useColumn=c("logp","slope","medAbund","startFr"), lineGuide=list(v=c(-12,-10),h=c(0.7,0.75),col="grey"), xlim=NULL,ylim=NULL,subTit=NULL) {
  ## scatter plot logp (x) vs slope (y) for all UPS proteins, symbol by useColumn[4], color by hist of useColumn[3]
  ## dat (array) UPS1 data
  ## useColumn (character) 1st as 'logp', 2nd as 'slope', 3rd as median abundance, 4th as starting best regression from this point
  fxNa <- "plotMultRegrPar"
   #fxNa <- wrMisc::.composeCallName(callFrom,newNa="plotMultRegrPar")
  if(length(dim(dat)) !=3) stop("invalid input, expecting as 'dat' array with 3 dimensions (proteins,Softw,regrPar)")
  if(any(length(methInd) >1, methInd > dim(dat)[2], !is.numeric(methInd))) stop("invalid 'methInd'")
  chCol <- useColumn %in% dimnames(dat)[[3]]
  if(any(!chCol)) stop("argument 'useColumn' does not fit to 3rd dim dimnames of 'dat'")
  useCol <- colorAccording2(dat[,methInd,useColumn[3]], gradTy="rainbow", revCol=TRUE, nEndOmit=14)
  graphics::plot(dat[,methInd,useColumn[1:2]], main=tit, type="n",xlim=xlim,ylim=ylim)   #col=1, bg.col=useCol, pch=20+lmPDsum[,"startFr"],
  graphics::points(dat[,methInd,useColumn[1:2]], col=1, bg=useCol, pch=20+dat[,methInd,useColumn[4]],)
  graphics::legend("topright",paste("best starting from ",1:5), text.col=1, pch=21:25, col=1, pt.bg="white", cex=0.9, xjust=0.5, yjust=0.5)
  if(length(subTit)==1) graphics::mtext(subTit,cex=0.9)
  if(is.list(lineGuide) & length(lineGuide) >0) {if(length(lineGuide$v) >0) graphics::abline(v=lineGuide$v,lty=2,col=lineGuide$col)
    if(length(lineGuide$h) >0) graphics::abline(h=lineGuide$h,lty=2,col=lineGuide$col)}
  hi1 <- graphics::hist(dat[,methInd,useColumn[3]], plot=FALSE)
  wrGraph::legendHist(sort(dat[,methInd,useColumn[3]]), colRamp=useCol[order(dat[,methInd,useColumn[3]])][cumsum(hi1$counts)],
    cex=0.5, location="bottomleft", legTit="median raw abundance")  #
}

Protein Identification and Initial Quantification

Multiple algorithms and software implementations have been developed for quantitation label-free proteomics experiments, in particular for extracted ion chromatograms (XIC). For background information you may look at Wikipedia labell-free Proteomics. Here, the use of the output for 3 such implementations for extracting peptide/protein quantifications is shown. These 3 software implementations were run individually using equivalent settings, ie identifcation based on the same fasta-database, starting at a single peptide with 1% FDR, MS mass tolerance for ion precursors at 0.7 ppm, oxidation of methionins and N-terminal acetylation as fixed as well as carbamidomethylation of cysteins as variable modifications.

Since in this context it is crucial to recognize all UPS1 proteins as such (see also this data-set), the import-functions make use of the specPref argument, allowing to define custom tags. Most additional arguments to the various import-functions have been kept common for conventient use and for generating output structured the same way. Indeed, simply separating proteins by their species origin is not sufficient since common contaminants like human Keratin might get considered by error as UPS1.

MaxQuant

MaxQuant is free software provided by the Max-Planck-Institute, see also Tyanova et al 2016. Later in this document data from MaxQuant will by frequently abbreviated as MQ.

Typically MaxQuant exports quantitation data on level of consensus-proteins by default to a folder called txt with a file called “proteinGroups.txt” . So in a standard case (when the file name has not been changed manually) it is sufficient to provide the path to this file. Of course, you can explicitely point to a specific file, as shown below. With the data presented here MaxQuant version 1.6.10 was run. Files compressed as .gz can be read, too (like in the example below).

path1 <- system.file("extdata", package="wrProteo")
fiNaMQ <- "proteinGroups.txt.gz"

## We need to define the setup of species
specPrefMQ <- list(conta="CON_|LYSC_CHICK", mainSpecies="OS=Saccharomyces cerevisiae", spike=UPS1$ac, sampleNames="sdrf")
dataMQ <- readMaxQuantFile(path1, file=fiNaMQ, specPref=specPrefMQ, refLi="mainSpe",
  sdrf=c("PXD001819","max",sdrfOrder=TRUE), suplAnnotFile=TRUE, plotGraph=FALSE)
#> readMaxQuantFile : Note: Found 11 out of 1115 proteins marked as 'REV_' (reverse peptide identification) - Removing
#> readMaxQuantFile : Transform 2845(9.5%) initial '0' values to 'NA'
#> readMaxQuantFile : Could not find peptide counts columns (argument 'pepCountCol') matching to 'Unique peptides MS\.MS\.count '
#> readMaxQuantFile : Found 1 species name(s) appearing inside other ones, assume as truncated (eg  Saccharomyces cerevis)
#> readMaxQuantFile : Note: 5 proteins with unknown species
#>      data by species : Gallus gallus: 1,  Homo sapiens: 49,  Mus musculus: 1,  Saccharomyces cerevisiae: 1047,  Sus scrofa: 1,
#> readMaxQuantFile : Found 70 composite accession-numbers (eg P00761;P00761), truncating
#> readMaxQuantFile : Use column 'Accession' as identifyer (has fewest, ie 0 duplicated entries) as rownames
#> readMaxQuantFile : Could not find any proteins matching argument 'refLi=mainSpe', ignoring ...
#> readMaxQuantFile : normalizeThis : Omit redundant 'refLines'
#> readMaxQuantFile : readSampleMetaData : PROBLEM : Meta-data and abundance data do not match !  Number of samples from summary.txt.gz (27) and from main data (27) do NOT match !! .. ignoring
#> readMaxQuantFile : readSampleMetaData : readSdrf : Successfully read 24 annotation columns for 27 samples
#> readMaxQuantFile : readSampleMetaData : Note : Some filenames contain '.raw', others do NOT; solved inconsistency ..
#> readMaxQuantFile : readSampleMetaData : Successfully adjusted order of sdrf to content of summary.txt.gzparameters.txt.gz
#> readMaxQuantFile : readSampleMetaData : Choosing model 'combNonOrth' for evaluating replicate-structure (ie 9 groups of samples)
#> readMaxQuantFile : Sucessfully re-adjusted levels after bringing in order of Sdrf

The data were imported, log2-transformed and median-normalized, the protein annotation was parsed to automatically extract IDs, protein-names and species information. The species anotation was extracted out of the fasta-headers, as given in the specPref argument (MaxQuant specific setting). As explained in more detail in the general vignette wrProteoVignette1, In this example we use only proteins annotated as Homo sapiens for determining the normalization-factors via the argument refLi.

If you wish to inspect the graphs for the distribution of abundance values for each sample before and after median-normalization, please set the argument plotGraph=TRUE (default). Please note, that in the example above we directly added information about the experimental setup from the sdrf repository.

## The number of lines and colums
dim(dataMQ$quant)
#> [1] 1104   27
## A quick summary of some columns of quantitation data
summary(dataMQ$quant[,1:7])                # the first 8 cols
#>   12500amol_R1    12500amol_R2    12500amol_R3     125amol_R1   
#>  Min.   :17.52   Min.   :15.65   Min.   :14.91   Min.   :15.20  
#>  1st Qu.:22.49   1st Qu.:22.47   1st Qu.:22.48   1st Qu.:22.40  
#>  Median :23.45   Median :23.45   Median :23.45   Median :23.45  
#>  Mean   :23.67   Mean   :23.64   Mean   :23.65   Mean   :23.62  
#>  3rd Qu.:24.80   3rd Qu.:24.74   3rd Qu.:24.75   3rd Qu.:24.84  
#>  Max.   :30.27   Max.   :30.25   Max.   :30.30   Max.   :30.27  
#>  NA's   :98      NA's   :100     NA's   :104     NA's   :114    
#>    125amol_R2      125amol_R3     25000amol_R1  
#>  Min.   :14.88   Min.   :14.94   Min.   :15.74  
#>  1st Qu.:22.39   1st Qu.:22.41   1st Qu.:22.44  
#>  Median :23.45   Median :23.45   Median :23.45  
#>  Mean   :23.62   Mean   :23.63   Mean   :23.65  
#>  3rd Qu.:24.84   3rd Qu.:24.80   3rd Qu.:24.86  
#>  Max.   :30.28   Max.   :30.30   Max.   :30.19  
#>  NA's   :106     NA's   :114     NA's   :109
table(dataMQ$annot[,"SpecType"], useNA="always")
#> 
#>       conta mainSpecies       spike        <NA> 
#>           9        1047          48           0

Now we can summarize the presence of UPS1 proteins after treatment by MaxQuant : In sum, 48 UPS1 proteins were found, 0 are missing.

ProteomeDiscoverer

ProteomeDiscoverer is commercial software from ThermoFisher (www.thermofisher.com). Later in this document data from ProteomeDiscoverer will by frequently abbreviated as PD.

With the data (see also this data-set) used here, the identification was performed using the XCalibur module of ProteomeDiscoverer version 2.4 . Quantitation data at the level of consensus-proteins can be exported to tabulated text files, which can be treated by the function shown below. The resultant data were export in tablulated format and the file automatically named ‘_Proteins.txt_’ by ProteomeDiscoverer (the option R-headers was checked, however this option is not mandatory). Files compressed as .gz can be read, too (like in the example below).

path1 <- system.file("extdata", package="wrProteo")
fiNaPd <- "pxd001819_PD24_Proteins.txt.gz"
## Next, we define the setup of species
specPrefPD <- list(conta="Bos tauris|Gallus", mainSpecies="Saccharomyces cerevisiae", spike=UPS1$ac, sampleNames="sdrf")
dataPD <- readProteomeDiscovererFile(file=fiNaPd, path=path1, refLi="mainSpe", specPref=specPrefPD,
  sdrf=c("PXD001819", "max", sdrfOrder=TRUE), plotGraph=FALSE)
#> readProteomeDiscovererFile : Adding supl annotation-columns 'Number.of.Unique.Peptides', 'NA', 'NA', 'NA', 'NA' and 'Coverage.in.Percent'
#> readProteomeDiscovererFile : Note : 1 proteins/peptide(s) were marked (in addition) as contaminants based on 'specPref'
#> readProteomeDiscovererFile : Note : 1 contaminant protein(s)/peptide(s) will be removed, 1296 remain
#> readProteomeDiscovererFile : Count by 'Species' : Homo sapiens: 44 ;  Saccharomyces cerevisiae: 1239 ;  na: 13
#> readProteomeDiscovererFile : Removing 1 lines due to duplicated Accessions (typically due to contaminants)
#> readProteomeDiscovererFile : Normalize using (custom) subset of 1239 lines specified as 'mainSpe'
#> readProteomeDiscovererFile : readSampleMetaData : readSdrf : Successfully read 24 annotation columns for 27 samples
#> readProteomeDiscovererFile : readSampleMetaData : Successfully adjusted order of sdrf to content of pxd001819_PD_InputFiles.txt.gz
#> readProteomeDiscovererFile : readSampleMetaData : Choosing model 'combNonOrth' for evaluating replicate-structure (ie 9 groups of samples)
#> readProteomeDiscovererFile : Sucessfully re-adjusted levels after bringing in order of Sdrf

The data were imported, log2-transformed and median-normalized, the protein annotation was parsed to automatically extract IDs, protein-names and species information. Please note, that quantitation data exported from ProteomeDiscoverer frequently have very generic column-names (increasing numbers). When calling the import-function they can be replaced by more meaningful names either using the argument sampNa, or from reading the default annotation in the file ‘InputFiles.txt’ or, finally, from the sdrf-annotation. In the example below both the default annotation as file ‘InputFiles.txt’ and sdrf annotation are available and were integrated to object produced by the import-function.

The species anotation was extracted out as given in the specPref argument. In this example we use only proteins annotated as Homo sapiens for determining the normalization-factors via the argument refLi.

If you wish to inspect the graphs for the distribution of abundance values for each sample before and after median-normalization, please set the argument plotGraph=TRUE (default).

## The number of lines and colums
dim(dataPD$quant)
#> [1] 1295   27
## A quick summary of some columns of quantitation data
summary(dataPD$quant[,1:7])        # the first 8 cols
#>   12500amol_R1    12500amol_R2    12500amol_R3     125amol_R1   
#>  Min.   :10.86   Min.   :11.33   Min.   :11.57   Min.   :10.90  
#>  1st Qu.:18.33   1st Qu.:18.32   1st Qu.:18.31   1st Qu.:18.23  
#>  Median :19.48   Median :19.46   Median :19.46   Median :19.37  
#>  Mean   :19.62   Mean   :19.62   Mean   :19.60   Mean   :19.52  
#>  3rd Qu.:20.85   3rd Qu.:20.83   3rd Qu.:20.80   3rd Qu.:20.84  
#>  Max.   :26.36   Max.   :26.36   Max.   :26.39   Max.   :26.43  
#>  NA's   :69      NA's   :59      NA's   :67      NA's   :88     
#>    125amol_R2      125amol_R3     25000amol_R1  
#>  Min.   :10.48   Min.   :11.16   Min.   :11.55  
#>  1st Qu.:18.20   1st Qu.:18.20   1st Qu.:18.34  
#>  Median :19.40   Median :19.40   Median :19.54  
#>  Mean   :19.52   Mean   :19.53   Mean   :19.67  
#>  3rd Qu.:20.82   3rd Qu.:20.83   3rd Qu.:21.01  
#>  Max.   :26.39   Max.   :26.46   Max.   :26.40  
#>  NA's   :99      NA's   :85      NA's   :64
table(dataPD$annot[,"SpecType"], useNA="always")
#> 
#> mainSpecies       spike        <NA> 
#>        1239          48           8

Confirming the presence of UPS1 proteins by ProteomeDiscoverer:

Now we can summarize the presence of UPS1 proteins after treatment by ProteomeDiscoverer : In sum, 48 UPS1 proteins were found, 0 are missing.

Proline

Proline is open-source software provided by the Profi-consortium (see also proline-core on github), published by Bouyssie et al 2020. Later in this document data from Proline will by frequently abbreviated as PL.

Protein identification in Proline gets performed by SearchGUI, see also Vaudel et al 2015. In this case X!Tandem (see also Duncan et al 2005) was used as search engine.

Quantitation data at the level of consensus-proteins can be exported from Proline as .xlsx or tabulated text files, both formats can be treated by the import-functions shown below. Here, Proline version 1.6.1 was used with addition of Percolator (via MS-Angel from the same authors).

path1 <- system.file("extdata", package="wrProteo")
fiNaPl <- "pxd001819_PL.xlsx"

specPrefPL <- list(conta="_conta", mainSpecies="Saccharomyces cerevisiae", spike=UPS1$ac, sampleNames="sdrf")
dataPL <- readProlineFile(fiNaPl, path=path1, specPref=specPrefPL, normalizeMeth="median", refLi="mainSpe",
  sdrf=c("PXD001819", "max", sdrfOrder=TRUE), plotGraph=FALSE)
#> readProlineFile : Found sheets 'Search settings and infos', 'Import and filters', 'Quant config' and 'Protein sets'
#> readProlineFile : Normalize using (custom) subset of 1137 lines specified as 'mainSpe'
#> readProlineFile : readSampleMetaData : replicateStructure : Need to correct redundant colnames !!!
#> readProlineFile : readSampleMetaData : readSdrf : Successfully read 24 annotation columns for 27 samples
#> readProlineFile : readSampleMetaData : Note : Some filenames contain '.raw', others do NOT; solved inconsistency ..
#> readProlineFile : readSampleMetaData : Successfully adjusted order of sdrf to content of pxd001819_PL.xlsx
#> readProlineFile : readSampleMetaData : Choosing model 'combNonOrth' for evaluating replicate-structure (ie 9 groups of samples)
#> readProlineFile : Sucessfully re-adjusted levels after bringing in order of Sdrf

The (log2-transformed) data were imported and median-normalized, the protein annotation was parsed to automatically extract IDs, protein-names and species information. The species anotation was extracted out of protein annotation columns, as specified with the specPref argument. As explained in more detail in the general vignette wrProteoVignette1, In this example we use only proteins annotated as Homo sapiens for determining the normalization-factors via the argument refLi.

If you wish to inspect the graphs for the distribution of abundance values for each sample before and after median-normalization, please set the argument plotGraph=TRUE (default). Please note, that in the example above we directly added information about the experimental setup from the sdrf repository.

## The number of lines and colums
dim(dataPL$quant)
#> [1] 1186   27
## A quick summary of some columns of quantitation data
summary(dataPL$quant[,1:8])        # the first 8 cols
#>   12500amol_R1    12500amol_R2    12500amol_R3     125amol_R1   
#>  Min.   :14.21   Min.   :14.08   Min.   :13.24   Min.   :14.43  
#>  1st Qu.:19.94   1st Qu.:19.95   1st Qu.:19.97   1st Qu.:19.76  
#>  Median :21.75   Median :21.76   Median :21.78   Median :21.65  
#>  Mean   :21.80   Mean   :21.83   Mean   :21.82   Mean   :21.75  
#>  3rd Qu.:23.63   3rd Qu.:23.65   3rd Qu.:23.65   3rd Qu.:23.70  
#>  Max.   :29.42   Max.   :29.45   Max.   :29.43   Max.   :29.48  
#>  NA's   :43      NA's   :43      NA's   :45      NA's   :46     
#>    125amol_R2      125amol_R3     25000amol_R1    25000amol_R2  
#>  Min.   :13.54   Min.   :14.77   Min.   :14.13   Min.   :14.83  
#>  1st Qu.:19.77   1st Qu.:19.76   1st Qu.:19.98   1st Qu.:19.95  
#>  Median :21.64   Median :21.65   Median :21.81   Median :21.83  
#>  Mean   :21.73   Mean   :21.73   Mean   :21.85   Mean   :21.85  
#>  3rd Qu.:23.68   3rd Qu.:23.65   3rd Qu.:23.69   3rd Qu.:23.69  
#>  Max.   :29.47   Max.   :29.42   Max.   :29.40   Max.   :29.44  
#>  NA's   :46      NA's   :48      NA's   :49      NA's   :48
table(dataPL$annot[,"SpecType"], useNA="always")
#> 
#>       conta mainSpecies       spike        <NA> 
#>           1        1137          48           0

Now we can summarize the presence of UPS1 proteins after treatment by Proline : In sum, 48 UPS1 proteins were found, 0 are missing.

Uniform Re-Arranging Of Data

For easy and proper comparisons we need to make sure all columns are in the same order, since we have forced to use the initial order of the Sdrf, this is already the case. However, this order does not correspond to the incrasing order of UPS1-doses. To facilitate later steps, we’ll bring them in this increasing order of UPS1-doses.

## bring all results (MaxQuant,ProteomeDiscoverer, ...) in same ascending order
## as reference will use the order from ProteomeDiscoverer, it's output is already in a convenient order
sampNa <- colnames(dataPD$quant)[order(dataPD$sampleSetup$level)]

## it is more convenient to re-order columns this way in each project
dataMQ <- corColumnOrder(dataMQ, sampNames=sampNa)         
dataPD <- corColumnOrder(dataPD, sampNames=sampNa)         
dataPL <- corColumnOrder(dataPL, sampNames=sampNa)         

At import we made use of the argument specPref (specifying ‘mainSpecies’, ‘conta’ and ‘spike’) which allows to build categories based on searching keywords based on the initial annotation. In turn, we obtain the labels : ‘main Spe’ for yeast (ie matrix), ‘species2’ for the UPS1 (ie spike) ‘conta’ for contaminants. Let’s replace the first two generic terms by more specific ones (ie ‘Yeast’ and ‘UPS1’) :

## Need to rename $annot[,"SpecType"]
dataPD <- replSpecType(dataPD, replBy=cbind(old=c("mainSpe","species2"), new=c("Yeast","UPS1")))
dataMQ <- replSpecType(dataMQ, replBy=cbind(old=c("mainSpe","species2"), new=c("Yeast","UPS1")))
dataPL <- replSpecType(dataPL, replBy=cbind(old=c("mainSpe","species2"), new=c("Yeast","UPS1")))

## Need to address missing ProteinNames (UPS1) due to missing tags in Fasta
dataPD <- replMissingProtNames(dataPD)
#> replreplMissingProtNames :  ..trying to replace 1295 'EntryName'
dataMQ <- replMissingProtNames(dataMQ)
#> replreplMissingProtNames :  ..trying to replace 5 'EntryName'
dataPL <- replMissingProtNames(dataPL)
    table(dataPD$annot[,"SpecType"])
#> 
#> mainSpecies       spike 
#>        1239          48

## synchronize order of groups
(grp9 <- dataMQ$sampleSetup$level)
#>    50    50    50   125   125   125   250   250   250   500   500   500  2500 
#>     1     1     1     2     2     2     3     3     3     4     4     4     5 
#>  2500  2500  5000  5000  5000 12500 12500 12500 25000 25000 25000 50000 50000 
#>     5     5     6     6     6     7     7     7     8     8     8     9     9 
#> 50000 
#>     9
names(grp9) <- paste0(names(grp9),"amol")
dataPL$sampleSetup$groups <- dataMQ$sampleSetup$groups <- dataPD$sampleSetup$groups <- grp9  # synchronize order of groups
## extract names of quantified UPS1-proteins
NamesUpsPD <- dataPD$annot[which(dataPD$annot[,"SpecType"]=="spike"), "Accession"]
NamesUpsMQ <- dataMQ$annot[which(dataMQ$annot[,"SpecType"]=="spike"), "Accession"]
NamesUpsPL <- dataPL$annot[which(dataPL$annot[,"SpecType"]=="spike"), "Accession"]
tabS <- mergeVectors(PD=table(dataPD$annot[,"SpecType"]), MQ=table(dataMQ$annot[,"SpecType"]), PL=table(dataPL$annot[,"SpecType"]))
tabT <- mergeVectors(PD=table(dataPD$annot[,"Species"]), MQ=table(dataMQ$annot[,"Species"]), PL=table(dataPL$annot[,"Species"]))
tabS[which(is.na(tabS))] <- 0
tabT[which(is.na(tabT))] <- 0
kable(cbind(tabS[,2:1], tabT), caption="Number of proteins identified, by custom tags, species and software")
Number of proteins identified, by custom tags, species and software
mainSpecies conta Gallus gallus Homo sapiens Mus musculus Saccharomyces cerevisiae Sus scrofa
PD 1239 0 0 44 0 1239 0
MQ 1047 9 1 49 1 1047 1
PL 1137 1 0 48 0 1137 1

The initial fasta file also contained the yeast strain number, this has been stripped off when using default parameters.


Basic Data Treatment

Structure of Experiment

The global structure of experiments can be provided as sdrf-file and/or from meta-data stored with the experimental data read. For convenience, this information about the groups of replicates was already deduced and can be found (for example) in dataMQ\(sampleSetup\)sdrf.

kable(cbind(dataMQ$sampleSetup$sdrf[,c(23,7,19,22)], groups=dataMQ$sampleSetup$groups))
comment.data.file. characteristics.spiked.compound. comment.modification.parameters..2 comment.file.uri. groups
25 UPS1_50amol_R1.raw CT=mixture;QY=50 amol;CN=UPS1;CV=Standards Research Group NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_50amol_R1.raw 1
26 UPS1_50amol_R2.raw CT=mixture;QY=50 amol;CN=UPS1;CV=Standards Research Group NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_50amol_R2.raw 1
27 UPS1_50amol_R3.raw CT=mixture;QY=50 amol;CN=UPS1;CV=Standards Research Group NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_50amol_R3.raw 1
4 UPS1_125amol_R1.raw CT=mixture;QY=125 amol;CN=UPS1;CV=Standards Research Group NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_125amol_R1.raw 2
5 UPS1_125amol_R2.raw CT=mixture;QY=125 amol;CN=UPS1;CV=Standards Research Group NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_125amol_R2.raw 2
6 UPS1_125amol_R3.raw CT=mixture;QY=125 amol;CN=UPS1;CV=Standards Research Group NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_125amol_R3.raw 2
13 UPS1_250amol_R1.raw CT=mixture;QY=250 amol;CN=UPS1;CV=Standards Research Group NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_250amol_R1.raw 3
14 UPS1_250amol_R2.raw CT=mixture;QY=250 amol;CN=UPS1;CV=Standards Research Group NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_250amol_R2.raw 3
15 UPS1_250amol_R3.raw CT=mixture;QY=250 amol;CN=UPS1;CV=Standards Research Group NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_250amol_R3.raw 3
22 UPS1_500amol_R1.raw CT=mixture;QY=500 amol;CN=UPS1;CV=Standards Research Group NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_500amol_R1.raw 4
23 UPS1_500amol_R2.raw CT=mixture;QY=500 amol;CN=UPS1;CV=Standards Research Group NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_500amol_R2.raw 4
24 UPS1_500amol_R3.raw CT=mixture;QY=500 amol;CN=UPS1;CV=Standards Research Group NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_500amol_R3.raw 4
10 UPS1_2500amol_R1.raw CT=mixture;QY=2500 amol;CN=UPS1;CV=Standards Research Group NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_2500amol_R1.raw 5
11 UPS1_2500amol_R2.raw CT=mixture;QY=2500 amol;CN=UPS1;CV=Standards Research Group NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_2500amol_R2.raw 5
12 UPS1_2500amol_R3.raw CT=mixture;QY=2500 amol;CN=UPS1;CV=Standards Research Group NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_2500amol_R3.raw 5
19 UPS1_5000amol_R1.raw CT=mixture;QY=5000 amol;CN=UPS1;CV=Standards Research Group NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_5000amol_R1.raw 6
20 UPS1_5000amol_R2.raw CT=mixture;QY=5000 amol;CN=UPS1;CV=Standards Research Group NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_5000amol_R2.raw 6
21 UPS1_5000amol_R3.raw CT=mixture;QY=5000 amol;CN=UPS1;CV=Standards Research Group NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_5000amol_R3.raw 6
1 UPS1_12500amol_R1.raw CT=mixture;QY=12500 amol;CN=UPS1;CV=Standards Research Group NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_12500amol_R1.raw 7
2 UPS1_12500amol_R2.raw CT=mixture;QY=12500 amol;CN=UPS1;CV=Standards Research Group NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_12500amol_R2.raw 7
3 UPS1_12500amol_R3.raw CT=mixture;QY=12500 amol;CN=UPS1;CV=Standards Research Group NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_12500amol_R3.raw 7
7 UPS1_25000amol_R1.raw CT=mixture;QY=25000 amol;CN=UPS1;CV=Standards Research Group NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_25000amol_R1.raw 8
8 UPS1_25000amol_R2.raw CT=mixture;QY=25000 amol;CN=UPS1;CV=Standards Research Group NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_25000amol_R2.raw 8
9 UPS1_25000amol_R3.raw CT=mixture;QY=25000 amol;CN=UPS1;CV=Standards Research Group NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_25000amol_R3.raw 8
16 UPS1_50000amol_R1.raw CT=mixture;QY=50000 amol;CN=UPS1;CV=Standards Research Group NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_50000amol_R1.raw 9
17 UPS1_50000amol_R2.raw CT=mixture;QY=50000 amol;CN=UPS1;CV=Standards Research Group NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_50000amol_R2.raw 9
18 UPS1_50000amol_R3.raw CT=mixture;QY=50000 amol;CN=UPS1;CV=Standards Research Group NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_50000amol_R3.raw 9

Normalization

No additional normalization is needed, all data were already median normalized to the host proteins (ie Saccharomyces cerevisiae) after importing the initial quantification-output using ‘readMaxQuantFile()’, ‘readProlineFile()’ and ‘readProteomeDiscovererFile()’.

Presence of NA-values

As mentioned in the general vignette of this package, ‘wrProteoVignette1’, it is important to investigate the nature of NA-values. In particular, checking the hypothesis that NA-values originate from very low abundance instances is very important for deciding how to treat NA-values furtheron.

## Let's inspect NA values from ProteomeDiscoverer as graphic
matrixNAinspect(dataPD$quant, gr=grp9, tit="ProteomeDiscoverer")
#> stableMode : Method='density',  length of x =970, 'bandw' has been set to 44

## Let's inspect NA values from MaxQuant as graphic
matrixNAinspect(dataMQ$quant, gr=grp9, tit="MaxQuant")
#> stableMode : Method='density',  length of x =1142, 'bandw' has been set to 47

## Let's inspect NA values from Proline as graphic
matrixNAinspect(dataPL$quant, gr=grp9, tit="Proline")
#> stableMode : Method='density',  length of x =413, 'bandw' has been set to 28

A key element to understand the nature of NA-value is to investigate their NA-neighbours. If a given protein has for just one of the 3 replicates an NA, the other two valid quantifications can be considered as NA-neighbours. In the figures above all NA-neighbours are shown in the histogram and their mode is marked by an arrow. One can see, that NA-neighbours are predominantely (but not exclusively) part of the lower quantitation values. This supports the hypothesis that NAs occur most frequently with low abundance proteins.

NA-Imputation and Statistical Testing for Changes in Abundance

NA-values represent a challange for statistical testing. In addition, techniques like PCA don’t allow NAs, neither.

The number of NAs varies between samples : Indeed, very low concentrations of UPS1 are difficult to get detected and contribute largely to the NAs (as we will see later in more detail). Since the amout of yeast proteins (ie the matrix in this setup) stays constant across all samples, yeast proteins should always get detected the same way.

## Let's look at the number of NAs. Is there an accumulated number in lower UPS1 samples ?
tabSumNA <- rbind(PD=sumNAperGroup(dataPD$raw, grp9), MQ=sumNAperGroup(dataMQ$raw, grp9), PL=sumNAperGroup(dataPL$raw, grp9) )
kable(tabSumNA, caption="Number of NAs per group of samples", align="r")
Number of NAs per group of samples
1 2 3 4 5 6 7 8 9
PD 272 272 255 232 204 207 195 209 220
MQ 318 334 323 337 282 297 302 330 322
PL 124 140 157 140 137 139 131 141 131

In the section above we investigated the circumstances of NA-instances and provided evidence that NA-values typically represent proteins with low abundance which frequently ended up as non-detectable (NA). Thus, we hypothesize that (in most cases) NA-values might also have been detected in quantities like their NA-neighbours. In consequence, we will model a normal distribution based on the NA-neighbours and use for substituting.

The function testRobustToNAimputation() from this package (wrProteo) allows to perform NA-imputation and subsequent statistical testing (after repeated imputation) between all groups of samples (see also the general vignette). One of the advantages of this implementation, is that multiple rounds of imputation are run, so that final results (including pair-wise testing) get stabilized to (rare) stochastic effects. For this reason one may also speak of stabilized NA-imputations.

The statistical tests used underneith make use of the shrinkage-procedure provided from the empirical Bayes procedure as implemented to the Bioconductor package limma, see also Ritchie et al 2015. In addition, various formats of multiple testing correction can be added to the results : Benjamini-Hochberg FDR (lateron referred to as BH or BH-FDR, see FDR on Wikipedia, see also Benjamini and Hochberg 1995), local false discovery rate (lfdr, using the package fdrtool, see Strimmer 2008), or modified testing by ROTS, etc … In this vignette we will make use of the BH-FDR.

We are ready to launch the NA-imputation and testing for data from ProteomeDiscoverer. Please note, that the procedure including repetive NA-imputations may take a few seconds.

testPD <- testRobustToNAimputation(dataPD, imputMethod="informed")     # ProteomeDiscoverer
#> testRobustToNAimputation : matrixNAneighbourImpute : stableMode : Method='density',  length of x =970, 'bandw' has been set to 44
#> testRobustToNAimputation : matrixNAneighbourImpute :  n.woNA= 32899 , n.NA = 2066
#>     Imputing based on 'informed' using mode= 17.34 and sd= 0.5
#>     Note, mean for imputation is  below 0.05  quantile !!

Then for MaxQuant …

testMQ <- testRobustToNAimputation(dataMQ, imputMethod="informed")      # MaxQuant , ok
#> testRobustToNAimputation : matrixNAneighbourImpute : stableMode : Method='density',  length of x =1142, 'bandw' has been set to 47
#> testRobustToNAimputation : matrixNAneighbourImpute :  n.woNA= 26963 , n.NA = 2845
#>     Imputing based on 'informed' using mode= 21.49 and sd= 0.5
#>     Note, mean for imputation is  below 0.05  quantile !!

And finally for Proline :

testPL <- testRobustToNAimputation(dataPL, imputMethod="informed")      # Proline
#> testRobustToNAimputation : matrixNAneighbourImpute : stableMode : Method='density',  length of x =413, 'bandw' has been set to 28
#> testRobustToNAimputation : matrixNAneighbourImpute :  n.woNA= 30782 , n.NA = 1240
#>     Imputing based on 'informed' using mode= 18.74 and sd= 0.5
#>     Note, mean for imputation is  below 0.05  quantile !!

From these results we’ll use i) the NA-imputed version of our datasets for plotting principal components (PCA) and ii) the (stabilized) testing results for counting TP, FP, etc and to construct ROC curves.

Let’s add the NA-imputed data to our main object :

dataPD$datImp <- testPD$datImp       # recuperate imputeded data to main data-object
dataMQ$datImp <- testMQ$datImp
dataPL$datImp <- testPL$datImp

Analysis Using All Proteins Identified (Matrix + UPS1)

In this section we’ll consider all proteins identified and quantified in a pair-wise fashion, using the t-tests already run in the previous section. As mentioned, the experimental setup is very special, since all proteins that are truly changing are known in advance (the UPS1 spike-in proteins). Tables get constructed by counting based on various thresholds for considering given protein abundances as differential or not. A traditional 5 percent FDR cut-off is used for Volcano-plots, while ROC-curves allow inspecting the entire range of potential cut-off values.

Pairwise Testing Summary

A very universal and simple way to analyze data is by checking as several pairwise comparisons, in particular, if the experimental setup does not include complete multifactorial plans.

This UPS1 spike-in experiment (see also Experimental Setup) has 27 samples organized (according to meta-information) as 9 groups. Thus, one obtains in total 36 pair-wise comparisons which will make comparisons very crowded. The original publication by Ramus et al 2016 focussed on 3 pairwise comparisons only. In this vignette it is shown how all of them can get considered.

Now, we’ll construct a table showing all possible pairwise-comparisons. Using the function numPairDeColNames() we can easily extract the UPS1 concentrations as numeric content and show the (log-)ratio of the pairwise comparisons (column ‘log2rat’), the final concentrations (columns ‘conc1’ and ‘conc2’, in amol) and the number of differentially abundant proteins passing 5% FDR (using classical Benjamini-Hochberg FDR (columns ‘sig.xx.BH’) or lfdr (Strimmer 2008, columns ‘sig._xx_.lfdr’ ).

## The number of differentially abundant proteins passing 5% FDR (ProteomeDiscoverer and MaxQuant)
signCount <- cbind( sig.PD.BH=colSums(testPD$BH < 0.05, na.rm=TRUE), sig.PD.lfdr=if("lfdr" %in% names(testPD)) colSums(testPD$lfdr < 0.05, na.rm=TRUE),
  sig.MQ.BH=colSums(testMQ$BH < 0.05, na.rm=TRUE), sig.MQ.lfdr=if("lfdr" %in% names(testMQ)) colSums(testMQ$lfdr < 0.05, na.rm=TRUE),
  sig.PL.BH=colSums(testPL$BH < 0.05, na.rm=TRUE), sig.PL.lfdr=if("lfdr" %in% names(testPL)) colSums(testPL$lfdr < 0.05, na.rm=TRUE)  )

table1 <- numPairDeColNames(testPD$BH, stripTxt="amol", sortByAbsRatio=TRUE)
table1 <- cbind(table1, signCount[table1[,1],])
rownames(table1) <- colnames(testMQ$BH)[table1[,1]]

kable(table1, caption="All pairwise comparisons and number of significant proteins", align="c")
All pairwise comparisons and number of significant proteins
index log2rat conc1 conc2 sig.PD.BH sig.PD.lfdr sig.MQ.BH sig.MQ.lfdr sig.PL.BH sig.PL.lfdr
50000amol-50amol 34 9.966 50 50000 578 518 488 427 575 489
25000amol-50amol 31 8.966 50 25000 597 540 440 387 548 478
125amol-50000amol 12 8.644 125 50000 370 286 391 339 413 342
12500amol-50amol 29 7.966 50 12500 521 480 324 260 483 392
125amol-25000amol 3 7.644 125 25000 346 290 352 290 303 238
250amol-50000amol 15 7.644 250 50000 382 316 458 383 328 260
12500amol-125amol 1 6.644 125 12500 228 194 122 97 234 185
25000amol-250amol 9 6.644 250 25000 236 194 374 328 214 151
50000amol-500amol 27 6.644 500 50000 363 324 466 428 370 318
5000amol-50amol 35 6.644 50 5000 610 546 379 324 528 484
12500amol-250amol 7 5.644 250 12500 136 96 108 89 132 100
25000amol-500amol 24 5.644 500 25000 253 179 370 340 203 142
2500amol-50amol 32 5.644 50 2500 563 485 333 280 497 455
125amol-5000amol 17 5.322 125 5000 273 242 98 70 209 152
12500amol-500amol 22 4.644 500 12500 159 102 97 55 132 86
125amol-2500amol 5 4.322 125 2500 195 173 93 78 182 154
2500amol-50000amol 14 4.322 2500 50000 302 212 488 494 263 183
250amol-5000amol 20 4.322 250 5000 200 129 84 55 144 120
25000amol-2500amol 6 3.322 2500 25000 48 34 458 436 73 54
2500amol-250amol 10 3.322 250 2500 113 86 36 30 104 74
50000amol-5000amol 21 3.322 5000 50000 312 237 471 446 319 281
5000amol-500amol 28 3.322 500 5000 181 132 84 70 134 103
500amol-50amol 36 3.322 50 500 439 370 229 177 384 346
12500amol-2500amol 4 2.322 2500 12500 20 12 78 119 40 29
25000amol-5000amol 18 2.322 5000 25000 52 38 384 403 72 51
2500amol-500amol 25 2.322 500 2500 128 86 52 31 103 76
250amol-50amol 33 2.322 50 250 363 288 189 147 324 252
12500amol-50000amol 11 2.000 12500 50000 219 147 213 166 169 109
125amol-500amol 23 2.000 125 500 22 17 7 2 38 33
12500amol-5000amol 16 1.322 5000 12500 23 15 7 4 36 26
125amol-50amol 30 1.322 50 125 289 242 140 112 227 180
12500amol-25000amol 2 1.000 12500 25000 12 10 85 79 29 15
125amol-250amol 8 1.000 125 250 18 10 1 0 2 1
25000amol-50000amol 13 1.000 25000 50000 142 88 64 42 95 62
2500amol-5000amol 19 1.000 2500 5000 12 9 2 0 16 10
250amol-500amol 26 1.000 250 500 6 2 1 0 3 2
resMQ1 <- extractTestingResults(testMQ, compNo=1, thrsh=0.05, FCthrs=2)
resPD1 <- extractTestingResults(testPD, compNo=1, thrsh=0.05, FCthrs=2)
resPL1 <- extractTestingResults(testPL, compNo=1, thrsh=0.05, FCthrs=2)

You can see that in numerous cases much more than the 48 UPS1 proteins showed up significant, ie yeast proteins supposed to remain constant also showed up in part as ‘sigificantly changing’. However, some proteins with enthousiastic FDR values have very low log-FC amplitude and will be removed by filtering in the following steps.

par(mar=c(5.5, 4.7, 4, 1))
imageW(table1[,c("sig.PD.BH","sig.MQ.BH","sig.PL.BH" )], col=rev(RColorBrewer::brewer.pal(9,"YlOrRd")),
  transp=FALSE, tit="Number of BH.FDR passing proteins by the quantification approaches")
mtext("Dark red for high number signif proteins", cex=0.75)

In the original Ramus et al 2016 et al paper only 3 pairwise comparisons were further analyzed :

## Selection in Ramus paper
kable(table1[which(rownames(table1) %in% colnames(testPD$BH)[c(2,21,27)]),], caption="Selected pairwise comparisons (as in Ramus et al)", align="c")
Selected pairwise comparisons (as in Ramus et al)
index log2rat conc1 conc2 sig.PD.BH sig.PD.lfdr sig.MQ.BH sig.MQ.lfdr sig.PL.BH sig.PL.lfdr
50000amol-500amol 27 6.644 500 50000 363 324 466 428 370 318
50000amol-5000amol 21 3.322 5000 50000 312 237 471 446 319 281
12500amol-25000amol 2 1.000 12500 25000 12 10 85 79 29 15

Here we’ll consider all possible pairwise comparisons, as shown below.

Volcano Plots

Volcano-plots offer additional insight in how statistical test results relate to log-fold-change of pair-wise comparisons. In addition, we can mark the different protein-groups (or species) by different symbols, see also the general vignette ‘wrProteoVignette1’ (from this package) and the vignette to the package wrGraph. Counting the number of proteins passing a classical threshold for differential expression combined with a filter for minimum log-fold-change is a good way to start.

As mentioned, the dataset from Ramus et al 2016 contains 9 different levels of UPS1 concentrations (Ramus Data Set), in consequence 36 pair-wise comparisons are possible. Again, plotting all possible Volcano plots would make way too crowded plots, instead we’ll try to summarize (see ROC curves), cluster into groups and finally plot only a few representative ones.

ROC for Multiple Pairs

Receiver Operator Curves (ROC) curves display sensitivity (True Positive Rate) versus 1-Specificity (False Positive Rate). They are typically used as illustrate and compare the discriminiative capacity of a yes/no decision system (here: differential abundance or not), see eg also the original publication Hand and Till 2001.

The data get constructed by sliding through a panel of threshold-values for the statistical tests instead of just using 0.05. Due to the experimental setup we know that all yeast proteins should stay constant and only UPS1 proteins (see also Experimental Setup) are expected to change. For each of these threshold values one counts the number of true positives (TP), false positives (FP) etc, allowing then to calculate sensitivity and specificity.

In the case of bechmarking quantitation efforts, ROC curves are used to judge how well heterologous spikes UPS1 proteins can be recognized as differentially abundant while constant yeast matrix proteins should not get classified as differential. Finally, ROC curves let us also gain some additional insights in the question which cutoff may be optimal or if the commonly used 5-percent FDR threshld cutoff allows getting the best out of the testing system.

The next step consists in calculating the area under the curve (AUC) for the individual profiles of each pairwise comparison. Below, these calculations of summarizeForROC() are run in batch.

## calulate  AUC for each ROC
layout(1)
rocPD <- lapply(table1[,1], function(x) summarizeForROC(testPD, useComp=x, annotCol="SpecType", spec=c("mainSpecies","spike"), tyThr="BH", plotROC=FALSE,silent=TRUE))
rocMQ <- lapply(table1[,1], function(x) summarizeForROC(testMQ, useComp=x, annotCol="SpecType", spec=c("mainSpecies","spike"), tyThr="BH", plotROC=FALSE,silent=TRUE))
rocPL <- lapply(table1[,1], function(x) summarizeForROC(testPL, useComp=x, annotCol="SpecType", spec=c("mainSpecies","spike"), tyThr="BH", plotROC=FALSE,silent=TRUE))

# we still need to add the names for the pair-wise groups:
names(rocPD) <- names(rocMQ) <- names(rocPL) <- rownames(table1)
AucAll <- cbind(ind=table1[match(names(rocPD), rownames(table1)),"index"], clu=NA,
  PD=sapply(rocPD, AucROC), MQ=sapply(rocMQ, AucROC), PL=sapply(rocPL, AucROC) )

To provide a quick overview, the clustered AUC values are displayed as PCA :

try(biplot(prcomp(AucAll[,names(methNa)]), cex=0.7, main="PCA of AUC from ROC Curves"))

On this PCA one can see the three software types in red. We can see that AUC values from MaxQuant correlate somehow less to Proline and ProteomeDiscoverer (red arrows). The pair-wise ratios constructed from the different rations are shown in black. They form a compact area with mostly wide ratios (one rather high and one low concentration of UPS1 proteins). Besides, there is a number of disperse points, typically containig the point of 125 and/or 250 fmol. These disperse points do not replicate well and follow their own characteristics captured by PC2.

Now we are ready to inspect the 5 clusters in detail :

Grouping of ROC Curves to Display Representative Ones

As mentioned, there are too many pair-wise combinations available for plotting and inspecting all ROC-curves. So we can try to group similar pairwise comparison AUC values into clusters and then easily display representative examples for each cluster/group. Again, we (pre)define that we want to obtain 5 groups (like customer-ratings from 5 to 1 stars), a k-Means clustering approach was chosen.

## number of groups for clustering
nGr <- 5
## K-Means clustering
kMAx <- stats::kmeans(standardW(AucAll[,c("PD","MQ","PL")]), nGr)$cluster
   table(kMAx)
#> kMAx
#>  1  2  3  4  5 
#>  1  6  7  8 14
AucAll[,"clu"] <- kMAx
AucAll <- reorgByCluNo(AucAll, cluNo=kMAx, useColumn=c("PD","MQ","PL"))
AucAll <- cbind(AucAll, iniInd=table1[match(rownames(AucAll), rownames(table1)), "index"])
colnames(AucAll)[1:(which(colnames(AucAll)=="index")-1)] <- paste("Auc",colnames(AucAll)[1:(which(colnames(AucAll)=="index")-1)], sep=".")
AucAll[,"cluNo"] <- rep(nGr:1, table(AucAll[,"cluNo"]))        # make cluNo descending

kMAx <- AucAll[,"cluNo"]      # update
  table(AucAll[,"cluNo"])
#> 
#>  1  2  3  4  5 
#>  6  1  8  7 14
 ## note : column 'index' is relative to table1, iniInd to ordering inside objects from clustering

To graphically summarize the AUC values, the clustered AUC values are plotted accompagnied by the geometric mean:

try(profileAsClu(AucAll[,c(1:length(methNa),(length(methNa)+2:3))], clu="cluNo", meanD="geoMean", tit="Pairwise Comparisons as Clustered AUC from ROC Curves",
  xlab="Comparison number", ylab="AUC", meLty=1, meLwd=3))

From this figure we can see clearly that there are some pairwise comparisons where all initial analysis-software results yield high AUC values, while other pairwise comparisons less discriminative power.

Again, now we can select a representative pairwise-comparison for each cluster (from the center of each cluster):

AucRep <- table(AucAll[,"cluNo"])[rank(unique(AucAll[,"cluNo"]))]   # representative for each cluster
AucRep <- round(cumsum(AucRep) -AucRep/2 +0.1)

## select representative for each cluster
kable(round(AucAll[AucRep,c("Auc.PD","Auc.MQ","Auc.PL","cluNo")],3), caption="Selected representative for each cluster ", align="c")
Selected representative for each cluster
Auc.PD Auc.MQ Auc.PL cluNo
2500amol-50000amol 0.967 0.996 0.992 5
12500amol-50000amol 0.936 0.952 0.979 4
2500amol-500amol 0.897 0.886 0.963 3
125amol-2500amol 0.720 0.502 0.947 2
125amol-250amol 0.548 0.445 0.498 1

Now we can check if some experimental UPS1 log-fold-change have a bias for some clusters.

ratTab <- sapply(5:1, function(x) { y <- table1[match(rownames(AucAll),rownames(table1)),]
  table(factor(signif(y[which(AucAll[,"cluNo"]==x),"log2rat"],1), levels=unique(signif(table1[,"log2rat"],1))) )})
colnames(ratTab) <- paste0("\nclu",5:1,"\nn=",rev(table(kMAx)))
layout(1)
imageW(ratTab, tit="Frequency of rounded log2FC in the 5 clusters", xLab="log2FC (rounded)", col=RColorBrewer::brewer.pal(9,"YlOrRd"),las=1)
mtext("Dark red for enrichment of given pair-wise ratio", cex=0.7)

We can see, that the cluster of best ROC-curves (cluster 5) covers practically all UPS1 log-ratios from this experiment without being restricted just to the high ratios.

Plotting ROC Curves for the Best Cluster (the ‘+++++’)

colPanel <- 2:5
gr <- 5
j <- match(rownames(AucAll)[AucRep[6-gr]], colnames(testPD$t))

## table of all proteins in cluster
useLi <- which(AucAll[,"cluNo"]==gr)
tmp <- cbind(round(as.data.frame(AucAll)[useLi,c("cluNo","Auc.PD","Auc.MQ","Auc.PL")],3),
  as.data.frame(table1)[match(names(useLi),rownames(table1)), c(2,5,7,9)])
kable(tmp, caption="AUC details for best pairwise-comparisons ", align="c")
AUC details for best pairwise-comparisons
cluNo Auc.PD Auc.MQ Auc.PL log2rat sig.PD.BH sig.MQ.BH sig.PL.BH
250amol-50000amol 5 0.981 0.999 0.995 7.644 382 458 328
50000amol-500amol 5 0.981 0.998 0.994 6.644 363 466 370
25000amol-250amol 5 0.978 0.992 0.997 6.644 236 374 214
125amol-50000amol 5 0.975 0.996 0.995 8.644 370 391 413
12500amol-250amol 5 0.974 0.993 0.997 5.644 136 108 132
25000amol-500amol 5 0.977 0.990 0.994 5.644 253 370 203
2500amol-50000amol 5 0.967 0.996 0.992 4.322 302 488 263
12500amol-500amol 5 0.969 0.987 0.993 4.644 159 97 132
50000amol-50amol 5 0.973 0.984 0.990 9.966 578 488 575
50000amol-5000amol 5 0.965 0.994 0.986 3.322 312 471 319
25000amol-50amol 5 0.973 0.981 0.990 8.966 597 440 548
25000amol-2500amol 5 0.957 0.985 0.993 3.322 48 458 73
12500amol-2500amol 5 0.957 0.995 0.982 2.322 20 78 40
12500amol-50amol 5 0.932 0.985 0.988 7.966 521 324 483
## frequent concentrations :
layout(matrix(1:2), heights=c(1,2.5))
plotConcHist(mat=tmp, ref=table1)

## representative ROC
jR <- match(rownames(AucAll)[AucRep[6-gr]], names(rocPD))
plotROC(rocPD[[jR]], rocMQ[[jR]], rocPL[[jR]], col=colPanel, methNames=methNa, pointSi=0.8, xlim=c(0,0.45),
  txtLoc=c(0.12,0.1,0.033), tit=paste("Cluster",gr," Example: ",names(rocPD)[jR]), legCex=1)

## This required package 'wrGraph' at version 1.2.5 (or higher)
if(packageVersion("wrGraph")  >= "1.2.5") {
  layout(matrix(1:4,ncol=2))
  try(VolcanoPlotW(testPD, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[1], expFCarrow=TRUE, silent=TRUE),silent=TRUE)
  try(VolcanoPlotW(testMQ, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[2], expFCarrow=TRUE, silent=TRUE),silent=TRUE)
  try(VolcanoPlotW(testPL, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[3], expFCarrow=TRUE, silent=TRUE),silent=TRUE)}

ROC Curves for 2nd Best Cluster (the ‘++++’)

gr <- 4
j <- match(rownames(AucAll)[AucRep[6-gr]], colnames(testPD$t))

## table of all proteins in cluster
useLi <- which(AucAll[,"cluNo"]==gr)
tmp <- cbind(round(as.data.frame(AucAll)[useLi,c("cluNo","Auc.PD","Auc.MQ","Auc.PL")],3),
  as.data.frame(table1)[match(names(useLi),rownames(table1)), c(2,5,7,9)])
kable(tmp, caption="AUC details for cluster '++++' pairwise-comparisons ", align="c")
AUC details for cluster ‘++++’ pairwise-comparisons
cluNo Auc.PD Auc.MQ Auc.PL log2rat sig.PD.BH sig.MQ.BH sig.PL.BH
250amol-5000amol 4 0.970 0.949 0.993 4.322 200 84 144
5000amol-500amol 4 0.963 0.927 0.987 3.322 181 84 134
25000amol-5000amol 4 0.954 0.934 0.979 2.322 52 384 72
12500amol-50000amol 4 0.936 0.952 0.979 2.000 219 213 169
125amol-25000amol 4 0.956 0.914 0.995 7.644 346 352 303
12500amol-5000amol 4 0.923 0.927 0.939 1.322 23 7 36
2500amol-250amol 4 0.927 0.879 0.982 3.322 113 36 104
## frequent concentrations :
layout(matrix(1:2), heights=c(1,2.5))
plotConcHist(mat=tmp, ref=table1)

## representative ROC
jR <- match(rownames(AucAll)[AucRep[6-gr]], names(rocPD))
plotROC(rocPD[[jR]], rocMQ[[jR]], rocPL[[jR]], col=colPanel, methNames=methNa, pointSi=0.8, xlim=c(0,0.45),
  txtLoc=c(0.12,0.1,0.033), tit=paste("Cluster",gr," Example: ",names(rocPD)[jR]), legCex=1)

if(packageVersion("wrGraph")  >= "1.2.5"){
  layout(matrix(1:4,ncol=2))
  try(VolcanoPlotW(testPD, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[1], expFCarrow=TRUE, silent=TRUE),silent=TRUE)
  try(VolcanoPlotW(testMQ, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[2], expFCarrow=TRUE, silent=TRUE),silent=TRUE)
  try(VolcanoPlotW(testPL, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[3], expFCarrow=TRUE, silent=TRUE),silent=TRUE)}

ROC Curves for the 3rd Best Cluster (the ‘+++’)

gr <- 3
j <- match(rownames(AucAll)[AucRep[6-gr]], colnames(testPD$t))

## table of all proteins in cluster
useLi <- which(AucAll[,"cluNo"]==gr)
tmp <- cbind(round(as.data.frame(AucAll)[useLi,c("cluNo","Auc.PD","Auc.MQ","Auc.PL")],3),
  as.data.frame(table1)[match(names(useLi),rownames(table1)), c(2,5,7,9)])
kable(tmp, caption="AUC details for cluster '+++' pairwise-comparisons ", align="c")
AUC details for cluster ‘+++’ pairwise-comparisons
cluNo Auc.PD Auc.MQ Auc.PL log2rat sig.PD.BH sig.MQ.BH sig.PL.BH
125amol-5000amol 3 0.887 0.904 0.973 5.322 273 98 209
12500amol-125amol 3 0.876 0.886 0.992 6.644 228 122 234
5000amol-50amol 3 0.921 0.872 0.958 6.644 610 379 528
2500amol-500amol 3 0.897 0.886 0.963 2.322 128 52 103
2500amol-5000amol 3 0.811 0.957 0.922 1.000 12 2 16
25000amol-50000amol 3 0.855 0.882 0.932 1.000 142 64 95
2500amol-50amol 3 0.863 0.851 0.906 5.644 563 333 497
12500amol-25000amol 3 0.893 0.730 0.940 1.000 12 85 29
## frequent concentrations :
layout(matrix(1:2), heights=c(1,2.5))
plotConcHist(mat=tmp, ref=table1)

## representative ROC
jR <- match(rownames(AucAll)[AucRep[6-gr]], names(rocPD))
plotROC(rocPD[[jR]],rocMQ[[jR]],rocPL[[jR]], col=colPanel, methNames=methNa, pointSi=0.8, xlim=c(0,0.45),
  txtLoc=c(0.12,0.1,0.033), tit=paste("Cluster",gr," Example: ",names(rocPD)[jR]), legCex=1)

if(packageVersion("wrGraph")  >= "1.2.5"){
  layout(matrix(1:4,ncol=2))
  try(VolcanoPlotW(testPD, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[1], expFCarrow=TRUE, silent=TRUE),silent=TRUE)
  try(VolcanoPlotW(testMQ, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[2], expFCarrow=TRUE, silent=TRUE),silent=TRUE)
  try(VolcanoPlotW(testPL, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[3], expFCarrow=TRUE, silent=TRUE),silent=TRUE)}

ROC Curves for the 4th Best Cluster (the ‘++’)

gr <- 2
j <- match(rownames(AucAll)[AucRep[6-gr]], colnames(testPD$t))

## table of all proteins in cluster
useLi <- which(AucAll[,"cluNo"]==gr)
tmp <- cbind(round(as.data.frame(AucAll)[useLi,c("cluNo","Auc.PD","Auc.MQ","Auc.PL")],3),
  as.data.frame(table1)[match(names(useLi),rownames(table1)), c(2,5,7,9)])
kable(tmp, caption="AUC details for cluster '++' pairwise-comparisons ", align="c")
AUC details for cluster ‘++’ pairwise-comparisons
cluNo Auc.PD Auc.MQ Auc.PL log2rat sig.PD.BH sig.MQ.BH sig.PL.BH
125amol-2500amol 2 0.72 0.502 0.947 4.322 195 93 182
## frequent concentrations :
layout(matrix(1:2), heights=c(1,2.5))
plotConcHist(mat=tmp, ref=table1)

## representative ROC
jR <- match(rownames(AucAll)[AucRep[6-gr]], names(rocPD))
plotROC(rocPD[[jR]], rocMQ[[jR]], rocPL[[jR]], col=colPanel, methNames=methNa, pointSi=0.8, xlim=c(0,0.45),
  txtLoc=c(0.12,0.1,0.033), tit=paste("Cluster",gr," Example: ",names(rocPD)[jR]), legCex=1)