1 Introduction

1.1 Mini overview of magmaR, magma, and the Mount Etna data library system

This vignette focuses on how to upload data to magma via its R-client, magmaR.

Magma is the data warehouse of the Mount Etna data library system.

For a deeper overview of the structure of magma and the Mount Etna data library system, please see the download-focused vignette, vignette("Download", package = "magmaR"), or Mount Etna’s own documentation at https://mountetna.github.io/magma.html.

1.2 Scope of this vignette

This vignette assumes that you have already gone through the download-focused vignette, vignette("Download", package = "magmaR"), which covers how to 1) install magmaR, 2) use a token for authentication, and 3) switch, if needed, between the production / staging / development magma environments.

This vignette focuses on use-cases where a user wishes to push data, from their own system, to magma.

Not all Mount Etna users have write privileges, so not all magmaR users will have need for this vignette.

For those that do, please note: Sending data to magma is an advanced use-case which needs to be treated with due care. The functions involved have the ability to overwrite data, so it is imperative, for data integrity purposes, that inputs to these functions are double-checked in order to make sure that they target only the intended records & attributes.

Also note that a user’s write privileges are project-specific, so it is unlikely that you will be able to run any code, exactly as it exists in this vignette, without getting an authorization error. (That also means you don’t run the risk of breaking our download vignette by testing out any fun alterations of the code in here… trade-offs =] .)

1.3 How magmaR functions work

In general, magmaR functions will:

  1. Take in inputs from the user.
  2. Make a curl request that calls on a magma function to either send or receive desired data.
  3. Restructure any received data, typically minimally, to be more accessible for downstream analyses.
  4. Return the output.

Steps 3 and 4 are very simple for upload functions because the only return from magma will be curl request attributes that indicate whether the call to magma/update worked.

So in this vignette, our singular focus will be on how to input your data so that magmaR can send it to magma properly.

2 magmaR’s data upload functions

magma has just one data input function, /update.

magmaR provides two functions for sending data into magma via this function, updateValues() and updateMatrix().

2.1 updateValues()

updateValues() is the main workhorse function of magmaR’s data upload capabilities. It largely mimics magma/update, except that the hash structures used by magma/update do not exist within R. Thus the format for the revisions input is a nested list, rather than a nested hash.

The function has 2 main inputs, project and revisions:

  • project is simply the String name of the project that you wish to upload data to; e.g. updateValues(project = "example", ...).
  • revisions includes information about which model(s), which record(s), and which attribute(s) to update, and with what value(s). Each of these levels is encoded as a nested list where the format looks something like:
revisions = list(
    modelName = list(
        recordName = list(
            attributeName = value(s)
        )
    )
)

To make more than one update within a single call, you can simply add an additional index at any of these levels.

So for example, the below would update…

# 2 attributes for the same record
revisions = list(
    modelName = list(
        recordName = list(
            attributeName1 = value(s),
            attributeName2 = value(s)
            )
        )
    )

# The same attribute for 2 different records
revisions = list(
    modelName = list(
        recordName1 = list(
            attributeName1 = value(s)
            ),
        recordName2 = list(
            attributeName1 = value(s)
            )
        )
    )

# Some attribute for 2 different records of two different models
revisions = list(
    modelName1 = list(
        recordName = list(
            attributeName = value(s)
            )
        ),
    modelName2 = list(
        recordName = list(
            attributeName = value(s)
            )
        )
    )
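When the values to upload already sit in a table, the nested list can be built programmatically rather than typed out by hand. Below is a minimal sketch of that idea using base R only. The helper name revisions_from_df() and its arguments are hypothetical (they are not part of magmaR), and the sketch assumes one row per record, with one column holding the record names.

```r
# Hypothetical helper, NOT part of magmaR.
# Builds a 'revisions' list for a single model from a data.frame in which
# one column holds record names and every other column is an attribute.
revisions_from_df <- function(df, model, id_col) {
    attrs <- setdiff(names(df), id_col)
    records <- lapply(seq_len(nrow(df)), function(i) {
        # One row becomes a named list: attributeName = value
        as.list(df[i, attrs, drop = FALSE])
    })
    names(records) <- df[[id_col]]
    # Wrap the records in a list named by the model
    stats::setNames(list(records), model)
}

df <- data.frame(
    name = c("EXAMPLE-HS1-WB1", "EXAMPLE-HS2-WB1"),
    biospecimen_type = c("Whole Blood", "Whole Blood"),
    stringsAsFactors = FALSE
)

revs_from_df <- revisions_from_df(df, model = "biospecimen", id_col = "name")

# Inspect the nesting: model > record > attribute
str(revs_from_df)
```

Lists built this way for different models can be combined with c() into a single revisions input, since each is a list named by its model.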

Let’s try it out with some real examples which target the same “example” project that we used in the download vignette.

To refresh, the model map for this project is below.

[Figure: model map of the “example” project]

The “biospecimen” and “rna_seq” models that we will target have attributes…

library(magmaR)
retrieveAttributes(target = prod, "example", "biospecimen")
## [1] "subject"          "name"             "biospecimen_type" "rna_seq"         
## [5] "flow"
retrieveAttributes(target = prod, "example", "rna_seq")
## [1] "biospecimen"     "tube_name"       "expression_type" "cell_number"    
## [5] "gene_tpm"        "gene_counts"     "fraction"

Say we wanted to update the “biospecimen_type” attribute of 2 records from the “biospecimen” model, and the “fraction” attribute for 1 record from the “rna_seq” model. The code for this could be:

# Create 'revisions'
revs <- list(
    "biospecimen" = list(
        "EXAMPLE-HS1-WB1" = list(biospecimen_type = "Whole Blood"),
        "EXAMPLE-HS2-WB1" = list(biospecimen_type = "Whole Blood")
        ),
    "rna_seq" = list(
        "EXAMPLE-HS1-WB1-RSQ1" = list(fraction = "Tcells")
    )
)

# Run updateValues()
updateValues(
    target = prod,
    project = "example",
    revisions = revs)

A user would then see a summary of models/records to be updated, followed by a prompt to proceed or not:

For model "biospecimen", this update() will update 2 records:
    EXAMPLE-HS1-WB1
    EXAMPLE-HS2-WB1
For model "rna_seq", this update() will update 1 records:
    EXAMPLE-HS1-WB1-RSQ1

Proceed, Y/n?

It is highly recommended that these outputs be checked carefully for accuracy before proceeding.
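For larger revisions lists, it can also help to inspect the list yourself before calling updateValues() at all. The sketch below is one possible way to do that with base R. summarize_revisions() is a hypothetical helper, not a magmaR function, and its printed format is only loosely modeled on the summary shown above.

```r
# Hypothetical helper, NOT part of magmaR.
# Prints each model, its records, and the attributes targeted per record,
# then invisibly returns the number of records per model.
summarize_revisions <- function(revisions) {
    for (model in names(revisions)) {
        records <- names(revisions[[model]])
        cat('Model "', model, '": ', length(records), " record(s)\n", sep = "")
        for (record in records) {
            attrs <- names(revisions[[model]][[record]])
            cat("    ", record, ": ", paste(attrs, collapse = ", "), "\n", sep = "")
        }
    }
    invisible(vapply(revisions, length, integer(1)))
}

# Same structure as the 'revs' object used in this section
revs <- list(
    "biospecimen" = list(
        "EXAMPLE-HS1-WB1" = list(biospecimen_type = "Whole Blood"),
        "EXAMPLE-HS2-WB1" = list(biospecimen_type = "Whole Blood")
    ),
    "rna_seq" = list(
        "EXAMPLE-HS1-WB1-RSQ1" = list(fraction = "Tcells")
    )
)

record_counts <- summarize_revisions(revs)
```

Unlike the built-in summary, this version also lists the attribute names per record, which makes it easier to spot an attribute that was targeted by mistake.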

When running updateValues() in non-interactive contexts, such as scripts or .Rmd knits, this user prompt can be turned off by adding the input auto.proceed = TRUE. Example:

updateValues(
    target = prod,
    project = "example",
    revisions = revs,
    auto.proceed = TRUE)
## For model "biospecimen", this update() will update 2 records:
##     EXAMPLE-HS1-WB1
##     EXAMPLE-HS2-WB1
## For model "rna_seq", this update() will update 1 records:
##     EXAMPLE-HS1-WB1-RSQ1
## /update: successful.

After a successful update() a user should see this message (unless verbose has been set to FALSE):

/update: successful.

2.2 Important consideration when adding NEW records

Despite the “update” portion of their names, these functions can also add entirely new records to magma. They are not restricted to updating existing records.

Please note, however, that it is not easy to remove records with an incorrect ID. Only data library engineers have access to such functionality. So if you get a message like the one below, which should appear whenever new records would be created, please heed the warning!

For model "rna_seq", this update() will create 3 NEW records:
    ID1
    ID2
    ID3
WARNING: Check the above carefully. Once created, there is no easy way to remove records from magma.
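A simple extra safeguard is to compare the record names in a revisions list against the identifiers that already exist, before sending anything. The sketch below does this with setdiff(). find_new_records() is a hypothetical helper, not part of magmaR, and the identifiers are made up for illustration. In practice, the vector of existing identifiers would need to come from magma itself, for example from an ID-retrieval call such as retrieveIds() as covered in the download vignette.

```r
# Hypothetical helper, NOT part of magmaR.
# Returns the record names in 'revisions' for 'model' that are absent
# from 'existing_ids', i.e. the records an update would newly create.
find_new_records <- function(revisions, model, existing_ids) {
    setdiff(names(revisions[[model]]), existing_ids)
}

# Made-up identifiers standing in for what already exists in magma
existing <- c("EXAMPLE-HS1-WB1-RSQ1", "EXAMPLE-HS2-WB1-RSQ1")

revs_check <- list(
    rna_seq = list(
        "EXAMPLE-HS1-WB1-RSQ1"  = list(fraction = "Tcells"),
        "EXAMPLE-HS1-WB1-RSQ01" = list(fraction = "Tcells")  # typo in the ID
    )
)

new_ids <- find_new_records(revs_check, "rna_seq", existing)
if (length(new_ids) > 0) {
    message("These records would be NEWLY created: ",
            paste(new_ids, collapse = ", "))
}
```

An empty result means every targeted record already exists. Anything else deserves a second look before proceeding.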

2.3 updateMatrix(), a type-dedicated update function

As the name suggests, updateMatrix() is a convenient wrapper around updateValues() that is meant specifically for matrix data. It allows a user to point magmaR to either a file containing matrix data, or to an already constructed matrix, without needing to manually convert such data to the revisions-input format.

Internally, the function performs some necessary validations, adjusts the matrix into the proper revisions-input format, then passes it along to updateValues(). After this point, functionality is similar to what has already been described above: the targeted models/records will be summarized and the user will be prompted before the actual magma/update will be performed (unless that prompt is turned off with the auto.proceed input).
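To give a sense of what converting a matrix into the revisions-input format involves, here is a rough sketch in base R. It is an illustration only and should not be read as magmaR’s actual internal code: matrix_to_revisions() is a hypothetical helper, and it assumes that each column of the matrix becomes one record whose target attribute holds that column’s values.

```r
# Hypothetical helper, an illustration only and NOT magmaR's internal code.
# Assumption: each matrix column is one record, and the column's values
# become the value of the target attribute for that record.
matrix_to_revisions <- function(mat, model, attribute) {
    records <- lapply(colnames(mat), function(id) {
        # attributeName = numeric vector of that record's values
        stats::setNames(list(unname(mat[, id])), attribute)
    })
    names(records) <- colnames(mat)
    stats::setNames(list(records), model)
}

# A tiny made-up counts matrix: 2 features by 2 records
counts <- matrix(
    c(4, 231, 8, 43),
    nrow = 2,
    dimnames = list(
        c("gene1", "gene2"),
        c("EXAMPLE-HS10-WB1-RSQ1", "EXAMPLE-HS11-WB1-RSQ1")
    )
)

revs_mat <- matrix_to_revisions(counts, model = "rna_seq", attribute = "gene_counts")
str(revs_mat)
```

The real function also validates its inputs before this step, which the sketch leaves out.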

Usage differences compared to updateValues(): Here, all of projectName, modelName, and attributeName must be given as their own separate inputs in addition to the matrix input. The matrix must be formatted to have column names equal to the recordNames that should be updated, and row names that are among the allowed ‘options’ for the target attribute.

To update the raw counts of our “rna_seq” model from either a csv, a tsv, or directly from a matrix, we could use the code below:

### From a csv
updateMatrix(
    target = prod,
    projectName = "example",
    modelName = "rna_seq",
    attributeName = "gene_counts",
    matrix = "../tests/testthat/rna_seq_counts.csv")

### From a tsv, set the 'separator' input to "\t"
updateMatrix(
    target = prod,
    projectName = "example",
    modelName = "rna_seq",
    attributeName = "gene_counts",
    matrix = "../tests/testthat/rna_seq_counts.tsv",
    separator = "\t")

### From an already loaded matrix:
matrix <- retrieveMatrix(target = prod, "example", "rna_seq", "all", "gene_counts")
updateMatrix(
    target = prod,
    projectName = "example",
    modelName = "rna_seq",
    attributeName = "gene_counts",
    matrix = matrix)

Let’s explore the structure of matrix a little bit, noting a couple things:

head(matrix, n = c(6,2))
##       EXAMPLE-HS10-WB1-RSQ1 EXAMPLE-HS11-WB1-RSQ1
## gene1                     4                     8
## gene2                   231                    43
## gene3                   861                   155
## gene4                  2077                   427
## gene5                     3                     2
## gene6                     0                     0

  1. Column names for the matrix are record identifiers for the target “rna_seq” model.
  2. Row names for the matrix are feature names. For rna_seq data, for example, these will be gene names of some type, typically Ensembl IDs.
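Both points can be checked locally before a matrix is handed to updateMatrix(). The following is a minimal sketch with base R. check_matrix_names() is a hypothetical helper, not part of magmaR, and it only covers the presence and uniqueness of names. Whether the row names are among the allowed options of the target attribute would still need to be checked against magma.

```r
# Hypothetical helper, NOT part of magmaR.
# Returns a character vector describing naming problems (empty if none).
check_matrix_names <- function(mat) {
    problems <- character(0)
    if (is.null(colnames(mat))) {
        problems <- c(problems, "no column names (record identifiers) found")
    } else if (anyDuplicated(colnames(mat)) > 0) {
        problems <- c(problems, "duplicated column names (record identifiers)")
    }
    if (is.null(rownames(mat))) {
        problems <- c(problems, "no row names (feature names) found")
    } else if (anyDuplicated(rownames(mat)) > 0) {
        problems <- c(problems, "duplicated row names (feature names)")
    }
    problems
}

# A properly named, made-up matrix passes
good <- matrix(
    1:4, nrow = 2,
    dimnames = list(c("gene1", "gene2"),
                    c("EXAMPLE-HS10-WB1-RSQ1", "EXAMPLE-HS11-WB1-RSQ1"))
)
good_problems <- check_matrix_names(good)

# A matrix without any names is flagged on both counts
bad_problems <- check_matrix_names(matrix(1:4, nrow = 2))
print(bad_problems)
```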

As with updateValues(), a successful update via updateMatrix() should produce this final output line (unless one sets verbose = FALSE):

/update: successful.

3 Session information

sessionInfo()
## R version 4.0.5 (2021-03-31)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dittoSeq_1.3.14  ggplot2_3.3.3    magmaR_1.0.1     vcr_0.6.0       
## [5] BiocStyle_2.18.1
## 
## loaded via a namespace (and not attached):
##  [1] MatrixGenerics_1.2.1        Biobase_2.50.0             
##  [3] httr_1.4.2                  sass_0.3.1                 
##  [5] jsonlite_1.7.2              bslib_0.2.4                
##  [7] assertthat_0.2.1            highr_0.9                  
##  [9] BiocManager_1.30.12         triebeard_0.3.0            
## [11] urltools_1.7.3              stats4_4.0.5               
## [13] GenomeInfoDbData_1.2.4      yaml_2.2.1                 
## [15] ggrepel_0.9.1               pillar_1.6.0               
## [17] lattice_0.20-44             glue_1.4.2                 
## [19] digest_0.6.27               RColorBrewer_1.1-2         
## [21] GenomicRanges_1.42.0        XVector_0.30.0             
## [23] colorspace_2.0-1            cowplot_1.1.1              
## [25] htmltools_0.5.1.1           Matrix_1.3-3               
## [27] plyr_1.8.6                  pkgconfig_2.0.3            
## [29] pheatmap_1.0.12             httpcode_0.3.0             
## [31] magick_2.7.2                bookdown_0.22              
## [33] zlibbioc_1.36.0             purrr_0.3.4                
## [35] scales_1.1.1                webmockr_0.8.0             
## [37] whisker_0.4                 tibble_3.1.1               
## [39] farver_2.1.0                generics_0.1.0             
## [41] IRanges_2.24.1              ellipsis_0.3.2             
## [43] withr_2.4.2                 SummarizedExperiment_1.20.0
## [45] BiocGenerics_0.36.1         magrittr_2.0.1             
## [47] crayon_1.4.1                evaluate_0.14              
## [49] fansi_0.4.2                 tools_4.0.5                
## [51] lifecycle_1.0.0             matrixStats_0.58.0         
## [53] stringr_1.4.0               S4Vectors_0.28.1           
## [55] munsell_0.5.0               DelayedArray_0.16.3        
## [57] compiler_4.0.5              jquerylib_0.1.4            
## [59] GenomeInfoDb_1.26.7         rlang_0.4.11               
## [61] grid_4.0.5                  RCurl_1.98-1.3             
## [63] ggridges_0.5.3              SingleCellExperiment_1.12.0
## [65] labeling_0.4.2              bitops_1.0-7               
## [67] base64enc_0.1-3             rmarkdown_2.7              
## [69] gtable_0.3.0                DBI_1.1.1                  
## [71] curl_4.3.1                  fauxpas_0.5.0              
## [73] R6_2.5.0                    gridExtra_2.3              
## [75] knitr_1.33                  dplyr_1.0.5                
## [77] utf8_1.2.1                  stringi_1.5.3              
## [79] parallel_4.0.5              crul_1.1.0                 
## [81] Rcpp_1.0.6                  vctrs_0.3.8                
## [83] tidyselect_1.1.1            xfun_0.22