PARAFAC analysis of EEM data to separate DOM components in R

staRdom: spectroscopic analysis of dissolved organic matter in R

Matthias Pucher matthias.pucher@wcl.ac.at

July 28 2020

1 Introduction

staRdom is a package for R version 3 (R Development Core Team 2019) to analyse fluorescence and absorbance data of dissolved organic matter (DOM). The important features are:

2 Introduction

staRdom is a package for R version (R Development Core Team 2019) to analyse fluorescence and absorbance data of dissolved organic matter (DOM). The important features are:

staRdom has been developed and is maintained at WasserCluster Lunz (http://www.wcl.ac.at/) and the University of Natural Resources and Life Sciences, Vienna(http://www.boku.ac.at/).

A comparison in quality of results and performance of the PARAFAC algorithms in staRdom and drEEM can be found in Pucher et al. (2019). The analysis process was developed and discussed in other papers and tutorials (e.g. Murphy et al. 2013). The aim of this package is to provide an elaborate and flexible way of using PARAFAC analysis of EEM data on the R platform. The offered functions follow the concept of Murphy et al. (2013). We strongly recommend to read this paper to understand the underlying procedure of the analysis.

For easy data correction and calculations of peaks, indices and slope parameters in R but without using R functions and codes please see vignette for basic analysis. We use functions from the R package eemR, for information on eemR and its functions please see the eemR vignette. Details on the actual PARAFAC calculation can be found in the multiway documentation.

This tutorial was created using R version 3.6.3 and the packages dplyr (Wickham et al. 2020) and tidyr (Wickham, Henry, and RStudio 2020).

library(dplyr)
library(tidyr)

2.1 Aim of this document

This document describes a complete fluorescence and absorbance analysis of example data using R functions and offers a variety of options and ways to gain validated results.

2.2 Remark on errors and warnings

In R, there can be errors and warnings. Errors stop the calculations and no results are returned because of a severe problem. Warnings do not stop calculations but are hints, that something unusual happened. If you encounter any warnings please think of the reason. In some cases, this reason might be in-line with what you want and is no problem at all. An examples for a not serious warnings are interpolated values because of missing wavelengths or NAs introduced by coercion in some cases. Please have a look at your data to find actual problems.

2.3 Install and load staRdom

staRdom can be installed directly from a CRAN repository:

install.packages("staRdom")

You can also install staRdom from GitHub. The version on GitHub is sometimes a more recent one.

install.packages("devtools") # Run this only, if devtools is not installed already.
devtools::install_github("MatthiasPucher/staRdom")

staRdom can then be loaded:

library("staRdom")
## Loading required package: ggplot2
## Loading required package: eemR
## Loading required package: parallel
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

2.4 Parallel processing

Some of the functions can use several CPU cores in parallel to speed up the calculations. You can set the number of parallel processes to be used. In tests, best results were achieved setting cores to the number of physical cores in your computer. Using virtual cores introduced by Hyper-threading did increase calculation times again. detectCores(logical = FALSE) will give you the appropriate number. On some computers this function returns the number of virtual cores, which has to be checked and set manually then.

cores <- detectCores(logical = FALSE)

2.5 Overview of analysis steps

The diagram below shows the most important steps in the analysis of fluorescence and absorbance data to identify DOM parameters. After importing the data and applying the desired correction steps, peaks, indices and absorbance parameters can be calculated. Besides that, after the data correction, compounds can be separated using a PARAFAC model. The PARAFAC model development includes cyclical steps to come to a satisfying result. Model validation is very important at that point to get trustful results. As a last step, the compounds determined in the model can be linked to already published ones or other experimental data and by that interpreted in a biogeochemical context.

3 Example data coming with the package

You can run a complete data correction and analysis as shown in this example with the data provided by the package and downloaded within the tutorial. However, you can start the tutorial with your own data right away. The function system.file() used below returns the folders, where the example data is stored in. In case you use your own, you can just type in the path to your data as a string (e.g. “C:/folder/another folder”) without using system.file.

3.1 Raw EEM data

The example EEM data is saved in a folder and accessible by system.file("extdata/EEMs", package = "staRdom"). Due to package size issues, only a small amount of samples is included and for a complete reproduction of the PARAFAC analysis demonstrated below, the drEEM data has to be downloaded (see below how to accomplish that).

3.2 Raw absorbance data

The absorbance data is saved in a folder accessible by system.file("extdata/absorbance", package = "staRdom").

3.3 Additional raw data

There is a table with data on diluted samples included. The table can serve as an example in case you worked with diluted samples. It is saved in a folder accessible by system.file("extdata/metatable_dreem.csv", package = "staRdom").

3.4 Corrected EEM data

The corrected EEM data set is loaded from the drEEM website.

3.5 PARAFAC model

An already calculated PARAFAC model was included in the package. It can be loaded into the R environment by data(pf_models).

4 Import raw data

For a complete analysis in staRdom you need fluorescence and absorbance data. In many cases also a metatable is needed. Reading and linking the data needs some preparations and attention. Below we show how to import these data using the provided example data.

EEM data import is done with eem_read (package eemR). Currently you can use it to import from Varian Cary Eclipse, Horiba Aqualog, Horiba Fluoromax-4, Shimadzu and Hitachi F-7000 instruments as well as generic csv files.

folder <- system.file("extdata/EEMs/", package = "staRdom") # folder containing example EEMs
eem_list <- eem_read(folder, recursive = TRUE, import_function = eem_csv) # in case you use your own data, just replace folder by a path. e.g. "C:/folder/another folder" and change import_function according to instrument.

This is an example to load data from instrument specific ASCII files. import_function has to be set accordingly: Varian Cary Eclipse ("cary"), Horiba Aqualog ("aqualog"), Horiba Fluoromax-4 ("fluoromax4"), Shimadzu ("shimadzu"), Hitachi F-7000 (eem_hitachi) and generic csv files (eem_csv when excitation wavelengths are in columns or eem_csv2 when emission wavelengths are in columns). Please note, eem_hitachi, eem_csv and eem_csv2 are without hyphens!

eem_list <- eem_read(folder, import_function = "cary")

If your instrument is not amongst the mentioned ones, eemR offers the opportunity to create custom import functions (see according vignette).

To have a look at your data, you can plot the samples in several plots using eem_overview_plot. ggeem is another function that plots EEMs, but it produces only one plot, which can be overloaded in case of many samples. spp defines the number of samples per plot. Not present data is not shown (light greyish), NAs are dark grey. contour = TRUE adds contour lines to the EEM plots.

eem_overview_plot(eem_list, spp=9, contour = TRUE)
## [[1]]

Absorbance data is imported with absorbance_read. It is read from CSV or TXT files. absorbance_path is either a directory and all files in there are read in, or a single file that is read in. Tables can either contain data from one sample or from several samples in columns. The first column is considered the wavelength column. A multi-sample file must have sample names as column names. All tables are combined to one with one wavelength column and one column for each sample containing the absorbance data.

absorbance_path = system.file("extdata/absorbance", package = "staRdom") # load example data, set a path without using system.file to use your own data e.g. "C:/folder/another folder"

To load the data, please use:

absorbance <- absorbance_read(absorbance_path, cores = cores) # load csv or txt tables in folder

In some cases, it is necessary to use individual dilution factors, Raman areas or pathlengths. This data is read using the base R function read.table. Please check that the column separator of your file is the same as in read.table (sep, which is defined as blank space here, hence sep = “ “). The same is true for the decimal separator of your numbers (dec, which is defined as dec = “,” in our case). Here, we import a table containing meta data provided with the package as an example. In the example only dilution data is provided via this table.

metatable <- system.file("extdata/metatable_dreem.csv",package = "staRdom") # path to example data, can be replaced by a path to your own data
meta <- read.table(metatable, header = TRUE, sep = ",", dec = ".", row.names = 1) # load data

If you have not created a table containing meta data, you can use eem_metatamplate to create a new one. It can be written to a file using write.csv. Here is an example:

eem_metatemplate(eem_list, absorbance) %>%
  write.csv(file="metatable.csv", row.names = FALSE)

4.1 Check data

The eem_checkdata function prints out sample names and folders, where possible problems are detected. Additionally a named list with these sample names can be saved (variable problem in the example code below).

The data can be checked for possible incorrect entries (in brackets names of the according list elements):

Possible_problem_found is a logic variable in the list, stating whether a possible problem was found or not.

In case of problems, you need to reorganise your data manually and revise the steps mentioned above. The eem_checkdata function can be run any time during the process to see if any changes in the dataset caused inconsistencies. No correction of these problems is done automatically!

problem <- eem_checkdata(eem_list,absorbance,meta,metacolumns = c("dilution"),error=FALSE)
## NAs were found in the following samples (ratio):  
## d423sf (0), d457sf (0), d492sf (0), d667sf (0), dblank_di25se06 (0), d433sf (0), d437sf (0), d441sf (0), dblank_mq11my (0.02), 
## Please think about interpolating the data, this might give more stable and meaningful PARAFAC models!
## EEM samples missing absorbance data:
## dblank_di25se06 in  
## /var/tmp/RtmpyYRByu/Rinst6b497eb8b281/staRdom/extdata/EEMs//di25se06
## dblank_mq11my in  
## /var/tmp/RtmpyYRByu/Rinst6b497eb8b281/staRdom/extdata/EEMs//mq11my

EEM data can contain NAs to perform a PARAFAC analysis as long as there is a certain amount of numeric values present. Depending on the data set, more than 15 to 20 % might cause non-converging models or models converging in a lokal minimum. We always suggest an interpolation of the data to get good results and a stable PARAFAC modelling process (Elcoroaristizabal et al. 2015).

Different EEM sizes are not a problem, because there is a way to address this as shown below. For a blank subtraction or a PARAFAC analysis, the available data for each sample must be similar.

The above mentioned absence of absorbance data for the blank samples can be justified, as absorbance of ultrapure water is theoretically 0 over the complete spectrum. In the further analysis, this is no problem.

5 Data preparation and correction

If you have used the template for the peak picking (vignette for basic analysis), the correction is already done and you can start a PARAFAC analysis with the eem_list resulting from this template.

5.1 Sample names

If you want to change your sample names, you can use eem_name_replace. In the example, “(FD3)” is removed as it is not part of the samples names but originates from the conversion of Hitachi FD3 files to txt files. You can use this function for any replacement in file names. Regular expressions (please see the help on regular expressions) can be used in search and replace patterns provided.

eem_list <- eem_name_replace(eem_list,c("\\(FD3\\)"),c(""))

5.2 Absorbance baseline correction

The instrumental baseline drift in absorbance data can be corrected by subtracting the mean of the absorbance at high wavelengths (Li and Hur 2017). The default is to use the spectrum between 680 and 700 nm but any other range can be set manually.

absorbance <- abs_blcor(absorbance,wlrange = c(680,700))

5.3 Spectral correction

Spectral correction is done to remove instrument-specific influences on the EEMs (DeRose and Resch-Genger 2010). Some instruments can do this automatically. The example EEMs are corrected using eem_spectral_cor. Correction vectors have to be provided in the same range as the EEM measurements. If this is not the case, the EEMs can be cut to this range using eem_range.

excorfile <- system.file("extdata/CorrectionFiles/xc06se06n.csv",package="staRdom")
Excor <- data.table::fread(excorfile)

emcorfile <- system.file("extdata/CorrectionFiles/mcorrs_4nm.csv",package="staRdom")
Emcor <- data.table::fread(emcorfile)

# adjust range of EEMs to cover correction vectors
eem_list <- eem_range(eem_list,ex = range(Excor[,1]), em = range(Emcor[,1]))

eem_list <- eem_spectral_cor(eem_list,Excor,Emcor)

5.4 Blank subtraction

Blanks are samples of ultrapure water that must contain either “nano”, “miliq”, “milliq”, “mq” or “blank” in their file names. They can be used to apply blank subtractionand Raman normalisation (see below) and are used for samples in the same (sub)folders. If multiple blanks were measured, they are averaged. The example data consist of only two folders with one blank each. Blanks are subtracted from each sample to reduce the effects of scatter bands and systematic errors (Murphy et al. 2013).

The blanks must be the same size as the samples they will be subtracted from. As different sample sizes were reported above, all samples will be brought to the same data range before. You can find a description on data and wavelength ranges below.

# extending and interpolation data
eem_list <- eem_extend2largest(eem_list, interpolation = 1, extend = FALSE, cores = cores)

# blank subtraction
eem_list <- eem_remove_blank(eem_list)
## A total of 1 blank EEMs will be averaged.
## A total of 1 blank EEMs will be averaged.
eem_overview_plot(eem_list, spp=9, contour = TRUE)
## [[1]]

5.5 Inner-filter effect correction

Inner-filter effects (IFE) occur when excitation light is absorbed by chromophores. A simple method to correct the IFE is to use the sample’s absorbance. The EEM is multiplied by a correction matrix corresponding to each wavelength pair (Kothawala et al. 2013). In case of a maximum absorbance larger than 1.5 cm⁻¹, the sample needs to be diluted, because the absorbance is to high to be corrected by the applied IFE function (or mathematical IFE correction in general). The example uses a path length of 5 cm (5 cm long cuvettes) for absorption measurements.

In the example below, there is a warning about no absorbance data for the blank samples. This warning is not a problem because in ultrapure water samples there is practically no IFE and the sample is removed in the further analysis.

eem_list <- eem_ife_correction(eem_list,absorbance, cuvl = 5)
## d423sf 
## Range of IFE correction factors: 1.0004 1.0288 
## Range of total absorbance (Atotal) : 1e-04 0.0049 
## 
## d457sf 
## Range of IFE correction factors: 1.0003 1.0208 
## Range of total absorbance (Atotal) : 1e-04 0.0036 
## 
## d492sf 
## Range of IFE correction factors: 1.0004 1.0241 
## Range of total absorbance (Atotal) : 1e-04 0.0041 
## 
## d667sf 
## Range of IFE correction factors: 1.0004 1.0249 
## Range of total absorbance (Atotal) : 1e-04 0.0043
## Warning in FUN(X[[i]], ...): No absorbance data was found for sample
## dblank_di25se06!No absorption data was found for sample dblank_di25se06!
## d433sf 
## Range of IFE correction factors: 1.0003 1.0241 
## Range of total absorbance (Atotal) : 1e-04 0.0041 
## 
## d437sf 
## Range of IFE correction factors: 1.0002 1.0131 
## Range of total absorbance (Atotal) : 0 0.0023 
## 
## d441sf 
## Range of IFE correction factors: 1.0002 1.016 
## Range of total absorbance (Atotal) : 0 0.0028
## Warning in FUN(X[[i]], ...): No absorbance data was found for sample
## dblank_mq11my!No absorption data was found for sample dblank_mq11my!
eem_overview_plot(eem_list, spp=9, contour = TRUE)
## [[1]]

5.6 Raman normalisation

Fluorescence intensities can differ between analyses on different fluorometers, different settings or different days on the same fluorometer. Therefore it can be normalised to a standard scale of Raman Units by dividing all intensities by the area of the Raman peak of a ultrapure water sample. All fluorescence intensities are divided by the area (integral) of the Raman peak (excitation of 350 nm between an emission of 371 nm and 428 nm) (Lawaetz and Stedmon 2009). Values are interpolated if the wavelengths are missing.

You can use blanks, numeric values or data frames as source for the values. In case you use blank samples, again the blanks in the same sub-folder as the sample are used. Blanks are recognised by having “blank”, “miliq”, “milliq”, “nano” or “mq” in the file name. This means regular samples must not have it.

eem_list <- eem_raman_normalisation2(eem_list, blank = "blank")
## A total of 1 blank EEMs will be averaged.
## Raman area: 5687348 
## Raman area: 5687348 
## Raman area: 5687348 
## Raman area: 5687348
## A total of 1 blank EEMs will be averaged.
## Raman area: 5688862 
## Raman area: 5688862 
## Raman area: 5688862
eem_overview_plot(eem_list, spp=9, contour = TRUE)
## [[1]]

Raman areas can be calculated separately with the function eem_raman_area.

5.7 Remove blanks from sample set

From this step onwards, blanks are not needed anymore. You can remove them from your sample set. eem_extract removes all samples including “nano”, “miliq”, “milliq”, “mq” or “blank” in their sample names. select of the package dplyr can do the same for the absorbance data.

eem_list <- eem_extract(eem_list, c("nano", "miliq", "milliq", "mq", "blank"),ignore_case = TRUE)
## Removed sample(s): dblank_di25se06 dblank_mq11my
absorbance <- dplyr::select(absorbance, -matches("nano|miliq|milliq|mq|blank", ignore.case = TRUE))

5.8 Remove and interpolate scattering

The function removes scattering from the samples. remove_scatter is a logical vector where values are ordered by the meaning of “raman1”, “raman2”, “rayleigh1” and “rayleigh2”. remove_scatter_width is either a number or a vector containing 4 different wavelength width in nm, one for each scatter type (Murphy et al. 2013; Lakowicz 2006).

remove_scatter <- c(TRUE, TRUE, TRUE, TRUE)
remove_scatter_width <- c(15,15,15,15)

eem_list <- eem_rem_scat(eem_list, remove_scatter = remove_scatter, remove_scatter_width = remove_scatter_width)
eem_overview_plot(eem_list, spp=9, contour = TRUE)
## [[1]]

Although it is not always necessary, removed scatter areas should be interpolated for reasonable results and faster calculation in the PARAFAC analysis (Elcoroaristizabal et al. 2015). eem_interp offers different methods of interpolation. The types of interpolation are (0) setting all NAs to 0, (1) spline interpolation with mba.points (Lee, Wolberg, and Shin 1997), (2) excitation and emission wavelength-wise interpolation with pchip (Moler 2004) and subsequent mean, (3) excitation wavelength-wise interpolation with pchip and (4) linear excitation and emission wavelength-wise interpolation with na.approx and again subsequent mean calculation. Calculating the mean is a way of ensuring NAs are also interpolated where missing boundary values would make that impossible. In case of interpolation type 1, extend = FALSE prevents the function from extrapolating. We recommend to use type 1 as a start, because its accuracy in interpolating surfaces. The smooth but still complex surface is exactly what one would expect from a theoretical EEM. In case of recognisable patterns in the residual plots close to the scatter areas or many not converging initialisations of the PARAFAC model, another interpolation method could improve the results.

eem_list <- eem_interp(eem_list, cores = cores, type = 1, extend = FALSE)
eem_overview_plot(eem_list, spp=9, contour = TRUE)
## [[1]]

5.9 Correct for dilution

Data are corrected for dilution via eem_dilution. Each EEM is multiplied with a dilution factor (e.g. 10 if 1 part sample was diluted with 9 parts ultrapure water). Either a single numeric value for all samples or a table with a specific value for each sample is used.

This step can also be done manually after calculating peaks, indices and PARAFAC components. By that, a slightly different result has to be expected if EEMs are not normalised prior to creating a PARAFAC model. In other cases there is no difference whether EEMs or results are multiplied by the dilution factor.

dil_data <- meta["dilution"]

eem_list <- eem_dilution(eem_list,dil_data)
eem_overview_plot(eem_list, spp=9) # plot spared, due to no dilution it looks like the previous plot.

5.10 Smooth data

Depending on your instrument, smoothing the data could be beneficial for peak picking. For PARAFAC analysis smoothing is not advised. The parameter n specifies the moving average window size in nm.

eem4peaks <- eem_smooth(eem_list, n = 4, cores = cores)

5.11 Overview of samples

summary prints an overview of the samples containing data ranges, applied correction methods and the manufacturer of the instrument.

summary(eem_list)
##   sample ex_min ex_max em_min em_max is_blank_corrected is_scatter_corrected
## 1 d423sf    230    455    290    702               TRUE                 TRUE
## 2 d457sf    230    455    290    702               TRUE                 TRUE
## 3 d492sf    230    455    290    702               TRUE                 TRUE
## 4 d667sf    230    455    290    702               TRUE                 TRUE
## 5 d433sf    230    455    290    702               TRUE                 TRUE
## 6 d437sf    230    455    290    702               TRUE                 TRUE
## 7 d441sf    230    455    290    702               TRUE                 TRUE
##   is_ife_corrected is_raman_normalized
## 1             TRUE                TRUE
## 2             TRUE                TRUE
## 3             TRUE                TRUE
## 4             TRUE                TRUE
## 5             TRUE                TRUE
## 6             TRUE                TRUE
## 7             TRUE                TRUE

6 Peak picking and indices

Peaks and indices known from the literature (Huguet et al. 2009; Zepp, Sheldon, and Moran 2004; McKnight et al. 2001; Ohno 2002) can be calculated.

If wavelength ranges needed for certain indices or peaks (e.g. the humification index (HIX) uses an excitation wavelength of 254 nm, but EEMs usually contain measurements at 250 and 255 nm) an interpolation is done automatically between existing measurements (no extrapolation!). This will be reported with warnings, however, these warnings can be ignored. The smoothed EEMs eem4peaks are used for the calculation. To extract individual peaks, the function eem_peaks can be used.

bix <- eem_biological_index(eem4peaks)
coble_peaks <- eem_coble_peaks(eem4peaks)
fi <- eem_fluorescence_index(eem4peaks)
hix <- eem_humification_index(eem4peaks, scale = TRUE)

indices_peaks <- bix %>%
  full_join(coble_peaks, by = "sample") %>%
  full_join(fi, by = "sample") %>%
  full_join(hix, by = "sample")

indices_peaks
##   sample       bix           b          t         a          m          c
## 1 d423sf 0.7238682 0.036238767 0.06222814 0.2799546 0.14974696 0.11645331
## 2 d457sf 0.6858719 0.023536584 0.04106616 0.2082118 0.11265412 0.08778000
## 3 d492sf 0.6869648 0.027140701 0.04730339 0.2413028 0.13198615 0.10493114
## 4 d667sf 0.6839838 0.031426888 0.05391093 0.2774084 0.14513535 0.13263827
## 5 d433sf 0.6941625 0.012110049 0.03792344 0.2147849 0.11547600 0.09000859
## 6 d437sf 0.6678838 0.006024978 0.02159146 0.1516322 0.07649198 0.06366574
## 7 d441sf 0.6670705 0.007355762 0.02692251 0.1882532 0.09387812 0.07938853
##         fi       hix
## 1 1.151716 0.8805637
## 2 1.143778 0.8923698
## 3 1.161794 0.8949828
## 4 1.139740 0.8965758
## 5 1.155606 0.9143584
## 6 1.116053 0.9420593
## 7 1.108152 0.9395073

7 Absorbance indices

abs_parms can be used to calculate a254, a300, E2:E3, E4:E6, S275-295, S350-400, S300-700, SR and the wavelength distribution of absorption spectral slopes (Helms et al. 2008; Twardowski et al. 2004; Massicotte 2016; Loiselle et al. 2009). Li and Hur (2017) give a broad overview of possible values and their applications used in recent literature. Missing wavelengths, needed for certain indices or ratios are interpolated automatically.

slope_parms <- abs_parms(absorbance, cuvl = 1, cores = cores)
## Warning: `funs()` is deprecated as of dplyr 0.8.0.
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
slope_parms
##   sample      a254     a300    E2_E3    E4_E6   S275_295   S350_400   S300_700
## 1 d423sf 13.869423 6.341582 7.275041 54.45985 0.01693705 0.01767518 0.01757271
## 2 d433sf 11.629678 5.354673 7.195549 67.97273 0.01685673 0.01775750 0.01764950
## 3 d437sf  6.323142 2.836798 7.235020 39.38501 0.01750176 0.01674770 0.01719949
## 4 d441sf  7.703282 3.396527 7.572209 39.06406 0.01776943 0.01723729 0.01747484
## 5 d457sf 10.051932 4.654212 7.091301 71.26347 0.01675176 0.01752695 0.01741157
## 6 d492sf 11.652366 5.424564 7.060283 73.15475 0.01665879 0.01754663 0.01743985
## 7 d667sf 12.048233 5.542739 7.063998 35.38728 0.01648855 0.01797665 0.01702009
##          SR
## 1 0.9582393
## 2 0.9492738
## 3 1.0450242
## 4 1.0308717
## 5 0.9557717
## 6 0.9494014
## 7 0.9172203

8 Creating a PARAFAC model

Finding an appropriate PARAFAC model is an iterative process. For comparing different methods, we used the same dataset and analysis procedure as in the drEEM tutorial which is a supplement to Murphy et al. (2013).

8.1 Loading data

8.1.1 Load drEEM example dataset

For this tutorial the drEEM example data is used. The complete and corrected data is downloaded from http://models.life.ku.dk/drEEM and converted into an eemlist in R. The example PARAFAC models are calculated based on this complete sample set. You can also adapt that code to import any Matlab EEM data to staRdom.

dreem_raw <- tempfile()
download.file("http://models.life.ku.dk/sites/default/files/drEEM_dataset.zip",dreem_raw)
dreem_data <- unz(dreem_raw, filename="Backup/PortSurveyData_corrected.mat", open = "rb") %>%
  R.matlab::readMat()
unlink(dreem_raw)

eem_list <- lapply(dreem_data$filelist.eem, function(file){
  #file <- dreem_data$filelist.eem[1]
  n <- which(dreem_data$filelist.eem == file)
  file <- file %>%
    gsub("^\\s+|\\s+$", "", .) %>% # trim white spaces in filenames
    sub(pattern = "(.*)\\..*$", replacement = "\\1", .) # remove file extension from sample name
  eem <- list(file = paste0("drEEM/dataset/",file),sample = file,x = dreem_data$XcRU[n,,] %>% as.matrix(),ex = dreem_data$Ex %>% as.vector(), em = dreem_data$Em.in %>% as.vector(), location = "drEEM/dataset/")
  class(eem) <- "eem"
  attr(eem, "is_blank_corrected") <- TRUE
  attr(eem, "is_scatter_corrected") <- FALSE
  attr(eem, "is_ife_corrected") <- TRUE
  attr(eem, "is_raman_normalized") <- TRUE
  attr(eem, "manufacturer") <- "unknown"
  eem
}) %>%
  `class<-`("eemlist")

# add sample name suffix, R has sometimes troubles, when sample names start with a number.
eem_names(eem_list) <- paste0("d",eem_names(eem_list))

In the drEEM tutorial, all samples containing “bl” or “0A” are removed from the set.

ol <- function(x){x==("bl") | x == "0A"}
extract <- dreem_data$sites %>% unlist() %>% ol() %>% which()
eem_list <- eem_list %>% eem_extract(extract)

Scattering has still to be removed and is done as described above.

eem_list <- eem_rem_scat(eem_list, remove_scatter = c(TRUE, TRUE, TRUE, TRUE), remove_scatter_width = c(15,15,18,19), interpolation = FALSE, cores = cores)

If you have worked with the basic analysis template, you can use the resulting data right away. In the case you did the corrections with different sample sets separately and want to combine them, you can use eem_import_dir to combine EEM samples from several RData or RDa files. Put all files in one folder and run the following:

eem_list <- eem_import_dir(dir)

Due to package size issues, no example data is included for this function.

We recommend using eem_checkdata again before continuing the further analysis!

8.2 Sample set wavelength ranges

For a PARAFAC analysis, all samples of the set need to be similar in terms of provided wavelength pairs. In case you have deviating samples there are ways to solve this:

8.3 Find and remove noise in EEM data

An important way of finding noise in EEM data is viewing the samples. You can use the function eem_overview_plot to view all your samples in a convenient way. With the following functions, data can be removed or set to NA (NAs are theoretically no problem for the PARAFAC algorithm, but the calculation process and the results are often more convient without) in different ways, or be interpolated, depending on what you would like to do:

Sample “d667sf” will be used to show this process graphically. To extract the sample the regular expression “^d667sf$” is used. ^ stands for the beginning of the string and $ for the end and ensure an exact match. If you do not use these characters, all samples containing the string in their name are extracted.

This is where we start from:

eem_list %>% 
  eem_extract(sample = "^d667sf$", keep = TRUE) %>%
  ggeem(contour = TRUE)

The noisy range below 250 nm excitation and above 580 nm emission can be removed from the samples with the following command. As mentioned above, this was already removed in the example data.

eem_list <- eem_list %>% eem_range(ex = c(250,Inf), em = c(0,580))

Visually found irregularities in patterns are manually replaced by NA and interpolated. From the sample “d667sf” a band covering the excitation wavelengths from 245 to 350 is removed and a rectangle covering emission wavelengths 560 to 576 and excitation wavelengths 280 to 295 is removed in all samples. For demonstration reasons, the interpolation is done in an extra step but can also be included in the removing function right away (interpolate = TRUE).

eem_list <- eem_list %>%
  eem_setNA(sample = 176, ex = 345:350, interpolate = FALSE) %>%
  eem_setNA(em = 560:576, ex = 280:295, interpolate = FALSE)
eem_list <- eem_interp(eem_list, type = 1, extend = FALSE, cores = cores)

Here, we show the changes, made to sample d667sf by the demonstrated steps above (including the removing of the scatter):

8.4 Explore dataset

It is crucial to find an appropriate number of components in the analysis. If the number of components is too large, it may be the case that one component was split into two; if it is too low, relevant components may get lost (Murphy et al. 2013). To help you find the appropriate number, a series of PARAFAC models can be calculated and compared. In this example 5 models with 3 to 7 components are calculated.

The alternating-least-squares algorithm used to calculate the PARAFAC model optimises the components in order to minimise the residual error. The goal is to find a global minimum of the residual function. Depending on the random start values (random initialisation), different local minima can be found. To get the global minimum, a defined number nstart of initialisations is used for separate model calculations. The model with the smallest error is then used for further analyses, assuming it is a global minimum. 25 might by a good start for nstart although for a profound analysis higher values (e.g. 50) are suggested.

Some of these random initialisations might not converge. If there are still enough converging models, the result is reliable and there is no problem. eem_parafac returns a warning in case of less than 50 % converging models. Furthermore, it is possible to calculate a specific number of converging models by setting the argument strictly converging = TRUE. When using this, due to the iterative restarting of not converging models, this takes more time and using more starts from the beginning is a faster way to get a reasonable number of converging models.

You can speed up the calculations by using multiple cores. Beware, that calculating a PARAFAC model can take some time!

Higher maxit (maximum number of iteration steps in PARAFAC) and lower ctol (tolerance to return result of PARAFAC, should not be larger than 10⁻⁶) increase the accuracy of the model but take more computation time. For a final model, we suggest to use a tolerance of 10⁻⁸ to 10⁻¹⁰.

In this first model creation step, the constraints are tested. While pf1 is calculated without any constraints, pf1n uses the assumption that modes are non-negative only. This is a very common assumption, because fluorescence cannot be negative. Common constraints are none (“uncons”), non-negative (“nonneg”) and unimodal, non-negative (“uninon”). Besides these, other possible constraints can be seen using the command CMLS::const().

PARAFAC modes can be rescaled so that their maximum is 1 and effects are more visible in the plots. The rescaling is corrected in the A mode (sample mode). In case of uneven peak heights, eempf_rescaleBC can help to improve the visibility of your graphs. The parameter newscale specifies the root mean-squared error of each column in matrices B and C. This is compensated in the A mode (sample mode). Alternatively newscale can be set "Fmax"; each peak has a height of 1 then.

The PARAFAC models created below can be loaded from the staRdom package’s example data instead of calculating it:

data(pf_models)
# minimum and maximum of numbers of components
dim_min <- 3
dim_max <- 7
nstart <- 25 # number of similar models from which best is chosen
maxit = 5000 # maximum number of iterations in PARAFAC analysis
ctol <- 10^-6 # tolerance in PARAFAC analysis

# calculating PARAFAC models, one for each number of components
pf1 <- eem_parafac(eem_list, comps = seq(dim_min,dim_max), normalise = FALSE, const = c("uncons", "uncons", "uncons"), maxit = maxit, nstart = nstart, ctol = ctol, cores = cores)

# same model but using non-negative constraints
pf1n <- eem_parafac(eem_list, comps = seq(dim_min,dim_max), normalise = FALSE, const = c("nonneg", "nonneg", "nonneg"), maxit = maxit, nstart = nstart, ctol = ctol, cores = cores)

# rescale B and C modes to a maximum fluorescence of 1 for each component
pf1 <- lapply(pf1, eempf_rescaleBC, newscale = "Fmax")
pf1n <- lapply(pf1n, eempf_rescaleBC, newscale = "Fmax")

Use eempf_compare to plot the created models’ components. You can see the models fits and components (rows) for models with different numbers of components (columns) in 2 different views. The single plots can be created using eempf_fits and eempf_plot_comps.

# This plot is not shown, because the components violate the assumptions for fluorescence peaks (negative fluorescence). Please try, if you are interested.
eempf_compare(pf1, contour = TRUE)
eempf_compare(pf1n, contour = TRUE)