GeRnika is an R package for the simulation of tumor clonal data and the visualization and comparison of tumor phylogenies. In this document we delve into the data simulation functionality, so that advanced users can have a higher level of control over the way instances are generated.

Overview of the data instances

Each instance simulated by GeRnika consists of 4 numerical matrices \(\boldsymbol{F}\), \(\boldsymbol{F_{true}}\), \(\boldsymbol{U}\) and \(\boldsymbol{B}\), that relate to each other so that:

\[\begin{equation} \boldsymbol{F_{true}} = \boldsymbol{U} · \boldsymbol{B} \end{equation}\]

This equation arises within the Variant Allele Frequency Factorization Problem (VAFFP) formulation of the Clonal Deconvolution and Evolution Problem (CDEP), and essentially states that the \(n\) mutation frequencies in a series of \(s\) tumor samples (matrix \(\boldsymbol{F_{true}}\)) are the result of the combination of the tumor clonal structure, represented by a tree (matrix \(\boldsymbol{B}\)), and the clone proportions captured in each sample (matrix \(\boldsymbol{U}\)). Specifically, these matrices are:

\(\boldsymbol{F_{true}} \in [0, 1]^{s \times n}\) matrix: It is a matrix \(s \times n\), and it contains the mutation frequency (VAF) values of the \(n\) mutations present in \(s\) tumor biopsies or samples.

\(\boldsymbol{B} \in \{0, 1\}^{n \times n}\) matrix: It is a binary square matrix of size \(n\) where \(b_{ij} = 1\) means that clone \(i\) contains the mutation \(j\). It represents the phylogeny of the tumor.

\(\boldsymbol{U} \in [0, 1]^{s \times n}\) matrix: It is a matrix \(s \times n\) where \(u_{ij}\) is the fraction of clone \(j\) in sample \(i\).

Now, all this holds in a noise-free scenario. However, we do know that in real life there are many factors that make the VAF values noisy, which essentially means that the VAF values of the mutations do no longer exactly correspond to the sum of the sample fractions of all those clones that contain that mutation. In this R package we have only considered the noise added by DNA sequencing procedures to the VAF values and we have translated it into the \(\boldsymbol{F} \in [0, 1]^{s \times n}\) matrix. Thus, we have that the product of the matrices \(\boldsymbol{U}\) and \(\boldsymbol{B}\) may not equal \(\boldsymbol{F}\):

\[\begin{equation} \boldsymbol{F} \neq \boldsymbol{U} · \boldsymbol{B} \end{equation}\]

The VAFFP formulation works under a few assumptions. The first one is that tumors have a monoclonal origin, i.e., they arise from a single abnormal cell that that at some point began to divide uncontrollably and founded the tumor. The second one is the infinite sites assumption (ISA), according to which a given mutation arises at most once over the course of evolution of the tumor, and mutations can not disappear. Any data generated using this package will adhere to those assumptions.

Importantly, even if the data simulation functionality in GeRnika was primarily devised for creating instances for the CDEP, the output or the byproducts of the data models on it can definitely be used in any other application in which this data may be useful. With this in mind, we have implemented the procedures in such a way that they are highly customizable by the user. We explain this usage throughout this document.

Overview of the data parameters

GeRnika allows the user to set parameters related to the evolution of a tumor, the sampling procedure and the data acquisition process, and include the number of clones in the tumor, number of biopsies taken from the tumor, evolution model, evolutionary tree topology and sequencing depth. More details can be found in the following sections.

Basic instantiation

The general function to simulate an instance is create_instance. This function has the following parameters:

Parameter Description Type
n Number of mutations/clones (\(n\)). Natural number
m Number of samples (\(s\)). Natural number
k Topology parameter that controls for the linearity of the topology. Positive rational number
selection Evolution model followed by the tumor. Categorical: “positive” or “neutral”
noisy Add sequencing noise to the values in the \(F\) matrix. Boolean
depth (only if noise = TRUE) Average number of reads that map to the same locus, also known as sequencing depth. Natural number
seed Seed for the pseudo-random topology generator Real number

For instance, if we want to create a noise-free instance with 4 samples obtained from a tumor with 10 clones, evolving under neutral evolution and with a uniformly random topology, we can type the following:

I1 <- create_instance(n = 10, m = 4, k = 1, selection = "neutral", noisy = FALSE, seed = 1)
I1
#> $F
#>         mut1 mut2 mut3 mut4 mut5 mut6 mut7 mut8 mut9 mut10
#> sample1 0.00 0.06 0.00 0.06 0.00 0.94 0.06 0.06    1  0.00
#> sample2 0.12 0.12 0.12 0.43 0.00 0.45 0.42 0.07    1  0.06
#> sample3 0.19 0.03 0.06 0.16 0.03 0.00 0.03 0.00    1  0.05
#> sample4 0.99 0.00 0.00 0.00 0.00 0.00 0.00 0.00    1  0.00
#> 
#> $B
#>         mut1 mut2 mut3 mut4 mut5 mut6 mut7 mut8 mut9 mut10
#> clone1     1    0    0    0    0    0    0    0    1     0
#> clone2     0    1    0    1    0    0    1    0    1     0
#> clone3     1    0    1    0    0    0    0    0    1     0
#> clone4     0    0    0    1    0    0    0    0    1     0
#> clone5     0    1    0    1    1    0    1    0    1     0
#> clone6     0    0    0    0    0    1    0    0    1     0
#> clone7     0    0    0    1    0    0    1    0    1     0
#> clone8     0    1    0    1    0    0    1    1    1     0
#> clone9     0    0    0    0    0    0    0    0    1     0
#> clone10    1    0    1    0    0    0    0    0    1     1
#> 
#> $U
#>         clone1 clone2 clone3 clone4 clone5 clone6 clone7 clone8 clone9 clone10
#> sample1   0.00   0.00   0.00   0.00   0.00   0.94    0.0   0.06   0.00    0.00
#> sample2   0.00   0.05   0.06   0.01   0.00   0.45    0.3   0.07   0.00    0.06
#> sample3   0.13   0.00   0.01   0.13   0.03   0.00    0.0   0.00   0.65    0.05
#> sample4   0.99   0.00   0.00   0.00   0.00   0.00    0.0   0.00   0.01    0.00
#> 
#> $F_true
#>         mut1 mut2 mut3 mut4 mut5 mut6 mut7 mut8 mut9 mut10
#> sample1 0.00 0.06 0.00 0.06 0.00 0.94 0.06 0.06    1  0.00
#> sample2 0.12 0.12 0.12 0.43 0.00 0.45 0.42 0.07    1  0.06
#> sample3 0.19 0.03 0.06 0.16 0.03 0.00 0.03 0.00    1  0.05
#> sample4 0.99 0.00 0.00 0.00 0.00 0.00 0.00 0.00    1  0.00

In this case, \(\boldsymbol{F}\) equals \(\boldsymbol{F_{true}}\).

If we instead want the instance to be noisy, we just need to adjust the noisyand depth parameters:

I1 <- create_instance(n = 10, m = 4, k = 1, selection = "neutral", noisy = TRUE, 
                      depth = 100, seed = 2)
I1
#> $F
#>              mut1       mut2      mut3       mut4 mut5      mut6       mut7
#> sample1 0.3583333 0.22222222 0.0000000 0.01234568    1 0.4549763 0.00000000
#> sample2 0.2649573 0.05084746 0.0000000 0.01470588    1 0.3809524 0.00000000
#> sample3 0.5555556 0.00000000 0.6368715 0.60416667    1 0.1052632 0.09302326
#> sample4 0.9655172 0.00000000 0.9291339 0.98039216    1 0.0000000 0.00000000
#>              mut8 mut9      mut10
#> sample1 0.4927536 0.00 0.12765957
#> sample2 0.4565217 0.00 0.05555556
#> sample3 0.0000000 0.06 0.22033898
#> sample4 0.0000000 0.00 0.08421053
#> 
#> $B
#>         mut1 mut2 mut3 mut4 mut5 mut6 mut7 mut8 mut9 mut10
#> clone1     1    0    0    0    1    0    0    0    0     0
#> clone2     0    1    0    0    1    0    0    0    0     1
#> clone3     1    0    1    1    1    0    0    0    0     0
#> clone4     1    0    0    1    1    0    0    0    0     0
#> clone5     0    0    0    0    1    0    0    0    0     0
#> clone6     0    0    0    0    1    1    0    0    0     0
#> clone7     0    0    0    0    1    1    1    0    1     0
#> clone8     0    0    0    0    1    1    0    1    0     0
#> clone9     0    0    0    0    1    1    0    0    1     0
#> clone10    0    0    0    0    1    0    0    0    0     1
#> 
#> $U
#>         clone1 clone2 clone3 clone4 clone5 clone6 clone7 clone8 clone9 clone10
#> sample1   0.36   0.20   0.00   0.02   0.00   0.00   0.00   0.42   0.00    0.00
#> sample2   0.27   0.09   0.00   0.01   0.23   0.07   0.01   0.32   0.00    0.00
#> sample3   0.00   0.00   0.61   0.00   0.02   0.00   0.12   0.00   0.04    0.21
#> sample4   0.00   0.00   0.94   0.00   0.00   0.00   0.00   0.00   0.00    0.06
#> 
#> $F_true
#>         mut1 mut2 mut3 mut4 mut5 mut6 mut7 mut8 mut9 mut10
#> sample1 0.38 0.20 0.00 0.02    1 0.42 0.00 0.42 0.00  0.20
#> sample2 0.28 0.09 0.00 0.01    1 0.40 0.01 0.32 0.01  0.09
#> sample3 0.61 0.00 0.61 0.61    1 0.16 0.12 0.00 0.16  0.21
#> sample4 0.94 0.00 0.94 0.94    1 0.00 0.00 0.00 0.00  0.06

This time, the matrices \(\boldsymbol{F}\) and \(\boldsymbol{F_{true}}\) are no longer the same.

Advanced instantiation

The previous section describes how to generate instances in an easy and rapid way. However, some users require more precise control over the data, and this section is directed towards them. To begin with, we provide an overview of the structure of Gernika’s data simulation functionality. Subsequently, we delve into each element in detail and see what functions can be used to control the simulation of those aspects.

Digging into the models

In order to simulate the \(\boldsymbol{B}\) and \(\boldsymbol{U}\) matrices, Gernika relies in two models: a tumor model that simulates the evolutionary history and current state of the tumor and a sampling model that represents the tumor sampling process. Once that these two matrices have been calculated, the matrix \(\boldsymbol{F_{true}}\) is obtained by multiplying them. There is a last model which is a sequencing noise model that can optionally be used if noisy data is desired; this adds sequencing noise to the VAF values in the \(\boldsymbol{F_{true}}\) matrix, producing the \(\boldsymbol{F}\) matrix. We analyze each of these models in detail in the coming lines.

Tumor model

The tumor model generates a clonal tree \(T\) and an associated matrix \(\boldsymbol{B}\), together with the clone proportions \(\boldsymbol{c}\) and tumor blend at the moment of sampling. Briefly, the clonal tree \(T\) that represents the development of a tumor with a set of mutations \(M\) so that |\(M\)| = \(n\), is a rooted tree on an \(n\)-sized vertex set \(V_n = \{v_{1}, \dots, v_{n} \}\) and an \(n-1\)-sized edge set \(E_T = \{e_{ij} \}\), where \(v_{i}\) corresponds to the vertex or clone \(i\) and an edge \(e_{ij}\) between two vertices \(v_{i}\) and \(v_{j}\) represents a direct ancestral relationship.

In the first place, an \(n\)-sized tree \(T\) with a random topology is created in the following manner. First, the root node of \(T\), \(\mathcal R(T)\), is set and a random mutation \(M_i \in M\) is assigned to it. Then, for each of the remaining \(M_j \in M - \{M_i\}\) mutations, a new node \(v_j\) is created; then the mutation \(M_j\) is assigned to it and the node is attached as a child of an existing node in \(T\). In order to meet the ISA model, each of the newly added nodes inherits all the mutations present in its parent node.

The attachment of nodes to the tree is not uniformly at random; instead, the nodes in the \(T\) that is being built have different probabilities of being chosen as parent nodes for the new nodes, depending on the number of ascendants, \(\mathcal A(v_i)\), they have. Mathematically speaking, \(\forall M_j \neq \mathcal R(T)\), its parent node \(\mathcal P(M_j)\) is sampled from \(V_n\) following a multinomial distribution:

\[\begin{equation} \mathcal P(M_j) \sim M(n = \textrm{1}, \boldsymbol{p} = \boldsymbol{p}(v_i, k)); v_i \in V_n \label{eq:pk} \end{equation}\]

where

\[\begin{equation} \boldsymbol{p}(v_i, k) \propto k^{(|\mathcal A(v_i) | + 1)}; v_i \in V_T \end{equation}\]

\(k \in (0, +\infty)\) is the topology parameter that determines if the topology is more branched, with a decreasing probability for increasing numbers of ascendants (\(k\) \(<\) 1), or more linear, with an increasing probability for increasing numbers of ascendants (\(k\) \(>\) 1). At the same time, setting \(k\) to 1 assigns equal probabilities to all the nodes and generates a completely random topology. The relationship between \(k\) and the topology can graphically be seen in the following figure:

 Function that determines the frequency of selection of an existing node in the tree $T$ as the parent node of new children. The function is dependant on the number of ascendants and the topology parameter $k$.

Function that determines the frequency of selection of an existing node in the tree \(T\) as the parent node of new children. The function is dependant on the number of ascendants and the topology parameter \(k\).

This procedure to obtain a tree \(T\) and one of the possible \(\boldsymbol{B}\) matrices associated to it is implemented in the function create_B. This function has no parameters other than n and k described in the section Basic instantiation so we do not recommend changing it; we suggest to just adjust the value of k depending on the desired topology.

Once \(\boldsymbol{B}\) is obtained, the proportions of the clones in the tumor \(\boldsymbol{c}\) are simulated. We stress that these proportions \(c_i\) refer to the moment of the sampling because, due to the dynamic nature of tumor growth and evolution, these proportions need not be the same at any two time instants. It is also worth mentioning here that these proportions are not the ones that show in the \(\boldsymbol{U}\) matrix, which are the clone proportions, but the ones that do not depend on any sampling.

These clone proportions \(\boldsymbol{c}\) are calculated by sampling a Dirichlet distribution in each multifurcation of \(T\). For instance, for a node \(v_i\) with children \(\mathcal K(v_i)\) = {\(v_j\), \(v_k\)}, we draw one sample \(\{x_i, x_j, x_k\}\) that represents the proportions of the parent clone and its two children from \(Dir(\alpha_i, \alpha_j, \alpha_k)\), respectively. When this sampling is performed in an internal node of the tree, i.e., not in the root node, these proportions are scaled to the original proportion of the parent clone so that once all the multifurcations have been visited, the proportions of all the clones in \(T\) sum up to one.

The parameters of the Dirichlet distribution depend on the evolution model of the tumor. In GeRnika we consider two basic cases: positive selection-driven evolution or neutral evolution. According to positive selection-driven evolution, some mutations provide growth advantage whereas most of them do not, so the former clones get over other existing clones and take up the biggest part of the tumor. This results in tumors dominated by few clones, while the rest of clones are observed in minor proportions. Under neutral evolution, instead, there is not a significant number of mutations that provide fitness advantage and clones accumulate just out of tumor progression, so all clones are present in similar proportions

Based on this, the \(\alpha\) parameters for the Dirichlet distribution for positive selection-driven evolution have been set to \(\boldsymbol{\alpha}\) = 0.3, and for the neutral evolution to \(\alpha_{p}\) = 5 and \(\boldsymbol{\alpha}_{c}\) = 10 . We use different alpha values for parent and children nodes in the neutral evolution so that clones that result from multiple multifurcations do not end up with smaller proportions just because of their position in the topology, ruining the expected clone proportion distribution for this type of evolution model. These values have been selected empirically and their effect is depicted in the following figure, which shows how 5,000 random samples from the mentioned Dirichlet distributions (for the particular case of 3 dimensions, i.e., one parent and two children nodes) would be distributed on a 2-simplex. As can be seen, for positive selection (\(\boldsymbol{\alpha}\) = 0.3), the \(\{x_i, x_j, x_k\}\) values are pushed towards the corners of the simplex (\(\alpha_i\) is \(<\) 1) in an uniform manner (all \(\alpha_i\) are equal); in other words, the samples tend to be sparse; usually one component has a large value and the rest have values close to 0. Instead, when the neutral selection is adopted (\(\boldsymbol{\alpha}\) = [5, 10, 10]), the values in \(\{x_i, x_j, x_k\}\) concentrate close to the middle of the simplex (all \(\alpha_i\) are \(>\) 1) but tend to deviate to those components with larger \(\alpha\). This means that samples \(\{x_i, x_j, x_k\}\) are less sparse, with larger values for \(x_2\) and \(x_3\) in this case, which represent the children nodes.