`SimReg`

has a function `sim_reg`

for performing *Bayesian Similarity Regression* [1]. It returns an estimated probability of an association between ontological term sets and a binary variable, and conditional on the association: a *characteristic ontological profile*, such that ontological similarity to the profile increases the probability of the binary variable taking the value `TRUE`

. The procedure has been used in the context of linking an ontologically encoded phenotype (as HPO terms) to a binary genotype (indicating the presence or absence of a rare variant within given genes) [1], so in this guide, we'll adopt the same theme.

The function accepts arguments including `logical`

response variable `y`

, ontologically encoded predictor variable `x`

, and additional arguments for tuning the compromise between execution speed and precision for the procedure.

It returns an object of class `sim_reg_output`

, which contains the pertinent information about the results of the inference. Of particular interest is the estimated probability of association, i.e. the probability that model selection indicator `gamma`

= 1. The function `prob_association`

can be used on the output to obtain such and estimate. Also, the posterior distribution of the characteristic ontological profile `phi`

may be of interest, for which the function `get_term_marginals`

can be used.

To set up an environment where we can run a simple example, we need an `ontology_index`

object. The `ontologyIndex`

package contains such an object - the Human Phenotype Ontology, `hpo`

- so we load the package and the data, and proceed to create an HPO profile `template`

and an enclosing set of terms, `terms`

, from which we'll generate random term sets upon which to apply the function. In our setting, we'll interpret this HPO profile `template`

as the typical phenotype of a hypothetical disease. We set `template`

to the set `HP:0005537, HP:0000729`

and `HP:0001873`

, corresponding to phenotype abnormalities 'Decreased mean platelet volume', 'Autistic behavior' and 'Thrombocytopenia' respectively.

```
library(ontologyIndex)
library(SimReg)
data(hpo)
set.seed(1)
template <- c("HP:0005537", "HP:0000729", "HP:0001873")
terms <- get_ancestors(hpo, c(template, sample(hpo$id, size=50)))
```

First, we'll do an example where there is no association between `x`

and `y`

, and then one where there is an association.

In the example with no association, we'll fix `y`

, with 10 `TRUE`

s and generate the `x`

randomly, with each set of ontological terms determined by sampling 5 random terms from `terms`

.

```
y <- c(rep(TRUE, 10), rep(FALSE, 90))
x <- replicate(simplify=FALSE, n=100, expr=minimal_set(hpo, sample(terms, size=5)))
```

Thus, our input data looks like:

`y`

```
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE
```

`head(x)`

```
## [[1]]
## [1] "HP:0040224" "HP:0011458" "HP:0011328" "HP:0008386" "HP:0009025"
##
## [[2]]
## [1] "HP:0000235" "HP:0001315" "HP:0009783" "HP:0011792" "HP:0003115"
##
## [[3]]
## [1] "HP:0002493" "HP:0001928" "HP:0008490" "HP:0001392" "HP:0000935"
##
## [[4]]
## [1] "HP:0001476" "HP:0030133" "HP:0007477" "HP:0100261"
##
## [[5]]
## [1] "HP:0000614" "HP:0000525" "HP:0000270" "HP:0012140" "HP:0011799"
##
## [[6]]
## [1] "HP:0000235" "HP:0001018" "HP:0000935" "HP:0003468" "HP:0040068"
```

Now we can call the `sim_reg`

function to estimate the probability of an association (note: by default, the probability of an association has a prior of 0.05 and this can be set by passing a `gamma_prior_prob`

argument), and print the mean posterior value of `gamma`

, corresponding to our estimated probability of association.

```
no_assoc <- sim_reg(ontology=hpo, x=x, y=y)
prob_association(no_assoc)
```

`## [1] 0.0001543326`

We note that there is a low probability of association. Now, we sample `x`

conditional on `y`

, so that if `y[i] == TRUE`

, then `x`

has 2 out of the 3 terms in `template`

added to its profile.

```
x_assoc <- lapply(y, function(y_i) minimal_set(hpo, c(
sample(terms, size=5), if (y_i) sample(template, size=2))))
```

If we look again at the first few values in `x`

for which `y[i] == TRUE`

, we notice that they contain terms from the template.

`head(x_assoc)`

```
## [[1]]
## [1] "HP:0012638" "HP:0030669" "HP:0100261" "HP:0100491" "HP:0011314"
## [6] "HP:0005537" "HP:0001873"
##
## [[2]]
## [1] "HP:0030680" "HP:0100240" "HP:0009126" "HP:0000236" "HP:0000729"
## [6] "HP:0001873"
##
## [[3]]
## [1] "HP:0011554" "HP:0030178" "HP:0003115" "HP:0030332" "HP:0003474"
## [6] "HP:0001873" "HP:0005537"
##
## [[4]]
## [1] "HP:0011004" "HP:0009126" "HP:0003119" "HP:0040069" "HP:0000729"
## [6] "HP:0005537"
##
## [[5]]
## [1] "HP:0011792" "HP:0030332" "HP:0011182" "HP:0000598" "HP:0100240"
## [6] "HP:0001873" "HP:0005537"
##
## [[6]]
## [1] "HP:0006915" "HP:0001634" "HP:0000759" "HP:0002493" "HP:0005537"
## [6] "HP:0001873"
```

Now we run the procedure again with the new `x`

and `y`

and print the mean posterior value of `gamma`

.

```
assoc <- sim_reg(ontology=hpo, x=x_assoc, y=y)
prob_association(assoc)
```

`## [1] 0.9999955`

We note that we infer a higher probability of association. We can also visualise the estimated characteristic ontological profile, using the function `plot_term_marginals`

, and note that the inferred characteristic phenotype corresponds well to `template`

.

`plot_term_marginals(hpo, get_term_marginals(assoc), max_terms=8, fontsize=30)`

- D. Greene, NIHR BioResource, S. Richardson, E. Turro, Phenotype similarity regression for identifying the genetic determinants of rare diseases, The American Journal of Human Genetics 98, 1-10, March 3, 2016.