# CIARA

CIARA (Cluster Independent Algorithm for the identification of RAre cell types) is an R package that identifies potential markers of rare cell types by looking for genes whose expression is confined to small regions of the expression space. These highly localized genes can then be used as features in standard clustering algorithms (e.g. Louvain) to identify extremely rare populations (3 or 4 cells out of thousands of cells).

## Installation

For installation, please use the following command:
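
The command itself is not reproduced in this section. As a hedged sketch, assuming the package is installed from its GitHub repository (ScialdoneLab/CIARA, the repository that hosts the figure linked in this README) and that the devtools package is available, a typical installation would look like this:

```r
# Assumption: installation from the GitHub repository ScialdoneLab/CIARA.
# devtools must already be installed: install.packages("devtools")
devtools::install_github("ScialdoneLab/CIARA")
```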

The main functions of the package are **CIARA_gene** and **CIARA**.

## CIARA_gene

`CIARA_gene(norm_matrix, knn_matrix, gene_expression, p_value = 0.001, odds_ratio = 2, local_region = 1, approximation = FALSE)` requires as input:

1. **norm_matrix**: Normalized count matrix (n_genes x n_cells)
2. **knn_matrix**: K-nearest neighbors matrix (n_cells x n_cells)
3. **gene_expression**: Numeric vector with the gene expression (length equal to n_cells). The gene expression is binarized (equal to 0/1 in the cells where the value is below/above the median)
4. **p_value**: Maximum p value (returned by the R function **fisher.test** with parameter alternative = "g") for considering a local region enriched
5. **odds_ratio**: Minimum odds ratio (returned by the R function **fisher.test** with parameter alternative = "g") above which a local region is considered enriched
6. **local_region**: Minimum number of local regions (a cell with its knn neighbours) where the binarized gene expression is enriched in 1
7. **approximation**: Logical. If TRUE, for a given gene the Fisher test is run only in the local regions of the cells where the binarized gene expression is 1
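
To illustrate the expected shape of **knn_matrix**, here is a minimal base R sketch that builds an n_cells x n_cells matrix from a toy embedding. It assumes the matrix is a binary adjacency matrix in which row i marks cell i and its K nearest neighbours; that encoding, the toy data and the names `embedding` and `k` are illustrative assumptions, not taken from the package.

```r
set.seed(42)
n_cells <- 30
k <- 5

# Toy low-dimensional embedding (e.g. two principal components), one row per cell.
embedding <- matrix(rnorm(n_cells * 2), ncol = 2,
                    dimnames = list(paste0("cell_", seq_len(n_cells)), NULL))

# Pairwise Euclidean distances between cells.
distances <- as.matrix(dist(embedding))

# Assumed encoding: row i has a 1 in the column of cell i itself
# and in the columns of its k nearest neighbours, 0 elsewhere.
knn_matrix <- matrix(0, n_cells, n_cells, dimnames = dimnames(distances))
for (i in seq_len(n_cells)) {
  nearest <- order(distances[i, ])[seq_len(k + 1)]  # includes cell i (distance 0)
  knn_matrix[i, nearest] <- 1
}

dim(knn_matrix)
```

In practice the matrix would normally come from the neighbour graph of the clustering workflow rather than being rebuilt by hand.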

The gene expression is binarized: it is set to 1 in the cells where the value is above the median and to 0 where it is below. Each cell together with its K nearest neighbors defines a local region. If at least **local_region** local regions are enriched in 1 according to **fisher.test** (p value below **p_value** and odds ratio greater than or equal to **odds_ratio**), the entropy for the gene is computed from the probabilities of observing 1 and 0. The minimum of the entropy across all the enriched local regions is the entropy of mixing. If there are no enriched local regions, the entropy of mixing and the p value are set to 1 by default.
The output of **CIARA_gene** is a list with one element corresponding to the p value of the gene.
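
The procedure described above can be sketched for a single local region in base R. This is an illustration under stated assumptions, not the package's implementation: the layout of the contingency table (inside/outside the region versus 1/0), the use of Shannon entropy in bits, and the toy data are all assumptions.

```r
n_cells <- 200

# Toy expression vector: the gene is expressed in only a handful of cells.
gene_expression <- rep(0, n_cells)
gene_expression[1:6] <- c(5, 4, 6, 3, 5, 4)
gene_expression[50] <- 4

# Binarize: 1 where the value is above the median, 0 otherwise.
binary <- as.integer(gene_expression > median(gene_expression))

# Hypothetical local region: a cell together with its nearest neighbours.
# The indices are chosen by hand here; normally they come from knn_matrix.
region <- 1:8
in_region <- seq_len(n_cells) %in% region

# Assumed 2 x 2 contingency table: (in region / outside) x (1 / 0).
contingency <- matrix(
  c(sum(binary[in_region] == 1), sum(binary[in_region] == 0),
    sum(binary[!in_region] == 1), sum(binary[!in_region] == 0)),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("in_region", "outside"), c("one", "zero"))
)

# One-sided Fisher test: is the region enriched in 1?
test <- fisher.test(contingency, alternative = "greater")
enriched <- test$p.value < 0.001 && unname(test$estimate) >= 2

# Entropy of the 1/0 distribution inside the region (assumed: Shannon, base 2).
p_one <- mean(binary[in_region])
probs <- c(p_one, 1 - p_one)
probs <- probs[probs > 0]  # avoid log2(0)
entropy <- -sum(probs * log2(probs))
```

Repeating this over every local region and taking the minimum entropy among the enriched ones would correspond to the entropy of mixing described above.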

## CIARA

`CIARA(norm_matrix, knn_matrix, background, cores_number = 1, p_value = 0.001, odds_ratio = 2, local_region = 1, approximation = FALSE)` requires as input:

1. **norm_matrix**: Normalized count matrix (n_genes x n_cells)
2. **knn_matrix**: K-nearest neighbors matrix (n_cells x n_cells)
3. **background**: Vector of genes for which the function **CIARA_gene** is run
4. **cores_number**: Integer. Number of cores to use
5. **p_value**: Maximum p value (returned by the R function **fisher.test** with parameter alternative = "g") for considering a local region enriched
6. **odds_ratio**: Minimum odds ratio (returned by the R function **fisher.test** with parameter alternative = "g") above which a local region is considered enriched
7. **local_region**: Minimum number of local regions (a cell with its knn neighbours) where the binarized gene expression is enriched in 1
8. **approximation**: Logical. If TRUE, for a given gene the Fisher test is run only in the local regions of the cells where the binarized gene expression is 1

Returns a data frame with as many rows as the length of **background**. Each row is the output of **CIARA_gene** for one gene.

The vector of genes for which the function **CIARA_gene** is run can be obtained with the function **get_background_full**.
This function returns a vector with all the genes expressed at a level higher than **threshold** in a number of cells between **n_cells_low** and **n_cells_high**.
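
The selection rule can be illustrated with a short base R sketch. The helper `select_background` is hypothetical and is not the package's implementation; in particular, treating both bounds as inclusive is an assumption.

```r
# Hypothetical helper illustrating the selection rule described above.
select_background <- function(norm_matrix, threshold = 1,
                              n_cells_low = 3, n_cells_high = 20) {
  # Number of cells in which each gene is expressed above the threshold.
  n_expressing <- rowSums(norm_matrix > threshold)
  # Assumption: both bounds are inclusive.
  keep <- n_expressing >= n_cells_low & n_expressing <= n_cells_high
  rownames(norm_matrix)[keep]
}

# Toy matrix (genes x cells) with one rare, one common and one silent gene.
toy <- matrix(0, nrow = 3, ncol = 50,
              dimnames = list(c("rare_gene", "common_gene", "silent_gene"),
                              paste0("cell_", 1:50)))
toy["rare_gene", 1:5] <- 3
toy["common_gene", 1:40] <- 3

background <- select_background(toy)
```

Only `rare_gene` passes the filter in this toy example: it is expressed in 5 cells, while the other two genes fall outside the 3 to 20 cell range.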

An example of input could be:

```r
load(file = "norm_counts.Rda")
load(file = "knn_matrix.Rda")
background <- get_background_full(norm_counts, threshold = 1, n_cells_low = 3, n_cells_high = 20)
result <- CIARA(norm_counts, knn_matrix, background, cores_number = 1, p_value = 0.001, odds_ratio = 2, local_region = 1, approximation = FALSE)
```

The two input files *norm_counts* and *knn_matrix* can be obtained from a Seurat object:
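
A minimal sketch of that extraction follows. It assumes an existing Seurat object named `seurat_object` (a hypothetical name) whose assay is called "RNA" and on which normalization and `Seurat::FindNeighbors()` have already been run, so that a nearest-neighbour graph named `RNA_nn` exists; the graph name depends on the assay name.

```r
# Assumption: `seurat_object` already exists, uses the "RNA" assay, and
# Seurat::NormalizeData() and Seurat::FindNeighbors() have been run on it.

# Normalized counts (genes x cells).
norm_counts <- as.matrix(Seurat::GetAssayData(seurat_object, assay = "RNA", slot = "data"))

# Nearest-neighbour graph (cells x cells) stored by FindNeighbors().
knn_matrix <- as.matrix(seurat_object@graphs$RNA_nn)
```

In recent Seurat versions the `slot` argument of `GetAssayData()` is deprecated in favour of `layer`, so `layer = "data"` may be needed instead.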

```r
# step 4
result_test <- test_hvg(raw_counts, final_cluster, ciara_genes, background, number_hvg = 100, min_p_value = 0.05)
result_test[[2]]
# "Endoderm" "Hemogenic Endothelial Progenitors" "ExE Mesoderm"

# We need to sub-cluster the three clusters above
raw_endoderm <- raw_counts[, human_embryo_cluster == "Endoderm"]
raw_emo <- raw_counts[, human_embryo_cluster == "Hemogenic Endothelial Progenitors"]
raw_exe_meso <- raw_counts[, human_embryo_cluster == "ExE Mesoderm"]
combined_endoderm <- Cluster_analysis_sub(raw_endoderm, 0.2, 5, 30, "Endoderm")
combined_exe_meso <- Cluster_analysis_sub(raw_exe_meso, 0.5, 5, 30, "ExE Mesoderm")
combined_emo <- Cluster_analysis_sub(raw_emo, 0.6, 5, 30, "Hemogenic Endothelial Progenitors")

all_sub_cluster <- c(combined_endoderm$seurat_clusters, combined_emo$seurat_clusters, combined_exe_meso$seurat_clusters)
final_cluster_version_sub <- merge_cluster(final_cluster, all_sub_cluster)

plot_umap(coordinate_umap, final_cluster_version_sub)
```

<img src="https://github.com/ScialdoneLab/CIARA/blob/main/figures/entropy_cluster.png" width="700" height="500">
With the cluster analysis based on CIARA we are able to detect two clusters (Endoderm_2 and Hemogenic Endothelial Progenitors_4, highlighted in the plot) that were not reported in the original paper.

For more exhaustive information about the functions offered by CIARA for the identification of rare populations of cells, see the **Tutorials section** below and the help pages of the individual functions (*?function_name*).

## Vignette

The following vignette is available and completely reproducible. It uses single-cell RNA-seq data from a human embryo at the gastrulation stage from [Tyser *et al.*, 2021](https://www.nature.com/articles/s41586-021-04158-y). The raw count matrix was downloaded from [http://human-gastrula.net](http://human-gastrula.net).
An extremely rare population of primordial germ cells (PGCs, 7 cells) is easily identified with the entropy of mixing.
It can be accessed within R with:

```r
utils::vignette("CIARA")
```