PCAmatchR matches cases to controls based on genotype principal components (PC). In order to produce more genetically similar matches, a weighted Mahalanobis distance metric (Kidd et al. (1987)) is used to determine matches. Weights are equal to the percent variance explained by each PC.
The release version of PCAmatchR can be installed from CRAN:
The development version of the PCAmatchR package can be installed from the GitHub repository by using the devtools package:
PCAmatchR depends on the optmatch package which must be manually installed from CRAN:
Following installation, attach the PCAmatchR and optmatch packages with:
library(PCAmatchR) library(optmatch) setMaxProblemSize(size = Inf) # optmatch option to allow for large matching problems. See optmatch documentation for full description.
Here we perform a hypothetical example of case-control matching using the Phase 3 data release of 1000 Genomes Project, which contains genotype data from 2,504 individuals from 26 distinct populations (available at https://www.cog-genomics.org/plink/2.0/resources).
Within PCAmatchR, we include sample data containing information about population, gender, and the first 20 principal components and eigenvalues calculated from 2,504 individuals in the Phase 3 data release of the 1000 Genomes Project. The example principal component analysis was conducted with PLINK using a set of ancestry informative SNPs (Yu et al. (2008)). The data files are contained within:
PCs_1000G, A sample dataset containing information about population, gender, and the first 20 principal components calculated from 2,504 individuals in the Phase 3 data release of the 1000 Genomes Project.
eigenvalues_1000G, A sample dataset containing the first 20 eigenvalues calculated from 2,504 individuals in the Phase 3 data release of the 1000 Genomes Project.
eigenvalues_all_1000G, A sample dataset containing all eigenvalues calculated from 2,504 individuals in the Phase 3 data release of the 1000 Genomes Project.
# Load required packages <- c("PCAmatchR", "optmatch") loadedPackages invisible(lapply(loadedPackages, require, character.only = TRUE)) # Create PC data frame <- as.data.frame(PCs_1000G[,c(1,5:24)]) pcs # Create eigenvalues vector <- c(eigenvalues_1000G)$eigen_values eigen_vals # Create full eigenvalues vector <- c(eigenvalues_all_1000G)$eigen_valuesall_eigen_vals
For this example analysis, we select individuals from the ESN (Esan in Nigeria) population as cases (N=99), while all remaining samples are used as the control population (N=2,405):
<- as.data.frame(PCs_1000G[,1:4]) covariate_data$case <- ifelse(covariate_data$pop=="ESN", c(1), c(0))covariate_data
Matching is performed using
match_maker. Within this example, cases and controls are 1:2 matched on the first 20 PCs:
<- match_maker(PC = pcs, match_maker_outputeigen_value = eigen_vals, data = covariate_data, ids = c("sample"), case_control="case", num_controls = 2, eigen_sum = sum(all_eigen_vals))
Derived matches are contained within the
matches object. The weighted Mahalanobis distance metric between case and control pairs is detailed within the
<- match_maker_output$matches PCA_matches$match_distancePCA_matches
all_eigen_vals file is not needed to run
match_maker. The user can directly supply the scalar sum of all eigen values to the
match_maker function through the
eigen_sum input. In the above example, the sum of all the eigenvalues was 2722.856. This value is used to weight each eigen value to determine the percentage of variance explained.
If using PLINK to perform the PCA (e.g.,
--pca flag), there are multiple options to calculate the sum of all eigen values instead of having to generate all PCs:
--make-relin PLINK. The diagonal of this matrix is equal to the eigen values. Summing the diagonal will obtain the integer to input into
eigen_sum. This approach may not be feasible when working with a large number of cases and controls, due to the file size.
--ibcin PLINK. This will output a file with 6 columns. Column 4 (
Fhat1), gives the variance-standardized relationship minus 1. Add 1 to each entry in this column, and then sum all entries. This will result in the total sum of the eigen values (i.e., the total variance explained). This value can then be used as the input to
The distance between matches can be visualized using
plot_maker(data=match_maker_output, x_var="PC1", y_var="PC2", case_control="case", line=F, xlim = c(0.025,0.04), ylim = c(-0.008,0.00))
The function further allows for connections between matches:
plot_maker(data=match_maker_output, x_var="PC1", y_var="PC2", case_control="case", line=T, xlim = c(0.025,0.04), ylim = c(-0.008,0.00))