These packages are necessary for this analysis to be conducted.

Import the data to be used in analysis. These example data are included in {hagis} and so when loading {hagis} they become available in your R session.

```
## Isolate Line Rps Total HR (1) Lesion (2)
## <int> <char> <char> <int> <int> <int>
## 1: 1 Williams susceptible 10 0 0
## 2: 1 Harlon Rps 1a 10 4 0
## 3: 1 Harosoy 13xx Rps 1b 8 0 0
## 4: 1 L75-3735 Rps 1c 10 10 0
## 5: 1 PI 103091 Rps 1d 9 2 0
## 6: 1 Williams 82 Rps 1k 10 0 0
## Lesion to cotyledon (3) Dead (4) total.susc total.resis perc.susc perc.resis
## <int> <int> <int> <int> <int> <int>
## 1: 0 10 10 0 100 0
## 2: 0 6 6 4 60 40
## 3: 0 8 8 0 100 0
## 4: 0 0 0 10 0 100
## 5: 1 6 7 2 78 22
## 6: 0 10 10 0 100 0
```

```
## Sample Locale
## 1 1 Michigan
## 2 10 Michigan
## 3 11 Michigan
## 4 12 Michigan
## 5 13 Michigan
## 6 14 Michigan
```

This removes the “MPS17_” from the isolates, so that they will be read as numeric instead of character. The next step removes the “Rps” from the gene names, so that they will be read as numeric instead of character.

```
P_sojae_survey$Isolate <-
gsub(pattern = "MPS17_",
replacement = "",
x = P_sojae_survey$Isolate)
P_sojae_survey$Rps <-
gsub(pattern = "Rps ",
replacement = "",
x = P_sojae_survey$Rps)
```

Set up the {hagis} arguments for analysis. Please see
`vignette("hagis")`

for more details on how to specify
arguments for the functions in this package.

Using `create_binary_matrix()`

transforms the data so that
it is in the correct format (binary data matrix) for PCOA analysis.

```
## 1a 1b 1c 1d 1k 2 3a 3b 3c 4 5 6 7
## 1 1 1 0 1 1 1 1 1 0 0 1 1 1
## 10 1 0 1 0 0 0 0 1 0 0 1 0 1
## 11 1 0 1 0 0 0 0 1 0 0 1 1 1
## 12 1 1 1 1 1 1 0 0 0 0 0 1 1
## 13 1 1 1 1 1 0 0 1 0 0 0 0 1
## 14 1 1 1 0 1 0 0 1 0 0 1 1 1
## 15 1 1 1 1 1 1 0 1 0 1 1 1 1
## 16 1 1 1 0 1 0 0 1 0 0 1 0 1
## 17 1 1 1 1 1 1 0 1 0 1 1 0 1
## 18 1 1 1 1 1 0 1 1 0 0 1 1 1
## 19 1 1 1 1 1 1 1 1 1 0 0 1 1
## 2 1 1 1 0 1 1 0 1 1 1 0 1 1
## 20 1 1 1 1 1 1 1 1 1 0 1 0 1
## 21 1 1 1 1 1 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 0 1 0 1 0 1 1
## 4 1 0 1 1 1 1 0 1 0 0 1 0 1
## 5 1 0 1 1 1 1 0 1 0 0 0 1 1
## 6 1 0 1 1 1 1 0 1 0 0 1 0 1
## 7 1 1 1 1 1 1 0 1 0 0 0 0 1
## 8 1 1 1 1 1 1 0 1 0 0 0 0 1
## 9 1 0 1 1 0 0 0 1 0 0 1 0 1
```

The `P_sojae_survey.matrix`

object contains the RPS genes
numbered as rows and the columns are numbered isolates. A “1” indicates
the isolate caused disease on the RPS gene, and a “0” means it did
not.

The following code will transpose the
`P_sojae_survey.matrix`

and then calculate the Jaccard
distances for each isolate. Jaccard distance calculations need to be
used as pathotype data is presence/absence for virulence. Lastly, PCOA
is performed to identify the variance explained by each principal
coordinate. This will be used later to visualize and identify distance
pathotype groupings by geographic location. In this example, Jaccard
distances are used as this is presence/absence data.

After performing the principal coordinates analysis, we see that the
scree plot says that about 70% of the variation in Jaccard distances are
explained within the first two dimensions (*i.e.*, axes). This is
good. Usually a good rule of thumb is that if the second dimension is
roughly half variation explained in the first dimension you don’t need
to look further at the third or n+1 dimensions.

```
princoor.pathotype <- pcoa(P_sojae_survey.matrix.jaccard)
barplot(princoor.pathotype$values$Relative_eig[1:10])
```

Now we can calculate the percentage of variation that each principal
coordinate accounts for. Another way to calculate the percent variation
is to look at the `Relative_eig`

column.

```
# Dimension (i.e., Axis 1 (PCOA1))
Axis1.percent <-
princoor.pathotype$values$Relative_eig[[1]] * 100
# Dimension (i.e., Axis 2 (PCOA2))
Axis2.percent <-
princoor.pathotype$values$Relative_eig[[2]] * 100
Axis1.percent
```

`## [1] 50.39109`

`## [1] 21.79118`

Now we can make a data frame with the two principal coordinates that
account for the most variation in the data. We will then add metadata to
the data frame `pca.data`

.

You will need to add information on the sample collection location or
other data to help identify different pathotype groupings based on
geographic location or other factors. We will use
`left_join()`

from {dplyr} to combine the
`princoor.pathotype.data`

with the metadata that we’ve
already loaded, `sample_meta`

that contains a geographic
location (state name).

```
princoor.pathotype.data <-
left_join(princoor.pathotype.data, sample_meta, by = "Sample")
princoor.pathotype.data
```

```
## Sample X Y Locale
## 1 1 0.127655783 0.08704548 Michigan
## 2 10 -0.441075005 0.01573886 Michigan
## 3 11 -0.336826428 0.21804192 Michigan
## 4 12 0.217071376 -0.03795693 Michigan
## 5 13 -0.003862869 -0.14942952 Michigan
## 6 14 -0.108034781 0.18124849 Michigan
## 7 15 0.082862290 0.06501932 Michigan
## 8 16 -0.182559838 0.01610323 Michigan
## 9 17 0.043684771 -0.07086878 Michigan
## 10 18 -0.007048533 0.11629976 Michigan
## 11 19 0.224528803 0.06847162 Michigan
## 12 2 0.190565779 0.16984304 Michigan
## 13 20 0.081014587 -0.01986841 Michigan
## 14 21 0.156490576 0.15103874 Michigan
## 15 3 0.187665103 0.01477482 Michigan
## 16 4 -0.088772478 -0.14970662 Michigan
## 17 5 0.069295886 -0.06198926 Michigan
## 18 6 -0.088772478 -0.14970662 Michigan
## 19 7 0.102108599 -0.17330058 Michigan
## 20 8 0.102108599 -0.17330058 Michigan
## 21 9 -0.328099742 -0.11749798 Michigan
```

Now we will plot the PCA data using {ggplot2} and color the points
based on location, `Locale`

, and identify the 95% confidence
interval of those groups using the `stat_ellipse()`

function.

```
ggplot(data = princoor.pathotype.data, aes(x = X, y = Y)) +
geom_point(aes(colour = Locale)) +
xlab(paste("PCOA1 - ", round(Axis1.percent, 2), "%", sep = "")) +
ylab(paste("PCOA2 - ", round(Axis2.percent, 2), "%", sep = "")) +
theme_bw() +
theme(
axis.title.x = element_text(face = "bold", size = 15),
axis.title.y = element_text(face = "bold", size = 15),
axis.text = element_text(face = "bold", size = 10),
legend.title = element_text(face = "bold", size = 10),
legend.text = element_text(face = "bold", size = 10),
legend.key.size = unit(1, 'lines')
) +
stat_ellipse(data = princoor.pathotype.data, aes(x = X, y = Y),
level = 0.95) +
ggtitle("Pathotype Jaccard Distances PCOA")
```

When using two or more pathotype datasets for comparisons, you can use beta-diversity tests to identify if there are significant differences between their sampled pathotype compositions. These code are presented as an example for further downstream analysis that can be used when comparing two or more populations’ pathotype composition.

In these examples we will artificially split the dataset into two, so
that these analyses can be shown. When performing your own analyses you
will likely have two geographic locations to compare already. Make sure
you can differentiate these populations with the metadata file used
previously (*i.e.*, column in the dataset that specifies where
the isolate came from; USA, Brazil, China, Australia, etc.).

Beta-dispersion tests if the dispersion, variance, of two or more groups are significantly different or not. First, an item named “groups” must be made that contains lists of the location and number of isolates from that location used in the analysis. We will then check to make sure the lists in “groups” adds up to the number of isolated used in analysis.

First, make a list of the locations for each pathotype. Note that when you are using two or more locations you will need to make a list for each location with a length of isolates used.

```
groups <- factor(c(rep("Michigan_1", 11), rep("Michigan_2", 10)))
# this number shows how many isolates are in all "groups" lists combined
length(groups)
```

`## [1] 21`

```
# this shows the number of isolates within your data set, these numbers should
# match for downstream analyses to work!!
length(unique(P_sojae_survey$Isolate))
```

`## [1] 21`

Next, beta-dispersion will be calculated using the Jaccard distance matrix made previously. An ANOVA is then performed to identify significance within the dataset. Post-hoc tests can be used to identify significant interactions between specific locations within the dataset. This can then be plotted to visualize how the beta-dispersion is different between groups.

```
# calculates the beta-dispersion for each group, when comparing 2 or more
pathotype.disp <-
betadisper(P_sojae_survey.matrix.jaccard, groups)
# tests if centroid distances are significantly different from each other
pathotype.disp.anova <- anova(pathotype.disp)
pathotype.disp.anova
```

```
## Analysis of Variance Table
##
## Response: Distances
## Df Sum Sq Mean Sq F value Pr(>F)
## Groups 1 0.008375 0.0083752 0.9672 0.3377
## Residuals 19 0.164523 0.0086591
```

```
# test significance between each group
pathotype.disp.TukeyHSD <- TukeyHSD(pathotype.disp)
pathotype.disp.TukeyHSD
```

```
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = distances ~ group, data = df)
##
## $group
## diff lwr upr p adj
## Michigan_2-Michigan_1 -0.03998626 -0.1250853 0.04511275 0.3377355
```

The ANOVA identified no significant differences between groups
dispersion (p = 0.3377355). This means that the groups dispersion, or
variance, is not significantly different from each other and the groups
dispersion is likely homogeneous between groups. At this point, a Tukey
HSD test is not warranted, but we use it as an example here. Again, a
p-value of 0.3377355 is reported from the Tukey HSD tests and we reject
the hypothesis that these groups may have different dispersion. We can
plot the dispersion for each group using the `plot()`

function. As expected, since we have identified no significant
differences, the two groups dispersion overlap a great deal and are not
distinct from each other. Again this shows that pathotype dispersion
between the groups is homogeneous and not different in this
instance.

If we were working with a data set that had groups with significantly different dispersion we would expect to see a significant ANOVA p-value (p < 0.05) as well as significance when using the Tukey HSD test. Lastly, the plotted dispersion will form distinct, separate, groups which can be observed.

Differences in beta-dispersion may indicate separate pathotype groups which should be further investigated with Permutation Based Analysis of Variance (PERMANOVA) and Analysis of Similarity (ANOSIM) analysis. Groups which have similar dispersion may still be significantly different in regards to their centroids, which will be tested using a PERMANOVA.

PERMANOVA tests if the centroids, similar to means, of each group are significantly different from each other. Likewise, an \(R^2\) statistic is calculated, showing the percentage of the variance explained by the groups.

```
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
##
## adonis2(formula = P_sojae_survey.matrix.jaccard ~ groups)
## Df SumOfSqs R2 F Pr(>F)
## groups 1 0.11358 0.07869 1.6229 0.191
## Residual 19 1.32976 0.92131
## Total 20 1.44335 1.00000
```

The PERMANOVA identified no significant differences between the
groups centroids, or means (*p* = 0.191). In addition to
identifying significance between group centroids, the PERMANOVA also
calculates how much of the variance can be explained by the specified
groups (see the \(R^2\) column in the
PERMANOVA output). In this case, the \(R^2\) is 0.0786944, so 7.9% of the variance
is explained by the groups used in analysis. Based on the PERMANOVA
results we can conclude that these two groups are not different from
each other and likely have similar pathotypes to each other.

ANOSIM statistic (R) ranges from between -1 and 1. Positive numbers
suggest that there is more similarity within groups than there is
between groups. Values close to zero indicate no difference between
groups (*i.e.*, similarities are the same between groups).

```
##
## Call:
## anosim(x = P_sojae_survey.matrix.jaccard, grouping = groups)
## Dissimilarity: jaccard
##
## ANOSIM statistic R: 0.06882
## Significance: 0.159
##
## Permutation: free
## Number of permutations: 999
```

ANOSIM statistic (R) was 0.0688182, so there are more similarities between groups than there are within groups. This is evidence that the groups are not different from one another. Likewise the significance is >0.05 so there is no significant difference between groups’ similarities.