# Why ordination?

In most introductory statistics classes in the life sciences, analysis methods are usually restricted to a single (univariate) response, or $$Y$$ variable. In this case, a measured attribute of a sample, such as productivity or species diversity, is modeled as a function of one or more predictor, or $$X$$ variables. However, responses can be multivariate, as well, when a interested in more than one attribute of a sample.

When are we likely to have a multivariate response variable? In ecology and other life sciences, species composition data is increasingly used as a response – think counts of plants occurring in different conservation areas or comparisons of gut microbiomes across different diets. The simplest approach to analyzing this sort of data would be to ask if the first species I found increases in response to some environmental factor, then repeat this univariate model for every species found. However, this approach would be tedious, miss out on key correlations between species, and suffer from multiple comparison problems.

Instead, we often want to evaluate how environmental factors control the abundance of multiple species simultaneously. When analysis is directed toward analyzing the properties of a species matrix, containing counts of different species (columns) by samples (rows), this approach is known as ordination.

Ordination comes in two flavors:

• Unconstrained Ordination: Samples are arranged into a smaller number of dimensions, independent of any explanatory variables. Primarily used to visualize information and detect potential gradients. Hypothesis generating.

• Constrained Ordination: Samples are arranged onto a number of dimensions that are defined by a another set of environmental variables. Used to test relationships between species abundances and environmental gradients. Hypothesis testing.

In most applications, these two approaches work in tandem.

Regardless of the scale or taxa involved, most community (species x sample) data matrices share some general properties:

• They tend to be sparse: a large portion (often the majority) of entries consists of zeros.

• Most species are infrequent. That is, the majority of species is typically present in a minority of locations, and contributes little to the overall abundance.

• The number of factors influencing species composition is potentially very large. For example, forest tree density can be influenced by time since fire, elevation, nutrients, soil depth, soil texture, water availability and many other factors.

• The number of important factors is typically few. That is, a few factors can explain the majority of the explainable variation. Another way of saying this is that the intrinsic dimensionality is low.

• There is much noise. Even under ideal circumstances, replicate samples will vary substantially from each other. This is largely due to stochastic events and contingency, though observer error may also be appreciable.

• There is much redundant information: species often share similar distributions. For example, the abundance of one species gives some insights into the abundance of another. It is this property of redundancy that allows us to make sense of compositional data.

For any ordination method to be generally useful, it must be able to cope with the above properties of community data matrices.

From Michael Palmer’s Ordination Site

# The Species Matrix

To begin, we’ll pull an example species matrix from the vegan package in R. This data is from Väre et al (1995), a study of reindeer grazing effects on understory plant communities.

The two components of this dataset are:

1. A species x sample matrix, “varespec”
2. A soil chemistry x sample matrix, “varechem”
library(vegan); library(tidyverse); library(viridis)

data("varespec") # Community data
data("varechem") # Environmental data

veg_comm <- varespec
veg_envt <- varechem

head(veg_comm)
##    Callvulg Empenigr Rhodtome Vaccmyrt Vaccviti Pinusylv Descflex Betupube
## 18     0.55    11.13     0.00     0.00    17.80     0.07     0.00        0
## 15     0.67     0.17     0.00     0.35    12.13     0.12     0.00        0
## 24     0.10     1.55     0.00     0.00    13.47     0.25     0.00        0
## 27     0.00    15.13     2.42     5.92    15.97     0.00     3.70        0
## 23     0.00    12.68     0.00     0.00    23.73     0.03     0.00        0
## 19     0.00     8.92     0.00     2.42    10.28     0.12     0.02        0
##    Vacculig Diphcomp Dicrsp Dicrfusc Dicrpoly Hylosple Pleuschr Polypili
## 18     1.60     2.07   0.00     1.62     0.00      0.0     4.67     0.02
## 15     0.00     0.00   0.33    10.92     0.02      0.0    37.75     0.02
## 24     0.00     0.00  23.43     0.00     1.68      0.0    32.92     0.00
## 27     1.12     0.00   0.00     3.63     0.00      6.7    58.07     0.00
## 23     0.00     0.00   0.00     3.42     0.02      0.0    19.42     0.02
## 19     0.00     0.00   0.00     0.32     0.02      0.0    21.03     0.02
## 18     0.13     0.00     0.13     0.12     0.00    21.73    21.47     3.50
## 15     0.23     0.00     0.03     0.02     0.00    12.05     8.13     0.18
## 24     0.23     0.00     0.32     0.03     0.00     3.58     5.52     0.07
## 27     0.00     0.13     0.02     0.08     0.08     1.42     7.63     2.55
## 23     2.12     0.00     0.17     1.80     0.02     9.08     9.22     0.05
## 19     1.58     0.18     0.07     0.27     0.02     7.23     4.95    22.08
## 18     0.30     0.18     0.23     0.25     0.25     0.23     0.00     0.00
## 15     2.65     0.13     0.18     0.23     0.25     1.23     0.00     0.00
## 24     8.93     0.00     0.20     0.48     0.00     0.07     0.10     0.02
## 27     0.15     0.00     0.38     0.12     0.10     0.03     0.00     0.02
## 23     0.73     0.08     1.42     0.50     0.17     1.78     0.05     0.05
## 19     0.25     0.10     0.25     0.18     0.10     0.12     0.05     0.02
##    Cladamau Cladsp Cetreric Cetrisla Flavniva Nepharct Stersp Peltapht Icmaeric
## 18     0.08   0.02     0.02     0.00     0.12     0.02   0.62     0.02        0
## 15     0.00   0.00     0.15     0.03     0.00     0.00   0.85     0.00        0
## 24     0.00   0.00     0.78     0.12     0.00     0.00   0.03     0.00        0
## 27     0.00   0.02     0.00     0.00     0.00     0.00   0.00     0.07        0
## 23     0.00   0.00     0.00     0.00     0.02     0.00   1.58     0.33        0
## 19     0.00   0.00     0.00     0.00     0.02     0.00   0.28     0.00        0
## 18        0     0.25        0
## 15        0     1.00        0
## 24        0     0.33        0
## 27        0     0.15        0
## 23        0     1.97        0
## 19        0     0.37        0
head(veg_envt)
##       N    P     K    Ca    Mg    S    Al   Fe    Mn   Zn  Mo Baresoil Humdepth
## 18 19.8 42.1 139.9 519.4  90.0 32.3  39.0 40.9  58.1  4.5 0.3     43.9      2.2
## 15 13.4 39.1 167.3 356.7  70.7 35.2  88.1 39.0  52.4  5.4 0.3     23.6      2.2
## 24 20.2 67.7 207.1 973.3 209.1 58.1 138.0 35.4  32.1 16.8 0.8     21.2      2.0
## 27 20.6 60.8 233.7 834.0 127.2 40.7  15.4  4.4 132.0 10.7 0.2     18.7      2.9
## 23 23.8 54.5 180.6 777.0 125.8 39.5  24.2  3.0  50.1  6.6 0.3     46.0      3.0
## 19 22.8 40.9 171.4 691.8 151.4 40.8 104.8 17.6  43.6  9.1 0.4     40.5      3.8
##     pH
## 18 2.7
## 15 2.8
## 24 3.0
## 27 2.8
## 23 2.7
## 19 2.7

The tabasco function in vegan provides a simple heatmap of species (row) abundances vs. sample number (column). Yellow cells indicate high species abundances, white cells indicate the absence of a species in that sample.

vegan::tabasco(veg_comm, col = viridis(12, direction = -1))

What do you notice about species abundance / incidence in this plot? Is this dataset well-suited to ordination?

## Coenoclines

Relationships between species communities and environmental gradients are given their own name – coenoclines. The code below can be used to plot relationships between species abundance and different environmental variables. What do you notice about species distributions in response to different variables?

coeno_plot <- function(chosen_var){

var_to_plot <- veg_envt %>% select(chosen_var)

cbind(veg_comm, var_to_plot) %>%
gather(key = "species", value = "abundance", -chosen_var) %>%
ggplot(aes(x = !! sym(chosen_var),
y = abundance,
color = species)) +
geom_point() +
stat_smooth(se = FALSE, alpha = .5, method = "loess") +
ylim(-5, max(veg_comm))
}
names(veg_envt)
##  [1] "N"        "P"        "K"        "Ca"       "Mg"       "S"
##  [7] "Al"       "Fe"       "Mn"       "Zn"       "Mo"       "Baresoil"
## [13] "Humdepth" "pH"
coeno_plot("pH")

# Distance Measures

Across samples, it is clear that species abundances and instances vary considerably. Across all recorded species, how do we determine how different two samples are? Would we simply sum up the total differences between each column? Or some other approach?

These decisions underpin the importance of distance measures in ordination. Given a series of points across 1 - many dimensions, a distance measure provides a quantitative assessment of how dissimilar two samples are.

### How do we define dissimilarity?

Ecological similarity can be defined in a number of different ways, depending on the distance measure used. These metrics will be sensitive to different patterns in the data – in analyzing data, you will have to make a choice in how you express similarity.

As an example, let’s assume we’re asked to describe the distance between buildings in New York City. How would you describe this? The lat-long distance between points? The number of city blocks between them? Or something else entirely?

Note that three of these lines – red, blue, yellow – are all the same length: 6 blocks vertically and 6 blocks horizontally.

Distance in lat-long space is referred to as Euclidean distance. Euclidean distance can also be generalized to many dimensions. The same is true for the block-by-block distance, appropriately named Manhattan distance or “taxicab distance”, which is the sum of differences in X + sum of differences in Y, etc.

Distance metrics have certain properties:

Distances are always greater than or equal to 0 $1. d(x, y) \geq 0$

A distance of zero means that samples are identical $2. d(x, y) = 0 \iff x = y$

Distances are symmetric; It doesn’t matter where your starting point is $3. d(x, y) = d(y, x)$ Triangle inequalty; a distance between two communities is less than or equal to the summed distance between those two communities and a third $4. d(x, y) \leq d(x, z) + d(z, y)$

Note: The last property is often violated by several popular ecological distance functions. In this case, these distances are considered “semimetric”. Bray-Curtis (Sorenson) dissimilarity is a popular metric that violates this property.

## What distance measure do I use?

The number of choices of different distance measures can be dizzying – thankfully, there are some common distance functions used in ecology that are defined by their relationship to species abundance and richness (presence/absence).

The central concerns to choosing a distance metric are:

1. How much weight do I want to give common vs. rare species?

2. What is the length of the environmental gradient that I am measuring? Are gradients long enough that species are expected to show unimodal patterns of abundance? Or do I expect linear relationships?

3. Am I willing to transform my data? More on that here

In addition to the standard Euclidean distance, two of the most popular dissimilarity measures in ecology are the Bray-Curtis (abundance-weighted) and Jaccard (binary, presence-absence) distances.

Euclidean $\sqrt{\sum_{i=1}^S (n_{1i} - n_{2i})^2}$

• $$n_{1i}$$ = abundance of species $$i$$ in site 1
• $$n_{2i}$$ = abundance of species $$i$$ in site 2
• $$S$$ = total number of species
• What is the total deviation in units of abundance between two communities?

Bray-Curtis: $\sum_{i=1}^S \frac{2 * min(n_{1i}, n_{2i})}{\sum{n_{1i}} + \sum{n_2i}}$

• $$n_{1i}$$ = abundance of species $$i$$ in site 1
• $$n_{2i}$$ = abundance of species $$i$$ in site 2
• $$S$$ = total number of species
• What proportion of total species abundances differ between two communities?

Jaccard: $\frac{A}{A + B+ C}$

• $$A:$$ number of species in both sites
• $$B:$$ number of species in first site only
• $$C:$$ number of species in second site only
• What proportion of the combined species pool is found between two communities?

# 2 Example ecological communities
com1 <- data.frame(spA = 25, spB = 50, spC = 10, spD = 0, spE = 0, spF = 0, spG = 0, spH = 0, row.names = "Com 1")
com2 <- data.frame(spA = 12, spB = 25, spC = 5, spD = 1, spE = 2, spF = 1, spG = 1, spH = 1, row.names = "Com 2")
(com_sample <- rbind(com1, com2))
##       spA spB spC spD spE spF spG spH
## Com 1  25  50  10   0   0   0   0   0
## Com 2  12  25   5   1   2   1   1   1
# Euclidean distance #
# Large differences due to relative abundances -- agnostic to species identity
vegdist(com_sample, method = "euclidean") 
##          Com 1
## Com 2 28.75761
# Bray-Curtis distance #
# Difference in abundance relative to total abundance of that species, bounded between 0 and 1
vegdist(com_sample, method = "bray") 
##           Com 1
## Com 2 0.3684211
# Jaccard (binary) distance #
# 3/8 species shared, bounded between 0 and 1
vegdist(com_sample, method = "jaccard", binary = TRUE) 
##       Com 1
## Com 2 0.625