Learn to understand what biodiversity means, how to quantify and compare it using metrics such as Shannon-Weiner and Simpson’s indexes, and visualize it using Rank-Abundance Curves. We use a dataset featuring rich invertebrate diversity collected from shallow stream beds in the Turkey Lakes area in Ontario, Canada.
Biodiversity is a hot topic in the face of unprecedented global change in the 21st century. With a publicly available dataset, from the Turkey Lakes Watershed in Ontario, Canada, we demonstrate how to quantify biodiversity quantitatively and visually using R.
Using historical data from a freshwater lake ecosystem, by the end of this tutorial you will be able to:
In an ever-changing world, we are often concerned with how organisms will be able to cope with disturbances and increasingly, global climate change. Biodiversity is the variety of life at all levels, from genes to ecosystems. Not only is biodiversity important for ecosystems to function properly, it is also an important buffer for withstanding changing environments. When there is high biodiversity, there is a higher chance of some organisms surviving ecosystem change due to disturbance or climate change. Biodiversity therefore increases the resiliency of these ecosystems and the likelihood that they will be able to persist.
We have hidden the R code that is not part of the learning objectives, but if you want to view or recreate this code, click the “Show Code” button
knitr::include_graphics("biodiv_scheme.png") # including an image
Figure 1: Importance of biodiversity for downstream effects on ecosystem success, structure, and function. Reproduced from (Loreau et al. 2001).
Ecologists are often interested in the effects of perturbations (e.g. disturbances like fires, logging, flooding or pollution) on biodiversity, which can yield information and context for ecosystem health. However, boiling biodiversity down to a single metric can be difficult and incomplete depending on the goal. With the help of the Turkey Lake dataset of benthic invertebrates, we will present you with a number of common metrics to assess biodiversity, their meaning, and how they can be visualized with the help of rank-abundance curves.
First, we will teach you how to calculate these metrics using equations and abundance data, and then we will teach you how to use the codyn package to calculate these biodiversity metrics quickly and easily.
You will then have the option to learn to calculate biodiversity metrics yourself using an R package codyn.
library(leaflet)
prefix = 'https://api.gbif.org/v2/map/occurrence/density/{z}/{x}/{y}@1x.png?'
# downloading global species occurrence data
style = 'style=purpleYellow.point'
tile = paste0(prefix,style)
leaflet() %>% # making an interactive map
setView(lng = 20, lat = 20, zoom = 01) %>% # setting the coordinates
addTiles() %>%
addTiles(urlTemplate=tile) # adding our biodiversity layer
Figure 2: Species biodiversity across the globe, downloaded from the Global Biodiversity Information Facility (GBIF.org 2022).
Biodiversity is a broad term, and can be difficult to quantify. Biologists have created several standard measures of biodiversity that convey different information about ecosystems and the species living there.
Often times, biologists interested in collecting biodiversity data collect a standard set of responses (see Fig. 3). These often include:
Often this data is collected over a long period of time, which we call time series data. This allows us to see how these ecosystems are changing due to some disturbance or other changing environmental condition.
knitr::include_graphics("species_scheme.png") # adding an image
Figure 3: Schematic of freshwater lake ecosystem demonstrating the different information biologists to collect to understand biodiversity
The Turkey Lakes watershed is located in Ontario, Canada, approximately 50 km north of Sault Ste. Marie. It is 10 km2 in area and contains a chain of 4 lakes. The Turkey Lakes Watershed Study (see map) is an initiative by several agencies of the government of Canada, initially designed to study the effects of acid rain on forest ecosystems.
knitr::include_graphics("TurkeyMap.jpg")
Figure 4: Turkey Lakes watershed and catchment areas (labelled as cXX). Map reproduced from Webster et al. (2021), which features a summary of the research carried out within this watershed in the past 40 years (circa 2020).
From 1995 to 2009, scientists collected, identified and counted benthic invertebrates from various stream beds around the Turkey Lakes catchment. Benthic invertebrates are small, often microscopic organisms (see Fig. 5), but form an important link between aquatic and terrestrial habitats. They can decompose terrestrial matter such as leaves, or consume periphyton (algae and cyanobacteria) growing on rocks within their streams. Benthic invertebrates can be a food source for aquatic animals like small fish, or terrestrial animals such as birds.
An experiment was conducted in 1997 where certain sampling sites were logged with different intensities. We can hypothesize that the biodiversity of benthic invertebrate communities might be affected by logging because of the effects that logging has on the stream ecosystem such as increased sunlight and temperature, increased photosynthetic activity in the stream, and decreased leaf litter. This 14 year time series allows us to look at the immediate effects of logging as well as the subsequent recovery of the stream as foliage begins to grow back.
knitr::include_graphics("benthicmacroinvertebrates_g.carter_noaa600px.jpg") # include image
Figure 5: Various benthic macroinvertebrates under a stereo microscope (EPA 2022)
These data are from the Turkey Lakes watershed experiment. This dataset contains the abundances of benthic invertebrate species measured in the May and June from 1995 to 2001. A surber sampler (see video below) was used to collect invertebrates from stream beds.
pacman::p_load("vembedr") # load package for including video
The downloaded binary packages are in
/var/folders/5x/g_hsyrnj77g9ty9j4n7c0kp80000gn/T//RtmpokR10U/downloaded_packages
embed_url("https://www.youtube.com/watch?v=BNrc9YkPpfA") %>%
use_bs_responsive()
Several catchments within the watershed were sampled at ten sites each (for replication). In 1997, some catchments were logged with varying harvest intensities (low, medium, and high).
First, we will subset the data, clean and process it in order to visualize and compare rank-abundance curves.
Later in the tutorial, we will using the subsetted data with the codyn package to analyze the diversity of these benthic invertebrates over time.
The data “TWL_invertebrateDensity.csv” from the Government of Canada “Open Government” Portal on the “Turkey Lakes Watershed” page can be downloaded and saved as a data.frame object using the read.csv function.
We can look at all the columns of the dataset using the colnames() function.
# load Turkey Lakes invertebrate density data
# from the Govornament of Canada website
df.read <- read.csv(file="https://ftp.maps.canada.ca/pub//nrcan_rncan/Forests_Foret/TLW/TLW_invertebrateDensity.csv")
# Or, if that doesn't work, from our GitHub:
if(!exists("df.read")) {
df.read <-
read.csv("https://raw.githubusercontent.com/Living-Data-Tutorials/website/main/_lessons/2022-08-26-species-richness-rank-abundance-curves/Data/TLW_invertebrateDensity.csv")
}
Now, take a look at the column names:
colnames(df.read)
[1] "catchment" "year" "month"
[4] "replicate" "Aeshna" "Alloperla"
[7] "Ameletus" "Amphinemura" "Antocha"
[10] "Baetis" "Boyeria" "Centroptilum"
[13] "Ceratopogonidae" "Chelifera" "Cheumatopsyche"
[16] "Chironomidae" "Chloroperlidae" "Chloroperlinae"
[19] "Clinocera" "Clitellata" "Collembola"
[22] "Culicidae" "Dicranota" "Diptera"
[25] "Diura" "Dixa" "Dixidae"
[28] "Dolichopodidae" "Dolophilodes" "Dytiscidae"
[31] "Elmidae" "Empididae" "Epeorus"
[34] "Ephemerella" "Ephemerellidae" "Epitheca"
[37] "Eurylophella" "Glossosoma" "Gomphidae"
[40] "Habrophlebia" "Haploperla" "Heptagenia"
[43] "Heptageniidae" "Hesperophylax" "Heterocloeon"
[46] "Hexatoma" "Holorusia" "Hydatophylax"
[49] "Hydrophilidae" "Hydropsyche" "Hydropsychidae"
[52] "Hydroptila" "Isogenoides" "Isoperla"
[55] "Lepidostoma" "Leptoceridae" "Leptophlebia"
[58] "Leptophlebiidae" "Leuctra" "Limnephilidae"
[61] "Limnephilus" "Limnophila" "Lype"
[64] "Molophilus" "Natarsia" "Nemoura"
[67] "Neophylax" "Neurocordulia" "Odonata"
[70] "Onocosmoecus" "Oreogeton" "Ormosia"
[73] "Oxyethira" "Paracapnia" "Paraleptophlebia"
[76] "Parapsyche" "Pedicia" "Plecoptera"
[79] "Polycentropus" "Prosimulium" "Pseudolimnophila"
[82] "Pseudostenophylax" "Psychoglypha" "Pycnopsyche"
[85] "Rhithrogena" "Rhyacophila" "Serratella"
[88] "Simulium" "Stenacron" "Stenonema"
[91] "Tabanidae" "Taeniopteryx" "Tanypodinae"
[94] "Tanytarsini" "Tipula" "Tipulidae"
[97] "Trichoptera"
As you can see, the columns of the dataset are:
Catchment number. There are 11 catchments (stream study areas), and each has an alphanumeric code.
Year and month that the data was collected.
A number for the replicate (there are 10 replicates for each sampling date, which means they counted the abundances of benthic invertebrates from ten sites within each catchment on each sampling date).
A column for each of the species with their densities. Densities in units of individuals per m2 were calculated by dividing the counts by the area of the surber sampler (see video above; 0.32 m2 in this case).
First, in order to quantify biodiversity, we will convert the species densities reported in the dataset back to the raw counts (i.e. abundances) to obtain whole numbers of individuals.
To do this, it is helpful to transform the “wide” format data.frame (where each species is a column) to “long” format, where species names are all found in a single column using the tidyr::pivot_longer() function.
library(dplyr)
df.count <-
df.read %>%
# Convert from wide to long format to simplify the operation
tidyr::pivot_longer(
# Select the columns we want in long format
# (i.e. the columns with invertebrate species)
"Aeshna":"Trichoptera",
# Name the column that we are creating for
# the names of the invertebrate species
names_to = "Species",
# Name the column that we want the density values to do to
values_to = "Density" ) %>%
mutate(
# Change densities to counts
Count = Density * 0.33,
# Replace NAs with 0 because NA
# signifies species absnce
Count = tidyr::replace_na(Count, 0)
)
⚠️ If you plan on using metrics of biodiversity, we recommend using Hill numbers for your analysis. See Roswell, Dushoff, and Winfree (2021). Although Hill numbers are outside the scope of this tutorial, understanding the biodiversity metrics presented here will help you understand Hill numbers as well.
There are several different ways to analyse and understand changes in community composition, or the relative abundances of all taxa in a community. Each of these metrics allow us to look at different aspects of the biodiversity data we collect (MacDonald, Nielsen, and Acorn 2016).
Each of these metrics can be calculated by hand using an equation.
Click through the tabs to learn how to calculate these different measures of biodiversity mathematically and using R!
Perhaps the simplest measure of biodiversity is species richness (\(S\)).
The equation for species richness is:
\(S = \sum_{i=1}^{S}p_i^o\)
\(p_i^o\) represents the proportion of individuals of species \(i\), taking the sum across each of these species.
Step 1: Count the number of unique species recorded in that year. Then you can look at how the number of different species in a community changes over time.
Interpretation: A higher number means there are more species, and therefore species richness is higher.
#loading necessary libraries (kableExtra makes tables)
pacman::p_load(kableExtra) # load package for making table
richness_table <- df.count %>%
# Removing species that had 0 observations
filter(Count > 0,
year %in% c(1995, 1997, 2001, 2009)) %>%
group_by(year) %>%
summarise(richness=length(unique(Species)), # counting the number of species that were observed in each year
# Check how many catchments were sampled in each year
catchments = length(unique(catchment)))
#Nice looking table using the kable() function
kable(richness_table, col.names=c("Year", "Richness", "Catchments Sampled"), align="c")%>%
kable_classic(full_width = F, html_font = "Cambria")
| Year | Richness | Catchments Sampled |
|---|---|---|
| 1995 | 48 | 6 |
| 1997 | 51 | 9 |
| 2001 | 65 | 9 |
| 2009 | 50 | 3 |
We can see in this table that across years, the number of species in the same area fluctuated. However, this metric is particularly sensitive to changes in sampling efforts because additional sampling is likely to uncover additional, rare species. As a result, the only reliable comparison is between 1997 and 2001, which suggests that species richness increased following the logging operations of 1997. However, this does not tell us anything about how the abundances of different species changed.
Another metric we can use to analyze biodiversity is the Shannon-Wiener Index (\(H\)), which indexes across the sum of all relative proportions of species richness with an additional logarithmic transformation. This equation takes into account both the number of species in a specific area (richness) and relative abundances (evenness).
The equation for the Shannon-Weiner Index is:
\[ H' = - \sum_{i=1}^{S}p_i~\ln(p_i) \]
Step 1: Calculate \(p_i\) for each species.
\[ p_i = n_i/N \]
Where \(n_i\) is the number of individuals of species \(i\) and \(N\) is the total number of individuals in the community.
Step 2: Multiply the proportion of each species (\(p_i\)) by the natural logarithm of the proportion (\(\ln(p_i)\))
Step 3: Sum each of these values for each species.
Step 4: Multiply the sum by -1.
Interpretation: The minimum value of the Shannon’s diversity index is 0, which means that there is no diversity (i.e. only one species is found in that habitat). The values increase as the number of species increase, and is maximized at a given number of species when is an equal abundance of each species.
shannon <-
df.count %>%
filter(year %in% c(1995, 1997, 2001, 2009)) %>%
# Summarise over catchments and replicate sites
group_by(year, Species) %>%
summarise(TotalCount=sum(Count)) %>%
mutate(
# Step 1
# Note that sum(TotalCounts) calculates the "N"
# for each year separately because the data.frame
# was previously grouped by year using the
# function group_by().
pi = TotalCount/sum(TotalCount),
# Step 2
pi_ln_pi = pi*log(pi)) %>%
# Steps 3 & 4.
# Note that na.rm = T because for species abundances = 0,
# p_i * log(p_i) = 0 * -Inf = NaN
summarise(shannon = -1*sum(pi_ln_pi, na.rm=T))
kable(shannon, col.names=c("Year", "Shannon's Index"), align="c")%>%
kable_classic(full_width = F, html_font = "Cambria")
| Year | Shannon’s Index |
|---|---|
| 1995 | 2.407170 |
| 1997 | 2.081374 |
| 2001 | 2.448311 |
| 2009 | 2.540719 |
We can see that the Shannon diversity changes over time, with the lowest value being the year of the forest harvesting operations (1997). However, it seems that biodiversity quickly recovers in the years following the harvesting.
Another metric we can use to analyse biodiversity in the Simpson’s index (\(D\)) which indexes across the sum of all relative proportions of species richness with an additional square power transformation. Though very similar to the calculation in the Shannon-Wiener Index, the Simpson index is more focused on dominance of species as it highlights the proportion of species in a sample.
The equation for Simpson’s diversity is:
\[D = 1-\sum_{i=1}^{S} \frac {n_i(n_i-1)}{N(N-1)}\]
Where \(n_i\) is the number of individuals of species \(i\) and \(N\) is the total number of individuals in the community.
Step 1: Multiply the number of individuals of a given species (\(n_i\)) by (\(n_i-1\)).
Step 2 Multiply the total number of individuals in the community (\(N\)) by (\(N-1\)).
Step 3: Divide the number from step 1 by the number from step 2.
Step 4: Once you have the numbers from step 3 for each species in the community, sum all of these together and substract from 1.
Interpretation: The higher the value of \(D\), the greater the diversity in the community. An index close to 1 means that there are several species in the community, and the population proportion of species is even. On the other hand, \(D=0\) indicates a single-species community
simpson <- df.count %>%
filter(year %in% c(1995, 1997, 2001, 2009)) %>%
# Summarise over catchments and replicate sites
group_by(year, Species) %>%
summarise(TotalCount=sum(Count)) %>%
mutate(
# Step 1
num = TotalCount*(TotalCount - 1 ),
# Step 2 (Note: data.frame grouped by year)
denom = sum(TotalCount)*(sum(TotalCount)-1),
# Step 3
frac = num/denom) %>%
# Step 4
summarise(simpson = 1 - sum(frac))
kable(simpson, col.names=c("Year", "Simpson's Index"), align="c")%>%
kable_classic(full_width = F, html_font = "Cambria")
| Year | Simpson’s Index |
|---|---|
| 1995 | 0.8372515 |
| 1997 | 0.7452368 |
| 2001 | 0.8423991 |
| 2009 | 0.8757629 |
Similarly to the Shannon-Weiner index, biodiversity dropped in 1997, but had recovered by 2001.
Additionally, we can think about the evenness of species across a given area. Evenness is a type of metric for assessing species dominance. If evenness is high, it means most species are of equal dominance. If evenness if low, it means some species are more dominant (i.e. have higher abundance) than others. Evenness is maximized when all species have equal abundances. Multiple metrics can be used to measure evenness, here we show an example of “Shannon-Wiener” evenness, however a similar approach can be used for simpson evenness.
The equation for “Shannon-Wiener” evenness is:
\(E = H / ln(S)\)
Step 1: Divide the value of the Shannon-Wiener index (\(H\)) by the species richness (\(S\)).
Interpretation: Values closer to one signify more evenness and values closer to 0 signify lower evenness.
# Combine previously calculated species richness
# and shannon index tables
evenness =
left_join(richness_table, shannon) %>%
# Step 1
mutate(evenness = shannon/log(richness)) %>%
select(year, evenness)
kable(evenness, col.names=c("Year", "Evenness"), align="c")%>%
kable_classic(full_width = F, html_font = "Cambria")
| Year | Evenness |
|---|---|
| 1995 | 0.6218149 |
| 1997 | 0.5293657 |
| 2001 | 0.5865078 |
| 2009 | 0.6494644 |
As with previous biodiversity metrics, we find that evenness is lowest in 1997, and recovers in the following years.
Apart from calculating biodiversity using the metrics above we can also visualize differences using plots.
One way to visualize biodiversity data is to make a rank-abundance curve. On the y-axis we have the species (numbered), and on the y axis we have the abundance. The plots are organized from high to low abundance. Using rank-abundance curves, we can visualize the change in relative abundance and species dominance over time.
As an example, we use a subset of the Turkey Lakes data with six of the more abundant species from the dataset to make a simple rank-abundance curve.
#loading necessary libraries
pacman::p_load(tidyverse, patchwork)
#creating a subset of the full data with just 6 species, across 4 years
subset <- df.count %>%
filter(Species %in% c("Chironomidae", "Prosimulium", "Baetis", "Heterocloeon", "Chelifera", "Leuctra")) %>%
filter(year %in% c(1995, 1997, 2001, 2009)) %>%
group_by(year, Species) %>%
summarise(sum=sum(Count))
#1995
a<- subset %>%
filter(year==1995) %>%
ggplot(aes(x=reorder(Species, -sum), y=sum, shape=Species, colour=Species))+
geom_point(size=4)+
scale_colour_viridis_d()+
ggtitle("1995")+
theme_classic()+
labs(x="Species", y="Abundance", color="Species")+
theme(axis.text.x = element_text(face="italic"), legend.position="none") # Use italics for species names
#2001
b<- subset %>%
filter(year==2001) %>%
ggplot(aes(x=reorder(Species, -sum), y=sum, shape=Species, colour=Species))+
geom_point(size=4)+
scale_colour_viridis_d()+
ggtitle("2001")+
theme_classic()+
labs(x="Species", y="Abundance", color="Species") +
theme(axis.text.x = element_text(face="italic"), legend.position="none") # Use italics for species names
a / b
Figure 6: Simple rank-abundance curve with 6 species from the Turkey Lakes dataset, with counts from 1995 and 2001. Colour and shape indicate a unique species, with y axis representing abundance.
We can see here between 1995 and 2001 in our benthic community, there were small differences in absolute counts, with the most abundant species (Chrironomidae) decreasing ~5000 in abundance of individuals, but still remaining the most abundance species overall across both years. Chironomidae are often very abundant in streams, and are good at taking advantage of the benefits that may be provided by logging, such as increased sunlight availability causing an increase in photosynthetic activity and increased periphyton (algae), which is their primary food source. Overall abundance of all species decreases in 2001 compared to 2005, likely because of the regrowth of foliage after logging and decrease in photosynthetic activity in the stream. Baetis was the 2nd most abundant species in 1995, but dropped to 5th most abundant by 2001. This suggests that this species may do better immediately following disturbance. However, we would need to look at data from more years to draw conclusions.
knitr::include_graphics("TopSpecies.png")