```r
ASV <- read.delim(paste0(path, "data/ASV.txt"), sep = "\t", header = TRUE) # abundance data of OTUs at sampling locations (OTU x location)
ENV <- read.delim(paste0(path, "data/ENV.txt"), sep = "\t", header = TRUE) # environmental data of sampling locations
LOC <- read.delim(paste0(path, "data/Sample_Classification.tsv"), sep = "\t", header = TRUE) # location data for sampling locations, e.g. depth / ocean layer
```
Mediterranean Microbial Network Part I: Descriptive Community Ecology using NMDS
Non-metric Multi-Dimensional Scaling (NMDS) is a common tool to show the structure of a community and a flexible way to reduce dimensions. This is the first post of the Mediterranean Microbial Network series.
- This is a meta-analysis for demonstration in the M.Sc. Biology module Network Analysis in Ecology and Beyond, offered by Prof. Dr. Jochen Fründ at the University of Hamburg.
- The data used is credited to Deutschmann et al. (2024) and is available in the original repo as well as in ours.
- Data wrangling was done by Jan Moritz Fehr.
Ocean Microbiome
We adopted the ocean microbial data processed and published by Deutschmann et al. (2024) and focused on the Mediterranean subset. Thanks to the comprehensive record of ecological parameters, we were able to divide the dataset by depth. The data was originally categorized into 5 layers:
- Surface (SRF): 0 - 3 m
- Epipelagic (EPI): 12 - 50 m
- Deep-Chlorophyll Maximum (DCM): 40 - 130 m
- Mesopelagic (MES): 200 - 1000 m
- Bathypelagic (BAT): 1100 - 3300 m
We have combined EPI and DCM for the sake of simplicity, and also because DCM is recognized as part of EPI. EPI is also called the sunlit zone, where solar energy can still be utilized for photosynthesis. Since red light is absorbed faster than blue light, only dim blue light remains in MES, also known as the twilight zone, and it cannot be used for photosynthesis. The light eventually disappears in BAT, the depth at which most of the Mediterranean Sea hits the sea floor.
In each layer, the species composition beyond microorganisms differs: in SRF and EPI, phytoplankton import solar energy into the ecosystem, which attracts small zooplankton to feed at this depth. As photosynthesis is no longer possible in MES, it is home to many nekton: fishes, squids, planktonic crustaceans, etc. In BAT, as we approach the sea floor, benthic fauna become an important part. Such differences in interaction partners will also affect the microbial communities. But to what extent? And how exactly do the communities differ? The co-occurrence / association data of the microorganisms can help us answer these questions.
Deutschmann et al. (2024) published Amplicon Sequence Variant (ASV) abundance data, comprising an abundance matrix in which each row represents a sample site and each column an OTU (Operational Taxonomic Unit, commonly used in microbiology as a proxy for the species concept). This matrix directly represents an adjacency matrix for a species-location bipartite network.
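To make that reading concrete, here is a minimal sketch (in Python, with made-up toy numbers rather than the actual dataset) of how a site x OTU abundance matrix doubles as a weighted edge list of the bipartite network:

```python
# Toy site x OTU abundance matrix (rows = sample sites, columns = OTUs).
# All numbers are made up for illustration; the real matrix is 145 x 3208.
abundance = {
    "site1": {"otu1": 12, "otu2": 0, "otu3": 3},
    "site2": {"otu1": 0, "otu2": 7, "otu3": 1},
}

# Every non-zero cell is a weighted edge between a site and an OTU,
# so the abundance matrix IS the bipartite adjacency matrix.
edges = [
    (site, otu, count)
    for site, row in abundance.items()
    for otu, count in row.items()
    if count > 0
]

print(edges)
# [('site1', 'otu1', 12), ('site1', 'otu3', 3), ('site2', 'otu2', 7), ('site2', 'otu3', 1)]
```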
For more information about ocean layers, kindly see Know Your Ocean from WHOI.
Let's get the data imported and start with some basic data wrangling:
`MS` is now a dataframe comprising n = 145 rows as sample sites and p = 3208 columns of OTUs, which stores the abundance data in the form of an adjacency matrix. There are 16 additional columns at the end for ecological data.
Non-metric Multi-Dimensional Scaling (NMDS)
The most intuitive way to get an overview of a community is to plot it, and NMDS is the most common way to do so.
```r
library(tidyverse)
library(vegan)

set.seed(1)
ab.mtx <- as.matrix(MS[, 1:ab.mtx.ncol]) # extract abundance matrix
```
Data Preparation
Taking a brief look at the data, it is very clear that our microbial data is heavily skewed. A few species have extremely high read counts while the vast majority occur at very low frequency, and almost 85% of the cells in our matrix are zero.
```r
ab.mtx[ab.mtx == 0] %>% length() / ab.mtx %>% length()
#> [1] 0.8453521

max(ab.mtx)
#> [1] 59282

hist(ab.mtx)
```
It’s worth noting that `vegan::metaMDS()`, the function that performs the NMDS calculation, will try to auto-transform the data unless `autotransform = FALSE` is specified. In our case, given the extreme skewness of the data and the clear pattern, the choice of transformation method did not make a big difference. But this is not always true, and one should select these methods cautiously. The Hellinger transformation is one way to reduce the weight of extremely high values.
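For intuition on what the Hellinger transformation does, here is a stand-alone sketch (Python, with toy counts echoing the maximum read count above, not the actual matrix): each count is divided by its site total and square-rooted, which strongly compresses the dominant species:

```python
import math

def hellinger(row):
    """Hellinger-transform one site's counts: sqrt of relative abundances."""
    total = sum(row)
    return [math.sqrt(x / total) for x in row]

# One hugely dominant OTU, two rare ones, one absent (toy numbers)
site = [59282, 100, 1, 0]
transformed = hellinger(site)
print([round(v, 3) for v in transformed])
# [0.999, 0.041, 0.004, 0.0] -- the 600x gap between dominant and rare shrinks to ~24x
```

A useful side effect: each transformed row has unit sum of squares, so Euclidean distances on Hellinger-transformed data become the Hellinger distance.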
Perform NMDS
Every community ecologist should be familiar with the basic idea of NMDS, which is dimension reduction. It converts our n x p matrix into an n x n dissimilarity matrix. This step allows a flexible decision on how the distances should be calculated. A common choice when dealing with species composition is the Bray-Curtis dissimilarity, since this semi-metric takes both presence/absence and abundance information into account. It is also the default in `metaMDS()`.
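The Bray-Curtis dissimilarity between two sites is the sum of absolute abundance differences divided by the total abundance, sum(|x - y|) / sum(x + y). A quick sketch with toy counts (Python, not from the dataset):

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den

site_a = [10, 0, 5]
site_b = [10, 0, 5]
site_c = [0, 8, 0]

print(bray_curtis(site_a, site_b))  # 0.0 (identical communities)
print(bray_curtis(site_a, site_c))  # 1.0 (no shared species)
```

Because shared absences never enter the sums, two sites are not considered similar merely for both lacking the same rare OTUs, which matters in a matrix that is 85% zeros.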
NMDS differs a bit from other ordination methods, like PCA, in how it handles distances and calculates axes and coordinates. Its non-metric nature means that distances on the final plot do not represent the actual distances between samples, only their rank order. Consequently, it does not try to maximize the variability along its axes, which means the axes are completely arbitrary and the technique is mainly used for visualization (Bakker 2024).
Before starting NMDS, we have to specify k, the number of axes we want. NMDS then iteratively adjusts the scaling. After each iteration, it calculates a stress value showing how well the ordination represents the real data. The value ranges from 0 to 1, where 0 means a perfect fit; values above 0.2 already signal that interpreting the ordination is risky. Generally, a stress value under 0.1 is considered a good result (Clarke 1993; Bakker 2024).
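To illustrate the "rank order only" point: a perfect (zero-stress) NMDS solution reproduces the ordering of the dissimilarities, not their magnitudes. A minimal check (Python, made-up pairwise values and a hypothetical `ranks` helper):

```python
def ranks(values):
    """Rank values from 1..n (no tie handling needed for this toy example)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

# Toy pairwise values: original dissimilarities and the corresponding
# 2-D ordination distances (made-up numbers, different scales).
dissimilarity = [0.10, 0.35, 0.20, 0.80, 0.55]
plot_distance = [0.3, 1.1, 0.6, 2.9, 1.8]

# NMDS only promises that the ORDER of plot distances matches the order
# of the dissimilarities, even though the raw values differ.
assert ranks(dissimilarity) == ranks(plot_distance)
print("rank order preserved:", ranks(dissimilarity))
```

The stress value essentially measures how far the ordination deviates from this monotone relationship.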
```r
NMDS <- metaMDS(ab.mtx.tf, k = 2, autotransform = FALSE, trymax = 50)
```
NMDS Plot using {ggplot2}
There are several plotting functions in {vegan}, but since we were trying to make a fancy poster, I brought out my old script to build a fully controlled plot with {ggplot2}.
```r
# Extract coordinates
siteCoord <- as_tibble(scores(NMDS)$sites)
spCoord <- as_tibble(scores(NMDS)$species)

# Import aesthetics setting
source(paste0(path, "analysis/AES.R"))

# Plot the species
p.NMDS.sp <- ggplot() +
  geom_point(data = spCoord, aes(x = NMDS1, y = NMDS2), size = 2, shape = 3, color = "grey", alpha = 0.25) +
  aes_ACTCP_ggplot() +
  theme(aspect.ratio = 1) +
  coord_equal()
p.NMDS.sp
```
As simple as that, we have our species plotted. But we are not really interested in individual species; rather, the communities at different sites. For that purpose, we first have to pass the environmental data of each site into `siteCoord`.
```r
# Extract environmental data, combine DCM with EPI
env <- MS[, (ab.mtx.ncol + 1):ncol(MS)] %>%
  mutate(layer = case_when(layer == "DCM" ~ "EPI", .default = layer))
siteCoord_env <- cbind(siteCoord, env)

# Order the layers
siteCoord_env_ordered <- siteCoord_env %>%
  mutate(layer = factor(layer, levels = c("SRF", "EPI", "MES", "BAT")))
```
Then we can plot them as usual:
```r
# Define colors and shapes for the layers
depth.colors <- c(
  "SRF" = "#6CCCD6",
  "EPI" = "#007896",
  "MES" = "#004359",
  "BAT" = "#001C25"
)
depth.shapes <- c(
  "SRF" = 15,
  "EPI" = 16,
  "MES" = 17,
  "BAT" = 18
)

p.NMDS.st <- ggplot() +
  geom_point(data = siteCoord_env_ordered, aes(x = NMDS1, y = NMDS2, color = layer, shape = layer), size = 3.5, alpha = 0.8) +
  aes_ACTCP_ggplot() +
  theme(aspect.ratio = 1) +
  scale_color_manual(values = depth.colors) +
  scale_shape_manual(values = depth.shapes) +
  guides(color = guide_legend("Layer"), fill = "none", shape = guide_legend("Layer"))
p.NMDS.st
```
Now this tells us more: sample sites from the same layer tend to cluster together, with MES and BAT forming a large group distinct from EPI and SRF. A common way to show such categorical data in NMDS is to draw a convex hull around each cluster.
NMDS Hullplot using {ggplot2}
The basic idea is simply to use `chull()` to identify the points of a cluster that lie on the border of its convex hull and draw a polygon through those points. The problem is that we have some outliers here, especially from MES (those neighboring BAT and SRF). These points would end up defining the convex hull and make it cover much more area than the cluster really occupies.

To address this, I additionally calculate the Euclidean centroid of each cluster and filter out outliers by comparing each point's distance to the centroid with the median of these distances within the same cluster. Points whose distance exceeds a threshold, derived from the median distance and a defined multiplier, are excluded.
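The same filtering idea in a compact, stand-alone form (Python sketch with toy coordinates and a hypothetical `trim_outliers` helper, not the R function used in this post):

```python
import math

def trim_outliers(points, threshold_factor=2.0):
    """Keep points whose distance to the cluster centroid is at most
    threshold_factor times the median of those distances."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    dists = [math.hypot(x - cx, y - cy) for x, y in points]
    srt = sorted(dists)
    med = srt[n // 2] if n % 2 else (srt[n // 2 - 1] + srt[n // 2]) / 2
    threshold = threshold_factor * med
    return [p for p, d in zip(points, dists) if d <= threshold]

# Four tight cluster points plus one far-away outlier (toy coordinates)
cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
trimmed = trim_outliers(cluster)
print(trimmed)  # the (10.0, 10.0) outlier is dropped
```

Using the median rather than the mean as the baseline keeps the threshold itself robust: a single extreme point barely shifts the median, so it cannot "rescue" itself from being trimmed.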
```r
# Calculate the coordinates of the hull border
outliner <- function(coord, group_var, threshold_factor = 2, threshold_func = median) {
  centroid <- colMeans(coord[, c(1, 2)])
  centroids <- matrix(c(rep(centroid[1], nrow(coord)), rep(centroid[2], nrow(coord))), ncol = 2)

  # Euclidean distances to the cluster centroid
  coord$distances <- sqrt(rowSums((coord[, c(1, 2)] - centroids)^2))

  # Set threshold
  threshold <- threshold_factor * threshold_func(coord$distances)

  # Filter out points with distance larger than threshold
  coord_in <- coord[coord$distances <= threshold, ]

  outline_indices <- chull(coord_in[, c(1, 2)])

  return(list(
    borders = coord_in[outline_indices, ],
    centroid = c(centroid, group = group_var[[1]])
  ))
}
```
`threshold_factor` controls how conservative or inclusive the convex hull is drawn. `threshold_func` defines which function sets the baseline of the threshold; the default here is `median()`, since `mean()` can skew the baseline if extreme outliers exist. We then apply this process to each category and extract the hull border coordinates for ggplot.
```r
# Using the ordered factor from `siteCoord_env_ordered` can't successfully pass the layer name through the pipeline
layer.siteCoord <- siteCoord_env %>%
  group_by(layer) %>%
  group_map(~ outliner(.x, .$layer), .keep = TRUE) %>%
  `names<-`(group_keys(siteCoord_env %>% group_by(layer))[[1]]) %>%
  do.call(what = rbind, args = .) %>%
  as.data.frame()

# Extract border and centroid coordinates
borders.pCoord <- layer.siteCoord[["borders"]] %>% bind_rows()
centroids.pCoord <- layer.siteCoord[["centroid"]] %>% bind_rows() %>%
  mutate(NMDS1 = as.numeric(NMDS1), NMDS2 = as.numeric(NMDS2))
```
centroids.pCoord
```r
p.NMDS.st.hull <- p.NMDS.st +
  geom_polygon(data = borders.pCoord, aes(x = NMDS1, y = NMDS2, fill = layer), alpha = 0.5) + # hull
  geom_text(data = centroids.pCoord, aes(x = NMDS1, y = NMDS2, label = group)) +
  scale_fill_manual(values = depth.colors)
p.NMDS.st.hull
```
Now each layer forms a cluster with little overlap with the others, except EPI and SRF, which also makes sense: even if microorganisms in SRF have different interaction partners than the rest of EPI, the environmental conditions and the general species composition, in particular functionally, should be similar. This is also reflected by SRF sitting on the right side of EPI, at a larger distance to MES and BAT.
The space between MES and EPI is not enclosed by either hull; it reflects a "bridge" section, probably the depth range where both layers meet. There is a numerical depth variable in `env`, which is essentially continuous data complementing our layer categories. While we used a hull plot to visualize categories in NMDS, we can use contour lines to visualize continuous variables.
Fitting smooth surface
If we imagine the variable we want to display as a third dimension of our points on the NMDS, visualizing it becomes a matter of rendering a 3-D structure on our 2-D plot. One solution is to fit a surface to our plot using `vegan::ordisurf()`.
The way of extracting contour lines from `vegan::ordisurf()` and the function `extract.xyz()` are credited to Christopher Chizinski’s post.
```r
library(geomtextpath)

extract.xyz <- function(obj) {
  print(obj)
  xy <- expand.grid(x = obj$grid$x, y = obj$grid$y)
  xyz <- cbind(xy, c(obj$grid$z))
  names(xyz) <- c("x", "y", "z")
  return(xyz)
}

depth.surf <- ordisurf(NMDS, env$depth, permutation = 999, na.rm = TRUE, plot = FALSE)
depth.contour <- extract.xyz(obj = depth.surf)
```
```
Family: gaussian
Link function: identity

Formula:
y ~ s(x1, x2, k = 10, bs = "tp", fx = FALSE)

Estimated degrees of freedom:
7.84  total = 8.84

REML score: 1057.237
```
```r
p.NMDS.st.hull.cont <- p.NMDS.st.hull +
  geom_textcontour(data = depth.contour, aes(x, y, z = z), color = "#c4438c")
p.NMDS.st.hull.cont
```
We see that the points on the MES/EPI border can be perfectly explained by depth.
But what we did so far is basic descriptive work in community ecology. We still want to look into each community's structure using network analysis, which I will cover in the upcoming post. To be continued.
Reference
Citation
```bibtex
@online{hu2025,
  author = {Hu, Zhehao},
  title = {Descriptive {Community} {Ecology} Using {NMDS}},
  date = {2025-05-16},
  url = {https://zzzhehao.github.io/biology-research/community-ecology/MediMicro-NMDS.html},
  langid = {en},
  abstract = {This is the first post of the Mediterranean Microbial Network series.}
}
```