Mediterranean Microbial Network Part I: Descriptive Community Ecology using NMDS

Multivariate Analysis
Mediterranean Micobial
Published

May 16, 2025

Abstract

Non-metric Multi-Dimensional Scaling (NMDS) is a common tool to show the structure of a community and a flexible way to reduce dimensions. This is the first post of the Mediterranean Microbial Network series.

Ocean Microbiome

We adopted the ocean microbial data processed and published by Deutschmann et al. (2024), and focused on the Mediterranean subset. Thank to the comprehensive record of ecological parameters, we were able to divide the dataset according to their depth. The data was originally categorized in 5 layers:

  • Surface (SRF): 0 - 3 m
  • Epipelagic (EPI): 12 - 50 m
  • Deep-Chlorophyll Maximum (DCM): 40 - 130 m
  • Mesopelagic (MES): 200 - 1000 m
  • Bathypelagic (BAT): 1100 - 3300 m

We have combined EPI and DCM for the sake of simplicity and also because DCM is recognized as part of EPI. EPI is also called as sunlit zone, where solar resource could still be utilized by photosynthesis. As the red light being absorbed faster than the blue light, in MES, also known as twilight zone, there’s only dim blue light present, which can not be used for photosynthesis. The light will eventually disappear in BAT, at which depth the most area of Mediterranean Sea hits the sea floor.

In each layer, the species composition besides microorganism is different: in SRF and EPI, phytoplanktons are able to import solar energy into the ecosystem and this attracts small zooplanktons to feed at this depth. As photosynthesis becomes no longer possible in MES, it is home to many nektons: fishes, squids, planktonic crustaceans, etc. In BAT, as we approach the sea floor, benthic fauna becomes an important part. Such differences in interaction partners will also affect microorganism communities. But in what extent? Or how are the communities different? The co-occurrence / association data of the microorganism can help us answer those questions.

Deutschmann et al. (2024) published an Amplicon Sequences Variants (ASV) abundance data, which comprises an abundance matrix in the form that each row represents a sample site and each column represents an OTU (Operational Taxonomic Unit, commonly used in microbiology as proxies for species concept). This matrix directly represents an adjacency matrix for a species-location bipartite network.

For more information about ocean layers, kindly see Know Your Ocean from WHOI.

library(bipartite)
library(tidyverse)
library(vegan)
library(geomtextpath)
library(ggpubr)
library(furrr)

path <- "MediterraneanMicrobiome/"

Lets get the data imported and start with some basic data wrangling:

ASV <- read.delim(paste0(path, "data/ASV.txt"), sep = "\t", header = TRUE)
# ASV = abundance data of OTUs at sampling locations (OTU x location)
ENV <- read.delim(paste0(path, "data/ENV.txt"), sep = "\t", header = TRUE)
# ENV = environmental data of sampling locatoins
LOC <- read.delim(paste0(path, "data/Sample_Classification.tsv"), sep ="\t", header = T)
# LOC = Location data for sampling locations eg. depth / ocean layer
dfASV <- as.data.frame(t(ASV))
tENV <- as.data.frame(t(ENV))
colnames(dfASV) <- ASV$ID 
dfASV <- dfASV[-1, ]

dfASV <- dfASV %>% 
  mutate_at(1:5457, as.numeric) %>% # turning abundance data into numeric
  mutate(Sample=row.names(dfASV))  # adding column with sampling location names 
  # rename(dfASV$name, Sample = name) # naming this column "Sample"

colnames(tENV) <- ENV$ID # renaming the columns after the ID column in the ENV dataset. 
tENV <- tENV[-1, ] # deleting the ID row in the tENV dataframe
tENV <- tENV %>%
  mutate(Sample=row.names(tENV)) # adding column with sampling location names
  # rename(tENV$name, Sample = name) # naming this column "Sample"

new <- left_join(dfASV, LOC, by = NULL) # joining ASV and LOC dataframes
new <- left_join(new, tENV, by = NULL) 

Subsetting for mediterranean sample locations:

new.MS <- new[new[, "OceanRegion"] == "MS", ] # subsetting for samples from Mediterranean sea (MS)
sp.MS <- new.MS[, 1:5457] # subsetting only the species abundance part of the MS dataframe
sp.MS <- sp.MS %>% 
  mutate_at(1:5457, as.numeric) # tuning df into numeric data.
sp.MS <- sp.MS[, colSums(sp.MS) > 0] # only keeping columns where the sum is > 0. therefore getting rid off all OTUs not found in mediterranean sea

sp.MS <- sp.MS %>%
  mutate(Sample=new.MS$Sample) 
  # rename(sp.MS$name, Sample = name)

ENV.MS <- new.MS[, 5458:5473] # subsetting only location and environmental data of sampling locations in the MS

MS <- left_join( # rejoining species abundance and location / environmental data of MS
  sp.MS, 
  ENV.MS, 
  by = NULL
)

ab.mtx.ncol <- 3208

MS is now a dataframe comprising n = 145 rows as sample sites and p = 3208 columns of OTU, which stores the abundance data in the form of an adjacency matrix. There are additional 16 columns at the end for ecological data.

Non-metric Multi-Dimensional Scaling (NMDS)

The most intuitive way to have an overview on the community is to plot them out, NMDS is the most common way to do this.

library(tidyverse)
library(vegan)

set.seed(1)

ab.mtx <- as.matrix(MS[, 1:ab.mtx.ncol]) # extract abundance matrix

Data Preparation

Taking a brief look at the data, it’s very clear that out microbial data is actually largely skewed. There are a few species with extremely high reads, and a vast majority being very low frequent, and almost 85% of the cells in our matrix are zero.

ab.mtx[ab.mtx == 0] %>% length() / ab.mtx %>% length()
[1] 0.8453521
max(ab.mtx)
[1] 59282
hist(ab.mtx)

It’s worth noting that vegan::metaMDS(), the function that does the calculation for NMDS, will try to autotransform the data if not specified autotransform = False. In our case, considering the extreme skewness of the data and clear pattern, which transforming method to use did not really make big differences. But this is not always true. One should select these methods cautiously. Hellinger transformation is one way to reduce the weight of extremely high value.

ab.mtx.tf <- decostand(ab.mtx, "hellinger")
hist(ab.mtx.tf)

Perform NMDS

Every community ecologist should be familiar with the basic idea of NMDS, which is dimension reduction. It converts our n x p matrix into an n x n dissimilarity matrix. This steps allows flexible decision on how the distances should be calculated. A common choice while dealing with species composition is the Bray-Curtis dissimilarity, as its semi-metric approach takes both presence/absence and abundance information into account. It is also default to metaMDS().

NMDS differs a bit from other ordination methods, like PCA, in handling the distances and calculating axes and coordinates. Its non-metric nature decided that the distance on the final plot does not represents the distances between dataset, but the ranking order. It also therefore, does not try to maximize the variability in axes, which means the axes are completely arbitrary and this technique is mainly used for visualization only (Bakker 2024).

Before starting NMDS, we have to specify k, which gives the number of axes we want to have. NMDS will then take iterative process to adjust the scaling. After each iteration, it calculates a stress values showing how nice the ordination represents real data. The value ranges from 0 to 1, where 0 means a perfect fit. It however indicates danger for interpreting the ordination already from 0.2. Generally a stress value under 0.1 is considered a good result (Clarke 1993; Bakker 2024).

NMDS <- metaMDS(ab.mtx.tf, k=2, autotransform = F, trymax = 50)
Run 0 stress 0.07999079 
Run 1 stress 0.08285436 
Run 2 stress 0.07999154 
... Procrustes: rmse 0.0001703311  max resid 0.001452497 
... Similar to previous best
Run 3 stress 0.07999155 
... Procrustes: rmse 0.0001706899  max resid 0.00145164 
... Similar to previous best
Run 4 stress 0.07999154 
... Procrustes: rmse 0.0001700711  max resid 0.001453166 
... Similar to previous best
Run 5 stress 0.08285435 
Run 6 stress 0.08285435 
Run 7 stress 0.08285435 
Run 8 stress 0.08285435 
Run 9 stress 0.07999154 
... Procrustes: rmse 0.0001701321  max resid 0.00145255 
... Similar to previous best
Run 10 stress 0.07999154 
... Procrustes: rmse 0.0001702127  max resid 0.001452927 
... Similar to previous best
Run 11 stress 0.08285436 
Run 12 stress 0.08285435 
Run 13 stress 0.07999079 
... New best solution
... Procrustes: rmse 8.647632e-06  max resid 3.669006e-05 
... Similar to previous best
Run 14 stress 0.07999154 
... Procrustes: rmse 0.0001693804  max resid 0.001455979 
... Similar to previous best
Run 15 stress 0.07999079 
... New best solution
... Procrustes: rmse 3.17566e-06  max resid 1.566341e-05 
... Similar to previous best
Run 16 stress 0.07999155 
... Procrustes: rmse 0.0001701166  max resid 0.001452896 
... Similar to previous best
Run 17 stress 0.08285435 
Run 18 stress 0.07999078 
... New best solution
... Procrustes: rmse 5.149409e-06  max resid 2.555982e-05 
... Similar to previous best
Run 19 stress 0.08285435 
Run 20 stress 0.08285435 
*** Best solution repeated 1 times
NMDS

Call:
metaMDS(comm = ab.mtx.tf, k = 2, trymax = 50, autotransform = F) 

global Multidimensional Scaling using monoMDS

Data:     ab.mtx.tf 
Distance: bray 

Dimensions: 2 
Stress:     0.07999078 
Stress type 1, weak ties
Best solution was repeated 1 time in 20 tries
The best solution was from try 18 (random start)
Scaling: centring, PC rotation, halfchange scaling 
Species: expanded scores based on 'ab.mtx.tf' 

NMDS Plot using {ggplot2}

There are several plotting functions from {vegan}, but since we were trying to make a fancy poster, I brought out my old script to make fully controlled plot with {ggplot2}.

# Extract coordinates
siteCoord <- as_tibble(scores(NMDS)$sites)
spCoord <- as_tibble(scores(NMDS)$species)

# Import aesthetics setting
source(paste0(path, "analysis/AES.R"))

# Plot the species
p.NMDS.sp <- ggplot() +
    geom_point(data = spCoord, aes(x = NMDS1, y = NMDS2), size = 2, shape = 3, color = "grey", alpha = 0.25) +
    aes_ACTCP_ggplot() +
    theme(aspect.ratio = 1) +
    coord_equal()

p.NMDS.sp

As simple as that, we have our species plotted. But we are not really interested in individual species, instead, the community at different sites. For that purpose, we have to first pass the environmental data of each site into siteCooord.

# Extract environmental data, combine DCM with EPI
env <- MS[, (ab.mtx.ncol+1):ncol(MS)] %>% 
    mutate(layer = case_when(layer == "DCM" ~ "EPI", .default = layer))
siteCoord_env <- cbind(siteCoord, env)

# Order the layer 
siteCoord_env_ordered <- siteCoord_env %>% mutate(layer = factor(layer, levels = c("SRF", "EPI", "MES", "BAT")))

Then we can plot them as usual:

# Define colors for the layers
depth.colors <- c(
    "SRF" = "#6CCCD6",
    "EPI" = "#007896", 
    "MES" = "#004359",
    "BAT" = "#001C25"
)

depth.shapes <- c(
    "SRF" = 15,
    "EPI" = 16, 
    "MES" = 17,
    "BAT" = 18
)

p.NMDS.st <- ggplot() +
    geom_point(data = siteCoord_env_ordered, aes(x = NMDS1, y = NMDS2, color = layer, shape = layer), size = 3.5, alpha = 0.8) +
    aes_ACTCP_ggplot() +
    theme(aspect.ratio = 1) +
    scale_color_manual(values = depth.colors) +
    scale_shape_manual(values = depth.shapes) +
    guides(color = guide_legend("Layer"), fill = "none", shape = guide_legend("Layer"))

p.NMDS.st

Now, this is telling us more information, we can see that sample sites from the same layer tend to cluster together, where MES and BAT form a big group distinct from EPI and SRF. A common way to show such categorical data in NMDS is to draw a convex hull on those clusters.

NMDS Hullplot using {ggplot2}

The basic idea is simply use chull() to identify the points from the same cluster that are on the border of the hull and draw a polygon using those points. But the problem is that we have some outliers here, especially from MES (those which neighboring BAT and SRF). These points will surely become the one defining the convex hull, and make the hull covering much more area than the cluster really does.

To address this, I additionally calculate the euclidean centroids of each cluster and filter out outliers by comparing their distances to the centroids with the median distances to the centroids from all points within the same cluster. Those points that has a distance larger than a threshold calculated from median distance with defined multiplier will not be included.

# Calculate the coordinates of the hull border
outliner <- function(coord, group_var,  threshold_factor = 2, threshold_func = median) {
    centroid <- colMeans(coord[, c(1, 2)])
    centroids <- matrix(c(rep(centroid[1], nrow(coord)), rep(centroid[2], nrow(coord))), ncol = 2)
    
    # Euclidean distances
    coord$distances <- sqrt(rowSums((coord[, c(1, 2)] - centroids)^2))

    # Set threshold
    threshold <- threshold_factor * threshold_func(coord$distances)

    # filter out points with distance larger than threshold
    coord_in <- coord[coord$distances <= threshold, ]

    outline_indices <- chull(coord_in[, c(1, 2)])

    return(list(
        borders = coord_in[outline_indices, ], 
        centroid = c(centroid, group = group_var[[1]])
    ))
}

threshold_factor controls how conservative or inclusive the convex hull should be drawn. Then we apply this process to each category and extract the hull border coordinatees for ggplot. threshold_func defines which function to use for choosing baseline of the threshold. Default is set to median() here since mean() can skew the baseline if any extreme outlier exists.

# Using ordered factor from `siteCoord_env_ordered` can't successfully pass the layer name through the pipeline
layer.siteCoord <- siteCoord_env %>% 
    group_by(layer) %>%
    group_map(~ outliner(.x, .$layer), .keep = T) %>% 
    `names<-`(group_keys(siteCoord_env %>% group_by(layer))[[1]]) %>%
    do.call(what = rbind, args = .) %>%
    as.data.frame()

# Extract border and centroid coordinates
borders.pCoord <- layer.siteCoord[["borders"]] %>% bind_rows()
centroids.pCoord <- layer.siteCoord[["centroid"]] %>% bind_rows() %>% mutate(NMDS1 = as.numeric(NMDS1), NMDS2 = as.numeric(NMDS2))

p.NMDS.st.hull <- p.NMDS.st +
    geom_polygon(data = borders.pCoord, aes(x = NMDS1, y = NMDS2, fill = layer), alpha = 0.5) + # hull
    geom_text(data = centroids.pCoord, aes(x = NMDS1, y = NMDS2, label = group)) +
    scale_fill_manual(values = depth.colors) 

p.NMDS.st.hull

Now, each layer forms a cluster that does not have much overlap with others, except EPI and SRF, which also make sense, because even microorganism in SRF might have different interacting partners than the rest of EPI, the environmental condition and the general species compositions, in particular functionally, should be similar. This is also reflected by SRF on the right side of EPI with larger distance to MES or BAT.

The space between MES and EPI is not encapsulated by any of both, it reflects the “bridge” section, probably at which depth that both layers meet each other. There’s a numerical depth data in env, which is principally continuous data complementing our layer categories. While we use hull plot to visualize categeories in NMDS, we can use contour line to visualize continuous variables.

Fitting smooth surface

If we imagine the variables we want to display as the third dimension of our points on NMDS, visualizing the variables becomes simply visualizing a 3-D structure on our 2-D plot. One of the solution is to fit the surface on our plot using vegan::ordisurf().

The way to extract contour line from vegan::ordisurf() and the function extract.xyz() are credited to Christopher Chizinski’s post.

library(geomtextpath)
extract.xyz <- function(obj) {
    print(obj)
  xy <- expand.grid(x = obj$grid$x, y = obj$grid$y)
  xyz <- cbind(xy, c(obj$grid$z))
  names(xyz) <- c("x", "y", "z")
  return(xyz)
}

depth.surf <- ordisurf(NMDS, env$depth, permutation = 999, na.rm = T, plot = F)
depth.contour <- extract.xyz(obj = depth.surf)

Family: gaussian 
Link function: identity 

Formula:
y ~ s(x1, x2, k = 10, bs = "tp", fx = FALSE)

Estimated degrees of freedom:
7.84  total = 8.84 

REML score: 1057.237     
p.NMDS.st.hull.cont <- p.NMDS.st.hull +
    geom_textcontour(data = depth.contour, aes(x, y, z = z), color = "#c4438c") 

p.NMDS.st.hull.cont

We see those points on the MES/EPI border could be perfectly explained by the depth.

But, what we did so far was basic descriptive work in community ecology. We still want to take a look into each community structure using network analysis. I will cover this in the next upcoming post. To be continued.

Reference

Bakker, Jonathan D. 2024. Applied Multivariate Statistics in R. University of Washington. https://uw.pressbooks.pub/appliedmultivariatestatistics/.
Clarke, K. R. 1993. “Non-Parametric Multivariate Analyses of Changes in Community Structure.” Australian Journal of Ecology 18 (1): 117–43. https://doi.org/10.1111/j.1442-9993.1993.tb00438.x.
Deutschmann, Ina M., Erwan Delage, Caterina R. Giner, Marta Sebastián, Julie Poulain, Javier Arístegui, Carlos M. Duarte, et al. 2024. “Disentangling Microbial Networks Across Pelagic Zones in the Tropical and Subtropical Global Ocean.” Nat Commun 15 (1): 126. https://doi.org/10.1038/s41467-023-44550-y.

R session


R version 4.4.1 (2024-06-14) Platform: aarch64-apple-darwin20 Running under: macOS 15.4.1

Matrix products: default BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0

locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Berlin tzcode source: internal

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] furrr_0.3.1 future_1.34.0 geomtextpath_0.1.4
[4] ggpubr_0.6.0 bipartite_3.0.0.1 sna_2.8
[7] network_1.19.0 statnet.common_4.10.0 vegan_2.6-8
[10] lattice_0.22-6 permute_0.9-7 lubridate_1.9.3
[13] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
[16] purrr_1.0.4 readr_2.1.5 tidyr_1.3.1
[19] tibble_3.2.1 ggplot2_3.5.2 tidyverse_2.0.0
[22] knitr_1.48 rmarkdown_2.28 pacman_0.5.1

Packages


bipartite

Version: 3.0.0.1

Dormann C, Fruend J, Bluethgen N, Gruber B (2009). “Indices, graphs and null models: analyzing bipartite ecological networks.” The Open Ecology Journal, 2, 7-24.

Dormann C, Gruber B, Fruend J (2008). “Introducing the bipartite Package: Analysing Ecological Networks.” R News, 8(2), 8-11.

Dormann C (2011). “How to be a specialist? Quantifying specialisation in pollination networks.” Network Biology, 1(1), 1-20.

furrr

Version: 0.3.1

Vaughan D, Dancho M (2022). furrr: Apply Mapping Functions in Parallel using Futures. R package version 0.3.1, https://CRAN.R-project.org/package=furrr.

geomtextpath

Version: 0.1.4

Cameron A, van den Brand T (2024). geomtextpath: Curved Text in ‘ggplot2’. R package version 0.1.4, https://CRAN.R-project.org/package=geomtextpath.

ggpubr

Version: 0.6.0

Kassambara A (2023). ggpubr: ‘ggplot2’ Based Publication Ready Plots. R package version 0.6.0, https://CRAN.R-project.org/package=ggpubr.

knitr

Version: 1.48

Xie Y (2024). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.48, https://yihui.org/knitr/.

Xie Y (2015). Dynamic Documents with R and knitr, 2nd edition. Chapman and Hall/CRC, Boca Raton, Florida. ISBN 978-1498716963, https://yihui.org/knitr/.

Xie Y (2014). “knitr: A Comprehensive Tool for Reproducible Research in R.” In Stodden V, Leisch F, Peng RD (eds.), Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595.

pacman

Version: 0.5.1

Rinker TW, Kurkiewicz D (2018). pacman: Package Management for R. version 0.5.0, http://github.com/trinker/pacman.

rmarkdown

Version: 2.28

Allaire J, Xie Y, Dervieux C, McPherson J, Luraschi J, Ushey K, Atkins A, Wickham H, Cheng J, Chang W, Iannone R (2024). rmarkdown: Dynamic Documents for R. R package version 2.28, https://github.com/rstudio/rmarkdown.

Xie Y, Allaire J, Grolemund G (2018). R Markdown: The Definitive Guide. Chapman and Hall/CRC, Boca Raton, Florida. ISBN 9781138359338, https://bookdown.org/yihui/rmarkdown.

Xie Y, Dervieux C, Riederer E (2020). R Markdown Cookbook. Chapman and Hall/CRC, Boca Raton, Florida. ISBN 9780367563837, https://bookdown.org/yihui/rmarkdown-cookbook.

tidyverse

Version: 2.0.0

Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686.

vegan

Version: 2.6.8

Oksanen J, Simpson G, Blanchet F, Kindt R, Legendre P, Minchin P, O’Hara R, Solymos P, Stevens M, Szoecs E, Wagner H, Barbour M, Bedward M, Bolker B, Borcard D, Carvalho G, Chirico M, De Caceres M, Durand S, Evangelista H, FitzJohn R, Friendly M, Furneaux B, Hannigan G, Hill M, Lahti L, McGlinn D, Ouellette M, Ribeiro Cunha E, Smith T, Stier A, Ter Braak C, Weedon J (2024). vegan: Community Ecology Package. R package version 2.6-8, https://CRAN.R-project.org/package=vegan.

Citation

BibTeX citation:
@online{hu2025,
  author = {Hu, Zhehao},
  title = {Descriptive {Community} {Ecology} Using {NMDS}},
  date = {2025-05-16},
  url = {https://zzzhehao.github.io/biology-research/community-ecology/MediMicro-NMDS.html},
  langid = {en},
  abstract = {This is the first post of the Mediterranean Microbial
    Network series.}
}
For attribution, please cite this work as:
Hu, Zhehao. 2025. “Descriptive Community Ecology Using NMDS.” May 16, 2025. https://zzzhehao.github.io/biology-research/community-ecology/MediMicro-NMDS.html.