Simple guide on performing TSNE

Multivariate Analysis

Author: Zhehao Hu

Affiliation: Department of Biology, University of Hamburg

Published: March 27, 2026

Abstract

Simple guide on performing TSNE

Env setup

# use install.packages() if not already installed.
library(tidyverse)
library(vegan)
library(Rtsne)
library(paletteer)
library(cli)

path <- "proteomics"

Data preparation

Data import:

The peak matrix should be binned but not yet transformed. If it has already been Hellinger-transformed, skip the decostand() step below.

peakMatrix <- readRDS(file.path(path, "assets/tsne-1/peakmatrix_pc.rds"))
metadataSample <- readRDS(file.path(path, "assets/tsne-1/sample_metadata.rds"))

Hellinger transform

peakMatrix.hellinger <- decostand(peakMatrix, "hellinger")
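To make the transform less of a black box: the Hellinger transform divides every value by its row (sample) total and takes the square root. The sketch below reproduces this by hand in base R on a tiny invented matrix (the values and the names x and hellinger_manual are illustrative only); the result should agree numerically with decostand(x, "hellinger").

```r
# Toy intensity matrix: 2 samples (rows) x 3 peak bins (columns); values invented
x <- matrix(
    c(4, 0, 12,
      1, 1, 2),
    nrow = 2, byrow = TRUE,
    dimnames = list(c("s1", "s2"), c("bin1", "bin2", "bin3"))
)

# Hellinger transform by hand: square root of each value divided by its row total
# (the row totals are recycled down the columns, so row i is divided by total i)
hellinger_manual <- sqrt(x / rowSums(x))

# After the transform, the squared values of every row sum to 1
row_norms <- rowSums(hellinger_manual^2)
```

Because every row ends up with the same squared length, samples with high overall intensity no longer dominate the distances that TSNE works on.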

Importantly, the row names of the peak matrix must correspond to a key column in the metadata (reference table), so that each row of the peak matrix can be matched to its metadata later. Here the metadata data frame has the column sampleName, and I check the row names against it:

str(metadataSample) # structure of the reference table

'data.frame':   93 obs. of  10 variables:
 $ sampleName     : chr  "VPSM001_1" "VPSM002_1" "VPSM003_1" "VPSM003_2" ...
 $ name           : chr  "VPSM001_1.A1" "VPSM002_1.B1" "VPSM003_1.C1" "VPSM003_2.B9" ...
 $ ID.maldi       : chr  "VPSM001" "VPSM002" "VPSM003" "VPSM003" ...
 $ ID.DZMB2HH     : chr  "3641" "3808" "3823" "3823" ...
 $ station        : chr  "61" "81" "81" "81" ...
 $ gensp_morpho_ZH: chr  "Haploniscus unicornis complex" "Haploniscus unicornis complex" "Haploniscus charcoti" "Haploniscus charcoti" ...
 $ sex_ZH         : chr  "female" "female" NA NA ...
 $ stage_ZH       : chr  NA NA NA NA ...
 $ voucher_valid  : chr  "VPS001" "VPS002" "VPS003" "VPS003" ...
 $ label          : chr  "VPSM001_1.A1_Haploniscus_unicornis_complex_female_3641_84" "VPSM002_1.B1_Haploniscus_unicornis_complex_female_3808_54" "VPSM003_1.C1_Haploniscus_charcoti_NA_3823_55" "VPSM003_2.B9_Haploniscus_charcoti_NA_3823_63" ...

# extract rownames from the matrix
sampleName <- rownames(peakMatrix.hellinger)
head(sampleName)

[1] "VPSM001_1" "VPSM002_1" "VPSM003_1" "VPSM003_2" "VPSM004_1" "VPSM005_1"

# check that all rownames have a match in the key column of the data frame
all(sampleName %in% metadataSample$sampleName)

[1] TRUE
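If the check returns FALSE, it helps to see which names fail. The sketch below, on small invented inputs (peak_toy and meta_toy are hypothetical, not the real data), lists the row names without a metadata entry and also looks for duplicated keys, since a duplicated key would multiply rows in a later join.

```r
# Hypothetical toy inputs (not the real data set)
peak_toy <- matrix(
    1:6, nrow = 3,
    dimnames = list(c("S1", "S2", "S9"), c("mz100", "mz200"))
)
meta_toy <- data.frame(
    sampleName = c("S1", "S2", "S3", "S3"),
    species = c("a", "b", "c", "c")
)

# Row names of the matrix that have no entry in the metadata key column
missing_in_meta <- setdiff(rownames(peak_toy), meta_toy$sampleName)

# Keys that occur more than once in the metadata
dup_keys <- unique(meta_toy$sampleName[duplicated(meta_toy$sampleName)])

missing_in_meta  # "S9"
dup_keys         # "S3"
```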

Define TSNE

We define a function that performs TSNE and plots it out, so the process can be scaled up later:

tsne <- function(
    perplexity,
    matrix,
    metadata,
    key_col,
    color_by,
    dims = 2,
    max_iter = 5000,
    seed = NULL
) {
    # fix the random initialisation if a seed is given
    if (!is.null(seed)) {set.seed(seed)}

    # theta = 0 runs exact TSNE (no Barnes-Hut approximation)
    tsne.res <- Rtsne(X = matrix, dims = dims, theta = 0.0, max_iter = max_iter, perplexity = perplexity)

    # extract tsne coordinates; axes are named X, Y, Z, A, B, ...
    colnames(tsne.res$Y) <- LETTERS[c(24:26, 1:23)][seq_len(ncol(tsne.res$Y))]
    tsne.coords <- as_tibble(tsne.res$Y)

    # glue tsne result with metadata
    # the keys are the row names of the input matrix (Rtsne preserves row order)
    tsne.matrix <- tsne.coords %>%
        bind_cols(tibble(!!key_col := rownames(matrix))) %>%
        left_join(metadata, by = key_col)

    # exit if not 2-dim TSNE
    if (dims != 2) {
        cli::cli_alert_danger("Plot option is only for 2-dimensional TSNE. Returning matrix only.")
        return(list(matrix = tsne.matrix, plot = NULL))
    }

    # Plot tsne
    tsne.plot <- ggplot() +
        geom_point(
            data = tsne.matrix,
            aes(
                x = X,
                y = Y,
                color = .data[[color_by]]),
            size = 3) +
        labs(title = paste0("t-SNE, perplexity = ", perplexity), x = "TSNE 1", y = "TSNE 2") +
        theme_minimal() +
        theme(
            aspect.ratio = 1,
            plot.title = element_text(size = 12, face = "bold", family = "Times New Roman", hjust = 0.5),
            axis.ticks.length = unit(-0.05, "in"),
            axis.ticks = element_blank(),
            plot.background = element_blank(),
            legend.background = element_blank(),
            panel.background = element_rect(color = NULL, fill = "white"),
            panel.border = element_rect(color = "black", fill = "transparent"),
            panel.grid = element_line(linewidth = 0.3),
            # render legend labels as markdown; namespaced because {ggtext} is not attached yet
            legend.text = ggtext::element_markdown()
        )

    return(list(matrix = tsne.matrix, plot = tsne.plot))
}

This function takes arguments:

perplexity: numeric. Hyperparameter of TSNE. Must satisfy 3 * perplexity < nrow(matrix) - 1. Also see ?Rtsne::Rtsne.
matrix: matrix. Peak matrix with the sample names as row names.
metadata: data frame or tibble. Reference table; must contain one column against which the row names of the matrix can be matched.
key_col: character. Name of the metadata column against which the row names of the matrix are matched.
color_by: character. Name of the metadata column by which the dots in the TSNE plot are colored.
dims: numeric. Number of dimensions of the TSNE. Defaults to 2. If not 2, only the TSNE result matrix is returned. A 3-dimensional TSNE can be visualized with {plotly}.
max_iter: numeric. Maximum number of iterations. Defaults to 5000.
seed: numeric. Seed for reproducibility. Defaults to NULL (no seed is set).

and returns a list:

[["matrix"]]: TSNE coordinates joined with the metadata.
[["plot"]]: TSNE visualization with ggplot (2-dim TSNE only, otherwise NULL).
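For a 3-dimensional run the function returns only the coordinate table, with the axes named X, Y and Z. Below is a minimal sketch of viewing such a table with {plotly}, as mentioned for the dims argument. The coordinates and the species column are invented stand-ins for real output, and the plot is only built if {plotly} happens to be installed.

```r
# Toy stand-in for the [["matrix"]] element of a 3-dimensional run; values are made up
coords3d <- data.frame(
    X = c(-1.2, 0.3, 2.1, 1.8),
    Y = c(0.5, -0.7, 1.1, 0.9),
    Z = c(2.0, 0.1, -1.5, -1.1),
    species = c("sp_a", "sp_a", "sp_b", "sp_b")
)

# Build an interactive 3D scatter plot only when {plotly} is available
if (requireNamespace("plotly", quietly = TRUE)) {
    p3d <- plotly::plot_ly(
        data = coords3d,
        x = ~X, y = ~Y, z = ~Z,
        color = ~species,
        type = "scatter3d",
        mode = "markers"
    )
    # printing p3d opens the interactive viewer
}
```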

Perform TSNE

As you can see, TSNE has a hyperparameter, perplexity, whose optimal value cannot be determined beforehand. There is an empirical rule for the range to consider, but the best value varies from dataset to dataset. Following the constraint documented in the {Rtsne} package, we calculate the upper limit for this dataset:

perplexity.max <- floor((nrow(peakMatrix)-1)/3)
perplexity.max

[1] 28

Then we run a series of TSNEs with perplexities from 5 up to the limit of 28 in steps of 5, that is c(5, 10, 15, 20, 25), or, more programmatically:

perplexity.step <- 5
perplexity <- perplexity.step*seq(perplexity.max%/%perplexity.step)
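The same grid can also be written with seq() directly, which may be easier to read. A small sketch, hard-coding the limit of 28 found for this dataset (perplexity_grid is a name chosen here for illustration):

```r
perplexity.max <- 28    # upper limit computed for this dataset
perplexity.step <- 5

# multiples of the step, up to the largest one that does not exceed the limit
perplexity_grid <- seq(perplexity.step, perplexity.max, by = perplexity.step)
perplexity_grid
# [1]  5 10 15 20 25

# same values as the integer-division version
from_division <- perplexity.step * seq(perplexity.max %/% perplexity.step)
```

Both versions assume that the limit is at least as large as the step; for very small datasets the grid would have to be chosen by hand.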

Then we perform TSNE for each perplexity:

tsne.list <- map( 
    perplexity, 
    ~tsne( # purrr-style anonymous function
        perplexity = .x, 
        matrix = peakMatrix, 
        metadata = metadataSample, 
        key_col = "sampleName", 
        color_by = "gensp_morpho_ZH", 
        seed = 1
    )
)

map(tsne.list, ~.x[["plot"]]) %>% walk(~print(.x))
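To compare the embeddings side by side instead of flipping through separate plots, the coordinate tables can be stacked into one long table with a perplexity column. The sketch below does this in base R on a small invented stand-in for tsne.list (tsne_list_toy and its coordinates are made up, not real TSNE output).

```r
# Toy stand-in for tsne.list: one list(matrix = ..., plot = ...) per perplexity
perplexity_toy <- c(5, 10)
tsne_list_toy <- lapply(perplexity_toy, function(p) {
    list(
        matrix = data.frame(
            X = c(0, 1, 2),
            Y = c(2, 1, 0),
            sampleName = c("s1", "s2", "s3")
        ),
        plot = NULL
    )
})

# Tag each coordinate table with its perplexity, then stack the tables
tsne_long <- do.call(rbind, Map(
    function(res, p) cbind(res[["matrix"]], perplexity = p),
    tsne_list_toy, perplexity_toy
))

# With {ggplot2} loaded, the stacked table can then be faceted, for example:
# ggplot(tsne_long, aes(X, Y)) + geom_point() + facet_wrap(~perplexity)
```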

The patterns at perplexity 20 and 25 looked the same, and 15 hardly differed from them either. I would avoid 5 as it is very low, so as a balanced choice I take 10 for this dataset.

For more about how TSNE behaves and what perplexity means for it, see this article.

Visualization

Now we pick the plot and make some adjustments for publication:

tsne.res <- tsne(
    perplexity = 10, 
    matrix = peakMatrix, 
    metadata = metadataSample, 
    key_col = "sampleName", 
    color_by = "gensp_morpho_ZH", 
    seed = 1
)
tsne.plot <- tsne.res[["plot"]]

If you print tsne.plot at this step, you will notice that the plot is identical to the one we saw earlier for perplexity 10, because we set the seed to the same value and thereby fixed the random process. Without a fixed seed, the embedding would differ slightly from run to run, and so would the placement of the dots.
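TSNE starts from a random initial layout, which is why the seed matters. The effect of set.seed() can be illustrated with plain random draws, independent of {Rtsne}; the variable names below are chosen for illustration only.

```r
set.seed(1)
first_draw <- rnorm(3)

set.seed(1)              # same seed again
second_draw <- rnorm(3)

third_draw <- rnorm(3)   # no reseeding: the random number generator has moved on

identical(first_draw, second_draw)  # TRUE
identical(first_draw, third_draw)   # FALSE
```

This is also why the tsne() function sets the seed immediately before calling Rtsne(): every call with the same seed and the same input starts from the same random state.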

Then we can adjust the plot. The following is just one example:

tsne.plot <- tsne.plot + 
    labs(
        title = "t-SNE",
        subtitle = "",
        color = "Species"
    )

tsne.plot

You can also look for the best color palette for your dataset here and change the color palette by adding a function from the scale_color_*() family of {ggplot2} and {paletteer}:

tsne.plot <- tsne.plot + 
    scale_color_paletteer_d("ggthemes::Miller_Stone")

The plot can be exported with ggsave(); adjust the parameters as needed. Writing an SVG file requires the {svglite} package.

ggsave("tsne_plot.svg", tsne.plot, width = 10, height = 8)

It is possible to italicize the species names in the legend (requires the package {ggtext}), but the formatting step has to be done before running TSNE:

library(ggtext)

# format species name in reference table
metadataSample.format <- metadataSample %>% 
    mutate(gensp_morpho_ZH = sprintf("*%s*", gensp_morpho_ZH))

# perform tsne
tsne.res <- tsne(10, peakMatrix, metadataSample.format, "sampleName", "gensp_morpho_ZH", seed = 1)

# extract tsne plot
tsne.plot <- tsne.res[["plot"]]

# format tsne plot
tsne.plot + 
    labs(
        title = "t-SNE",
        subtitle = "",
        color = "Species"
    ) +
    theme(
        legend.text = element_markdown() # treat legend text as markdown text
    )

As you can see, this simply wraps each species name in the reference table in single asterisks (*), the standard markdown syntax for italics, and lets ggplot render the legend text as markdown. It is therefore also possible to exclude open nomenclature (such as “sp.”, “cf.”, etc.) from italicizing, but doing so properly requires a whole set of functions for taxon name matching, so it is not included here.
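Short of full taxon name matching, a rough approximation is to split each name into words and italicize every word except a fixed list of qualifiers. The sketch below is only that: italicize_taxon() and its default qualifier list are hypothetical helpers written for this illustration, and they will not handle authorships, subgenera or other special cases.

```r
# Wrap every word of a taxon name in *...* except the listed qualifiers
italicize_taxon <- function(x, upright = c("sp.", "spp.", "cf.", "aff.", "complex")) {
    vapply(x, function(name) {
        if (is.na(name)) return(NA_character_)   # keep missing names missing
        tokens <- strsplit(name, " ", fixed = TRUE)[[1]]
        is_upright <- tokens %in% upright
        tokens[!is_upright] <- paste0("*", tokens[!is_upright], "*")
        paste(tokens, collapse = " ")
    }, character(1), USE.NAMES = FALSE)
}

italicize_taxon("Haploniscus unicornis complex")
# [1] "*Haploniscus* *unicornis* complex"
italicize_taxon("Haploniscus cf. charcoti")
# [1] "*Haploniscus* cf. *charcoti*"
```

The function could replace the sprintf() call inside mutate() above; unlike sprintf(), it leaves missing species names as NA instead of turning them into text.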

Reference

R session and R packages

R session

R version 4.5.0 (2025-04-11)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.7.4

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
 [1] ggtext_0.1.2 knitr_1.51 rmarkdown_2.30 pacman_0.5.1
 [5] zWeb_0.0.1 cli_3.6.5 paletteer_1.6.0 Rtsne_0.17
 [9] vegan_2.6-8 lattice_0.22-6 permute_0.9-7 lubridate_1.9.5
[13] forcats_1.0.0 stringr_1.6.0 dplyr_1.2.0 purrr_1.2.1
[17] readr_2.1.5 tidyr_1.3.2 tibble_3.3.1 ggplot2_4.0.2
[21] tidyverse_2.0.0

Packages

cli

Version: 3.6.5

Csárdi G (2025). cli: Helpers for Developing Command Line Interfaces. doi:10.32614/CRAN.package.cli, R package version 3.6.5, https://CRAN.R-project.org/package=cli.

ggtext

Version: 0.1.2

Wilke C, Wiernik B (2022). ggtext: Improved Text Rendering Support for ‘ggplot2’. doi:10.32614/CRAN.package.ggtext, R package version 0.1.2, https://CRAN.R-project.org/package=ggtext.

knitr

Version: 1.51

Xie Y (2025). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.51, https://yihui.org/knitr/.

Xie Y (2015). Dynamic Documents with R and knitr, 2nd edition. Chapman and Hall/CRC, Boca Raton, Florida. ISBN 978-1498716963, https://yihui.org/knitr/.

Xie Y (2014). “knitr: A Comprehensive Tool for Reproducible Research in R.” In Stodden V, Leisch F, Peng RD (eds.), Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595.

pacman

Version: 0.5.1

Rinker TW, Kurkiewicz D (2018). pacman: Package Management for R. version 0.5.0, http://github.com/trinker/pacman.

paletteer

Version: 1.6.0

Hvitfeldt E (2021). paletteer: Comprehensive Collection of Color Palettes. R package version 1.3.0, https://github.com/EmilHvitfeldt/paletteer.

rmarkdown

Version: 2.30

Allaire J, Xie Y, Dervieux C, McPherson J, Luraschi J, Ushey K, Atkins A, Wickham H, Cheng J, Chang W, Iannone R (2025). rmarkdown: Dynamic Documents for R. R package version 2.30, https://github.com/rstudio/rmarkdown.

Xie Y, Allaire J, Grolemund G (2018). R Markdown: The Definitive Guide. Chapman and Hall/CRC, Boca Raton, Florida. ISBN 9781138359338, https://bookdown.org/yihui/rmarkdown.

Xie Y, Dervieux C, Riederer E (2020). R Markdown Cookbook. Chapman and Hall/CRC, Boca Raton, Florida. ISBN 9780367563837, https://bookdown.org/yihui/rmarkdown-cookbook.

Rtsne

Version: 0.17

Krijthe JH (2015). Rtsne: T-Distributed Stochastic Neighbor Embedding using Barnes-Hut Implementation. R package version 0.17, https://github.com/jkrijthe/Rtsne.

van der Maaten L, Hinton G (2008). “Visualizing High-Dimensional Data Using t-SNE.” Journal of Machine Learning Research, 9, 2579-2605.

van der Maaten L (2014). “Accelerating t-SNE using Tree-Based Algorithms.” Journal of Machine Learning Research, 15, 3221-3245.

tidyverse

Version: 2.0.0

Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686.

vegan

Version: 2.6.8

Oksanen J, Simpson G, Blanchet F, Kindt R, Legendre P, Minchin P, O’Hara R, Solymos P, Stevens M, Szoecs E, Wagner H, Barbour M, Bedward M, Bolker B, Borcard D, Carvalho G, Chirico M, De Caceres M, Durand S, Evangelista H, FitzJohn R, Friendly M, Furneaux B, Hannigan G, Hill M, Lahti L, McGlinn D, Ouellette M, Ribeiro Cunha E, Smith T, Stier A, Ter Braak C, Weedon J (2024). vegan: Community Ecology Package. doi:10.32614/CRAN.package.vegan, R package version 2.6-8, https://CRAN.R-project.org/package=vegan.

Citation

BibTeX citation:

@online{hu2026,
  author = {Hu, Zhehao},
  title = {Simple Guide on Performing {TSNE}},
  date = {2026-03-27},
  url = {https://zzzhehao.github.io/post/research/techs/proteomics_tsne.html},
  langid = {en},
  abstract = {Simple guide on performing TSNE}
}

For attribution, please cite this work as:

Hu, Zhehao. 2026. “Simple Guide on Performing TSNE.” March 27, 2026. https://zzzhehao.github.io/post/research/techs/proteomics_tsne.html.