# Is your UMAP lying to you?

## Intro: to see is to know

Humans are very visual creatures, and the way we experience science - especially biological science - is predominantly through **visualizations**. From beautiful immunofluorescence stainings to western blot bands, going through action potential recordings, scientists *look* at data before anything else.

Of course we need to be rigorous in all quantitative aspects: using sensible distributional assumptions, fitting models properly, applying the right tests, and reporting all the right numbers - corrected p-values, effect sizes, areas under the curve, et cetera.

But *seeing data* is the first and probably most direct way - and arguably one of the most important! - to interact with it and draw or reject even the most basic initial conclusions.

**Anscombe's Quartet** is a famous example. Four series of 2D points that have the same correlation coefficient, linear regression coefficients, means, and variances, and yet are very different when plotted.

Anscombe's quartet - taken from its Wikipedia page

Had we only trusted the summary statistics, we would be tempted to conclude that these distributions are ver similar and treat them in the same way. But it's only by plotting them that we understand how different they actually are, and how maybe it does not make sense to even compare them.

The take home message? Always plot your data!

## A high dimensional headache

High dimensional -omics, such as **single cell RNA-seq**, presents a very interesting paradox.

We strive to use the best quantitative approaches to feature quantification, normalization, dimensionality reduction, clustering, trajectory inference and so forth to make sure that we are modelling biological processes in a way that is sensible, sensitive, accounts for confounding effects and removes technical variability; we want to make sure that we use the right tests (if they exist) and that our p-values follow expected distributions, and we want to make sure that our clusters reflect an underlying, possibly novel, biological heterogeneity.

But **scRNA-seq data poses a problem**: if the data exists in a highly dimensional space, where every cell has a point whose coordinates are determined by its features (genes, genomic regions, etc), *how do we represent its structure in an understandable way*? We have developed many sophisticated mathematical tools to deal with this high dimensionality: we have de-noising approaches to prioiritize and reduce dimensionality; graphs can be constructed in any number of dimensions (provided distances still retain some kind of meaning); clustering data with actual heterogeneity seems to work most of the time, and we can infer biologically meaningful pseudotemporal trajectories.

But what does the data *look like*?

Unfortunately, we mortals can only see in 3 dimensions, and represent 2 of them on our primitive computer screens (let alone on paper). A high dimensional space is even difficult to imagine!

Enter **2D embeddings**. By applying some fancy transformations, we can compress our 20,000- (or 50-) dimensional space to 2 dimensions, in such a way that relationships between cells are preserved. By relationships we actually mean something rather intuitive: if two cells are **similar** (i.e. express a similar set of genes at similar levels) they should be **close** in the 20,000D, 50D or 2D space. A good 2D embedding is able to place cells in a vicinity - in the same neighborhood - if they are similar, and possibly place them in two distant neighborhoods if they are different.

This sounds like it should be simple, *but it actually isn't*. Every time we move from a higher-dimensional space to a lower-dimensional space, we are making compromises that require us to introduce distortions.

## Is Greenland as big as Africa?

An intuitive example of this can be experienced very easily by looking at a 3D globe of planet Earth, and a 2D atlas.

Consider the 2D Google Maps projection of **Greenland**, and compare it to the same projection, further to the South East, of **Africa**. Putting them side by side will reveal something impressive: **Greenland looks as big as Africa**.

That can't be right, can it?

Let's go to the 3D Google Earth rendering, and center our view on Greendland first, and then on Africa.

Sanity is restored! *But what happened*?

Any 2D map that aims at representing ALL of the 3D globe, without loss of continuity (as far as possible), will require a projection from 3D to 2D; any projection implies a distortion. By going from 3D to 2D we compress some parts of the globe while we expand others, so that our flat representation fits a rectangle, or a square, or any other shape we can draw or print on a sheet of paper.

This is a **well known cartographic issue** and one of the reasons why so many different projections exist, each with its own distortions and limitations. But luckily, the reason why we can still use maps without getting lost or running out of gas, is that relative distances are preserved: as absurd as it is to think that Greenland is bigger than Africa (a distortion that, by the way, is greatly reduced when zooming in - i.e. as we go from global to local), the distances between landmarks and cities in Greenland and in Africa reflects their 3D distances in a rather faithful manner.

This is to say that in a 2D map of Africa, Cape Town is still closer to the Kalahari Desert than it is to Marrakech, and in the Greenland counterpart Nuuk is closer to Attu than it is to the Denmarkshavn weather station. Things still make sense, and maps still serve their purpose. Mathematically, there is a certain guarantee on how much a map (intended as a lower dimensional representation of a set of points coming from a linear transformation of the original set) can make sense, given by the Johnson-Lindenstrauss lemma [1]. Interestingly, the limitations of the projection are not given only by how much we are shrinking our dimensionality (e.g. from 50 to 2) but also for how many points we are doing it.

## Beautiful lies

Why should we care about globes and maps for single cell data?

Most important operations on single cell data are performed in what some call the **"transcriptional space"**, i.e. a lower dimensional embedding obtained by reducing the original dimensions of the data to a more tractable, yet still highly dimensional space. Cells whose positions in space were given by 10,000-dimensional vectors are now embedded within a 50-dimensional space thanks to a linear transformation such as Principal Component Analysis (PCA). It is difficult to overstate how important this "place" is: cells are clustered in PCA space; batch effects are removed in PCA space, trajectories are calculated in PCA space, and so forth. However, since PCA space is still 20-, 30-, 50-dimensional, it is still un-representable.

For this reason, single cell RNA-sequencing results are often visualized through an additional transformation of high-dimensional data (think the PCA embedding) in 2D non-linear embeddings such as t-SNE or UMAP. These create beautiful representations where, somehow, cells that belong to the same cluster - which was identified in PCA space - also appear close together in 2D.

*Since clusters mostly match*, our visuals-hungry brain goes, *we can expect the 2D space to match as well in many other regards*. Clusters that are close in 2D may be more similar; samples that are well mixed in 2D may be globally overlapping; trajectories that go through certain parts of the plot and not others show biological cell fate choices. We look at our UMAPs and t-SNEs like they hold all the answers, using them to perform visual (although not only visual!) inferences on them.

This small cluster is detaching from the big one - *is this a disease phenotype*? This trajectory goes through this part of the 2D embedding - *is this a transient cell state that is shared between lineages*? Knowing what we know of distortions and distances, we would like to be more careful about over-interpreting these distances (see Tara Chari and Lior Pacther's preprint for a complete discussion on the subject); however, it is almost impossible not to do it.

How do we know wheter our UMAP is representing faithfully the structure of the data in PCA, or rather, **how do we now how much it is not**?

The `sleepwalk`

package by Svetlana Ovchinnikova and Simon Anders tries to give a very visual and compact solution to this issue. By supplying two different embeddings (such as PCA and UMAP), you can hover on a cell in the 2D UMAP space and see at what distance every other cell is in PCA space. This is a great diagnostic tool to link two different embeddings and check how much you can trust your UMAP: if points that are far apart in the 2D embedding appear close in PCA space (judging by their colour), we have to take embeddings around that cell with a grain of salt.

However, `sleepwalk`

is supposed to be an interactive solution which allows you to check every cell one by one.

Can we try to engineer a solution that gives as an idea of the distortion of distances for **the whole UMAP embedding**, rather than one by one?

## Getting to work

We will code this solution in R, as it can be very easily implemented. We will use a few BioConductor packages, which are already very well optimized for handling single cell data, and my very own `cellula`

package, which makes some of these steps easy to run.

We first load the necessary packages:

```
library(BiocNeighbors)
library(BiocParallel)
library(scRNAseq)
library(sleepwalk)
library(cellula)
library(parallelDist)
```

And we download a pancreas single cell RNA-seq dataset (from Muraro et al. 2016) using the `scRNAseq`

package, which hosts several published single cell RNA-seq datasets as `SingleCellExperiment`

objects:

`sce = MuraroPancreasData(ensembl = TRUE, location = FALSE)`

Now we do a quick processing with `cellula`

including batch effect correction (at the `donor`

level) using the `Harmony`

method:

`sce = cellula(sce, batch = "donor", name = "murarotest", integration_method = "Harmony")`

The resulting `SingleCellExperiment`

object will contain 4 reduced dimension embeddings: a `PCA`

and `UMAP`

(both before batch effect correction) and `PCA_Harmony`

and `UMAP_Harmony`

(after batch effect correction).

We take advantage of the `findKmknn()`

function from the highly efficient `BiocNeighbors`

package to find the k nearest neighbors in corrected PCA space for every cell, together with their distance. We use k = 10:

```
# Taking corrected PCA as reference, and UMAP as space we are testing
reference = reducedDim(sce, "PCA_Harmony")[,1:20]
tested = reducedDim(sce, "UMAP_Harmony")[,1:2]
pca_dists = findKmknn(X = reference, k = 10, get.distance = TRUE)
umap_dists = findKmknn(X = tested, k = 10, get.distance = TRUE)
```

At this point we can look at the Jaccard index for every cell, i.e. the overlap between every KNN neighborhood in PCA and its corresponding in UMAP. The Jaccard index is calculated as the number of elements in common (i.e. belonging to the same neighborhood in both PCA and UMAP) over the number of elements total, and is bound between 0 (no elements in common) and 1 (all elements in common). This is not a new idea: it has been described in this preprint by Shamus Cooley and colleagues, and I will use a very barebones reimplementation.

```
# First test: calculate Jaccard index between NN in PCA and NN in UMAP
jaccard_in_ref = lapply(seq_len(nrow(reference)), function(x) {
a = pca_dists$index[x,]
b = umap_dists$index[x,]
length(intersect(a,b))/length(union(a,b)) #Jaccard index
}) |> unlist()
sce$jac_dist = jaccard_in_ref
colorpal = colorspace::sequential_hcl(palette = "Purple-Yellow", n = 25)
plot_UMAP(sce, "UMAP_Harmony", color_by = "jac_dist", color_palette = colorpal)
```

We can also look at the global distribution of Jaccard index per cell on a histogram:

`hist(sce$jac_dist, main = "Jaccard index (k = 10)", xlab = "Jaccard index")`

A low Jaccard index in this case means that the cells that are the 10 nearest neighbors in UMAP are not the cells that are the 10th nearest neighbors in PCA. Cells with a low Jaccard index are cells for which there is a large amount of local distortion, since the distances don't match.

However, this is a rather strict measure and presents some limitations. If the nearest neighbours in UMAP were close, but not within the first 10, the Jaccard index would be the same (i.e. 0) as if they had been much farther away; the number of values is limited by the neighborhood size; moreover, this score gets better as the size of neighborhood increases, since we are more permissive:

```
# Try again with k = 30
pca_dists2 = findKmknn(X = reference, k = 30, get.distance = TRUE)
umap_dists2 = findKmknn(X = tested, k = 30, get.distance = TRUE)
jaccard_in_ref2 = lapply(seq_len(nrow(reference)), function(x) {
a = pca_dists2$index[x,]
b = umap_dists2$index[x,]
length(intersect(a,b))/length(union(a,b)) #Jaccard index
}) |> unlist()
sce$jac_dist2 = jaccard_in_ref2
plot_UMAP(sce, "UMAP_Harmony", color_by = "jac_dist2", color_palette = colorpal)
```

`hist(sce$jac_dist2, main = "Jaccard index (k = 30)", xlab = "Jaccard index")`

What we can see however is that cells on the "edge" of each cluster tend to be more properly placed than cells within the "core", since they have a higher correspondence between neighborhoods.

There is a more resource-intensive but possibly more robust way to quantify the distortion.

We start by calculating the whole distance matrix for the PCA space. This is normally a computationally intensive procedure, and something we tend to avoid for large datasets such as most scRNA-seq ones. We can take advantage of parallel processing in our computers using the `parallelDist`

package.

By taking a row of the distance matrix, we get the distances from a cell (row) to every other cell in PCA space. We rank these distances, and see how this rank compares to the rank of the 30 nearest neighbors (i.e. a numeric vector going from 1 to 30). The comparison is obtained simply by subracting the PCA distance rank from 1:30, and calculating the median, then taking the absolute value.

If the **AMRD (absolute median rank difference)** is close to 0, it means that most of the ranks mostly correspond: we subtracted 1 from 1, 2 from 2, 3 from 3 etc. If it is larger than 0, it means that we have been subtracting larger ranks from the original ones: we subtracted 100 from 1, 400 from 2, 2400 from 3 etc.

Finally, we add 1 to the rank calculation (to avoid 0's) and we log-transform the values to have a smoother color scale.

```
# Distance matrix
dmat = as.matrix(parallelDist::parDist(reference, threads = 9))
# Parallelized loop
ranks_in_ref = bplapply(seq_len(nrow(reference)), function(x) {
dists_ref = dmat[x,]
rdo = seq_len(30) # we keep 30 NN
rdf = rank(dists_ref)
ranks = matrix(c(rdo, rdf[umap_dists2$index[x,]]), nrow = 30)
abs(median(ranks[,1] - ranks[,2])) + 1
}, BPPARAM = MulticoreParam(workers = 9, progressbar = TRUE)
) |> unlist()
sce$rank_dist = log10(ranks_in_ref)
plot_UMAP(sce, "UMAP_Harmony", color_by = "rank_dist", color_palette = colorpal)
```

`hist(sce$rank_dist, main = "Absolute Median Rank Difference (k = 30)", xlab = "AMRD")`

This result is similar to the Jaccard index from before, but it allows to quantify **more precisely** the size of the distortion.

An AMRD value of 100 (or 2 if in log10 scale) means that the 10 nearest neighbours on UMAP for that cell are within - on average - the 100 nearest neighbours on PCA. Note that we are still making a big simplification by taking the median, as there is no guarantee that the other nearest neighbour ranks are tightly distributed around 100. However, the AMRD has still some advantages over the Jaccard index.

A Jaccard index is bound between 0 and 1 and can only have a limited number of possible values (i.e. 31 values between 0/60 and 30/30 in the case of 30 NNs). The AMRD can take any number of values between 0 and (number of cells - number of neighbours). Moreover, while the contribution to the Jaccard index is binary - a cell either is or is not in the neighborhood - the AMRD can have different values for the same values of Jaccard index. The downside is that for large datasets AMRD is very computationally intensive for matrix calculation and ranking of the distances.

## But wait, does it work?

It is always good to doubt yourself, especially when doing science. Are we sure that this experiment is performing properly? If we were in a lab we would run a **positive control**, and the good news is we can do that here too.

As a positive control we will run another UMAP on the same data, and look at Jaccard index and AMRD values between the two UMAPs rather than between an UMAP and a PCA. UMAP embeddings have a random component, so that no two UMAPs are the same (unless a random seed is specified). However, we expect the algorithm to work in a reproducible way and give us embeddings that have no distortions between them, i.e. **distances are preserved, Jaccard indices are high for all cells and AMRD values are low for all cells**.

We start with looking at the Jaccard indices:

```
# Recalculate UMAP with same parameters as `cellula`
u2 = uwot::umap(reference, n_neighbors = sqrt(ncol(sce)), min_dist = 0.7)
u2_dists = findKmknn(X = u2, k = 10, get.distance = TRUE)
umap_dists = findKmknn(X = tested, k = 10, get.distance = TRUE)
jaccard_in_ref_ctrl = lapply(seq_len(nrow(u2)), function(x) {
a = u2_dists$index[x,]
b = umap_dists$index[x,]
length(intersect(a,b))/length(union(a,b)) #Jaccard index
}) |> unlist()
sce$jac_dist_ctrl = jaccard_in_ref_ctrl
plot_UMAP(sce, "UMAP_Harmony", color_by = "jac_dist_ctrl", color_palette = colorpal)
```

We also compare the two histograms:

```
hist(sce$jac_dist_ctrl, main = "Jaccard index (k = 10) - control", xlab = "Jaccard index", col = rgb(0,0,0,0.3), breaks = 10, ylim = c(0,1000))
hist(sce$jac_dist,add = TRUE, col = rgb(1,0,0,0.3), breaks = 10)
legend("topright", bty = "n", pch = 15, col = c(rgb(0,0,0,0.3), rgb(1,0,0,0.3)), legend = c("control", "observed"))
```

As expected, the indices are much higher - but there is still some part of the dataset that seems to be more randomly distributed!

Now we look at AMRD:

```
dmat_ctrl = as.matrix(parallelDist::parDist(u2, threads = 9))
ranks_in_ref_ctrl = bplapply(seq_len(nrow(u2)), function(x) {
dists_ref = dmat_ctrl[x,]
rdo = seq_len(10)
rdf = rank(dists_ref)
ranks = matrix(c(rdo, rdf[umap_dists2$index[x,]]), nrow = 10)
abs(median(ranks[,1] - ranks[,2]))+1
}, BPPARAM = MulticoreParam(workers = 9, progressbar = TRUE)
) |> unlist()
sce$rank_dist_ctrl = log10(ranks_in_ref_ctrl)
colorpal = colorspace::sequential_hcl(palette = "Purple-Yellow", n = 25, rev = TRUE)
plot_UMAP(sce, "UMAP_Harmony", color_by = "rank_dist_ctrl", color_palette = colorpal)
```

```
hist(sce$rank_dist_ctrl, main = "Absolute Median Rank Difference (k = 10) - control", xlab = "log10(AMRD)", col = rgb(0,0,0,0.3))
hist(sce$rank_dist,add = TRUE, col = rgb(1,0,0,0.3))
legend("topright", bty = "n", pch = 15, col = c(rgb(0,0,0,0.3), rgb(1,0,0,0.3)), legend = c("control", "observed"))
```

Once again we have a very good correspondence (low AMRD) for almost all the data. As you can see, the AMRD is also less stringent than the Jaccard index in penalizing the neighborhood.

We can say, for now, that these measurements make sense as they behave in a much more predictable way when running a positive control, and we have confirmed the initial suspicion that looking at Jaccard indices was too strict a metric.

## Conclusions

We have seen how dimensionality reductions are distorted, and how this distortion comes about. We have also seen how to code a simple distortion metric in R.

What should we make of all this? The honest yet disappointing answer is: *I don't really know*.

Knowing which parts of a UMAP are less trustworthy than others is still very much a visual exercise, and it is difficult to incorporate this information in the biological interpretation of data. For sure, one has to 1. look at data from several different angles, including the visual one and 2. validate, validate, validate. No biological result exists in a vacuum, and it is important that we connect observations to one another to make sense of all this work.

This post was inspired by a complex debate ongoing on Twitter since a long time, which you can follow in its inception here. Lior Pachter's thread(s) include lots of theory, including the mention of the Johnson-Lindenstrauss lemma, and map distortion. There are a few important things to learn from that debate, but unfortunately there are still some questions that, to me, are left unanswered: are all UMAPs wrong? Are some more useful than others? Is there no way to craft a 2D embedding that does not lie? Or should we outright dismiss them all? What should computational biologists and other users of analysis tools do?

## Footnotes

1 The Johnson-Lindenstrauss lemma states that given a set of n points in an Euclidean space, it is possible to find a lower dimensional representation in k dimensions such that k ≥ O((log n)/(epsilon\^2)) where the distances between any two points are not distorted more than a factor of 1 ± epsilon. So if we consider a dataset with 1000 points, and a 10% error on the distances (epsilon = 0.1), we see that we need at least 690 dimensions. For an average dataset with 30,000 cells the upper bound of a 10% error is guaranteed with at least 1030 dimensions. What happens if we further decrease our dimensionality to 2? The error is no longer guaranteed, or rather we can esimate that the epsilon for k = 2 and n = 30,000 would be 2.27, i.e. 2D distances would be distorted up to a factor of of 1 ± 2.27.