resented_ape

resented_ape t1_iybprpj wrote

FWIW, I don't think that is what the paper demonstrates. The Picasso method the authors introduce uses a totally different cost function based on distance reconstruction. For a specific set of metrics the authors are interested in, they say that Picasso produces results comparable with UMAP and t-SNE. But it's not the UMAP or t-SNE objective.

With the scRNAseq dataset in the python notebook at the github page for picasso, I found that for the metrics one might usually be interested in with UMAP and t-SNE, e.g. neighborhood preservation (what proportion of the k-nearest neighbors in the input and output space are preserved) or Spearman rank correlation (or triplet ordering preservation) of input vs output distances, Picasso did (quite a bit) worse than UMAP and t-SNE.

This might not be relevant to for downstream scRNAseq workflows -- I will take the authors' word on that. At any rate, on my machine Picasso runs very slowly, and I found its own output to be visually unsatisfactory with the default settings with other datasets that I tried it with (e.g. MNIST), so I have been unable to generate a similar analysis for a wide range of datasets. So take that for what it's worth.

1

resented_ape t1_iy3xso2 wrote

The t-SNE paper came out in 2008, so things are not moving as quickly as OP suggests IMO.

For the t-SNE/UMAP family of algorithms, which can be boiled down to attempting to embed the k-nearest neighbors graph, Dmitry Kobak has published several papers of interest. I would point you to Attraction-Repulsion Spectrum in Neighbor Embeddings as a good place to start.

17

resented_ape t1_irr5f0k wrote

Your problem is really how to measure similarity between graphs. If you can do that, you can use pretty much any standard clustering technique.

This task is common in drug discovery and related cheminformatics fields where small molecules are represented as graphs. One common approach there is to generate a dictionary of subgraphs and then represent each molecule as a binary vector where a 1 means the subgraph is present and 0 otherwise. There is an obvious extension to counts of features (i.e. you have an integer vector with a N if the subgraph appears N times). This strategy crucially depends on how nodes and edges are labeled.

The similarity measure is usually something like Jaccard (often referred to as Tanimoto in the chemistry literature).

7