ZombieRickyB

ZombieRickyB t1_iy957ed wrote

I mean I imagine it is an embedding in the rigorous sense in some cases. Can't really prove it, though. The issue isn't really about curvature so much as distances. You can't flatten an object homeomorphic (not necessarily diffeomorphic) to a sphere without introducing mega distortion in the metric space sense

1

ZombieRickyB t1_iy94v38 wrote

I mean, yeah, you're not wrong. If it works for you, it works for you. It's problem-space dependent, and there's virtually no research suggesting how much, if at all, things will be distorted in the process for a given scenario. For my work, I need theoretical bounds on conformal/isometric distortion; the distances involved are infinitely more important than classification accuracy. I work in settings where near-perfect classification accuracy is absolutely not expected, so artificially well-separated clusters just call the results into question.

There have been a number of cases, both in theory and in practice, where t-SNE and UMAP give results of questionable reliability. I'm sure I could construct an example in my space with little effort as well, and I'd rather go through some nonlinear transforms whose behavior I know well than spend a bunch of time tuning optimization hyperparameters, which could take forever.
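To make the reliability issue concrete, here's a minimal sketch (my own toy setup, nothing from a specific paper; the hyperparameters are illustrative, not recommendations): run t-SNE on a single isotropic Gaussian blob and the output will often fragment into apparent "clusters" that don't exist in the data.

```python
# Minimal sketch: t-SNE can manufacture apparent cluster structure
# from a single isotropic Gaussian blob (no real clusters exist).
# Hyperparameters are illustrative, not tuned recommendations.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))  # one blob, no cluster structure

emb = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(X)

# Inspect emb (e.g. scatter-plot it): with a small perplexity the output
# often splits into visually "well-separated" islands, which says more
# about the optimization than about the data.
```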

1

ZombieRickyB t1_iy70dqa wrote

Being good for visualization does not mean it's good to work in. The nonisometric nature is the killer: your space is fundamentally distorted, so what you put in is not necessarily reflected in the visualization.

Here is a stackexchange thread discussing the main issue (the same holds for UMAP), but I'll highlight a different example: https://stats.stackexchange.com/questions/263539/clustering-on-the-output-of-t-sne

Probably the most famous example of why you really can't assume meaning in general comes from the stereographic projection. This is a diffeomorphism between the sphere minus a point and the plane. Points close by on the punctured sphere, specifically around the puncture, need not be close on the plane; they'll actually be quite far apart. The more conformal distortion a diffeomorphism carries, the worse your results get. Is there data that is reasonably "sphere-valued"? Absolutely, I see it a bunch. Any attempt to flatten it will destroy the geometry in some fashion. This is just one example.
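You can check this numerically in a few lines (a sketch; the sample points are arbitrary):

```python
# Sketch: stereographic projection from the north pole of the unit
# sphere to the plane, (x, y, z) -> (x, y) / (1 - z).
# Two points close together near the puncture land far apart.
import numpy as np

def stereographic(p):
    x, y, z = p
    return np.array([x, y]) / (1.0 - z)

eps = 1e-3  # angular distance from the north pole (the puncture)
p1 = np.array([np.sin(eps), 0.0, np.cos(eps)])
p2 = np.array([-np.sin(eps), 0.0, np.cos(eps)])

# On the sphere the two points are at geodesic distance 2*eps ~ 0.002...
print(np.arccos(np.clip(p1 @ p2, -1, 1)))   # ~0.002

# ...but their images in the plane are ~4/eps ~ 4000 apart.
print(np.linalg.norm(stereographic(p1) - stereographic(p2)))  # ~4000
```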

Two things in these setups put you at risk: you're taking a minimum of some sort, which never needs to be a smooth mapping, and a log of something generally appears somewhere, which fundamentally changes the geometry. The log is there, just usually hidden, because someone at some point is assuming something is approximately Gaussian. That's why the energies look the way they do.
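To spell that out, here's the standard t-SNE objective (textbook form, nothing specific to this thread): the Gaussian sits in the high-dimensional affinities, and the KL divergence puts the log on top of it.

```latex
% Standard t-SNE formulation. The Gaussian appears in the
% high-dimensional affinities p_{j|i}; the KL divergence introduces
% the log. (p_{ij} denotes the usual symmetrized version of p_{j|i}.)
p_{j|i} = \frac{\exp\!\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\!\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)},
\qquad
q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}},
\qquad
C = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}.
```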

2

ZombieRickyB t1_iy60acn wrote

There is a large amount of distortion introduced in the last step of UMAP, namely the cross-entropy optimization. The output is designed to look pretty and separate things, at the risk of distorting distances and the underlying "space" in the process. That puts clustering quality in question.
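One way to see how much metric structure survives (a sketch, assuming the umap-learn package; data and settings are illustrative) is to rank-correlate pairwise distances before and after embedding:

```python
# Sketch of a sanity check on how well UMAP preserves distances,
# assuming the umap-learn package. Data and settings are illustrative.
import numpy as np
import umap  # pip install umap-learn
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 30))

emb = umap.UMAP(n_components=2, random_state=0).fit_transform(X)

# Rank correlation between original and embedded pairwise distances.
# It's often far from 1: local neighborhoods are kept, but global
# metric structure is not, so inter-cluster distances mean little.
rho, _ = spearmanr(pdist(X), pdist(emb))
print(rho)
```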

3

ZombieRickyB t1_iy4ox9u wrote

PCA doesn't just give a visual embedding, it gives an explicit coordinate system. In a 2D dimensionality reduction example, (1,0) naturally corresponds to a particular unit vector in the initial space. If you know what your coordinates mean in that space, that gives guidance. Those unit vectors are eigenvectors of the data covariance matrix.

A nice example of this: if you have a bunch of, say, black-and-white images of faces, vectorize them, perform PCA, take one of the eigenvectors, turn it back into an image, and display it, you get something resembling a face for at least the first few dimensions. By construction these vectors are orthogonal, so each captures a mechanism of variation that's at least a little interpretable.
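A minimal sketch of that experiment, using the Olivetti faces that ship with scikit-learn (any aligned grayscale face set works; the component count is arbitrary):

```python
# Sketch of the "eigenfaces" example with scikit-learn's Olivetti faces.
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA

faces = fetch_olivetti_faces()          # 400 images, 64x64, pre-vectorized
pca = PCA(n_components=16).fit(faces.data)

# Each principal direction is a unit vector in image space; reshaping it
# back to 64x64 gives something face-like for the first few components.
fig, axes = plt.subplots(4, 4, figsize=(6, 6))
for comp, ax in zip(pca.components_, axes.ravel()):
    ax.imshow(comp.reshape(64, 64), cmap="gray")
    ax.axis("off")
plt.show()
```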

24

ZombieRickyB t1_iy45jln wrote

I get the question but at the same time I don't. It really depends on what your goal is.

Case in point: I can use little more than PCA and maybe a bit of diffusion maps. There are fundamental properties that make me need these. Can other methods separate better? Sure! There are some really neat pictures you can make. But to get those pictures, there are changes to my data that make things unworkable unless I can invert them, and generally I can't. This doesn't matter to other people, but it's everything to me.
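For reference, since diffusion maps come up less often than PCA, here's a bare-bones numpy sketch (my own minimal version with the simplest normalization and a crude bandwidth heuristic, not a polished library):

```python
# Bare-bones diffusion maps sketch (numpy only; simplest normalization,
# bandwidth from a crude median heuristic). Illustrative, not tuned.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def diffusion_map(X, n_components=2, t=1):
    D2 = squareform(pdist(X)) ** 2
    eps = np.median(D2)                     # crude bandwidth heuristic
    K = np.exp(-D2 / eps)                   # Gaussian kernel
    d = K.sum(axis=1)
    A = K / np.sqrt(np.outer(d, d))         # symmetric normalization
    vals, vecs = np.linalg.eigh(A)
    idx = np.argsort(vals)[::-1]            # largest eigenvalues first
    vals, vecs = vals[idx], vecs[:, idx]
    psi = vecs / np.sqrt(d)[:, None]        # random-walk eigenvectors
    # Skip the trivial constant eigenvector; scale by eigenvalues^t.
    return psi[:, 1:n_components + 1] * vals[1:n_components + 1] ** t
```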

State of the art is what you make of it. For me? PCA still is, as it will be for many others. Doesn't matter if you can separate better, that's like the least of my interests. That being said, what is it for you?

Note: I hate quantitative classification benchmarks in general so any take I have related to "SOTA" will always be biased.

2

ZombieRickyB t1_iufdnzi wrote

Research scientist in industry, but I'll talk about something I lead outside of my job: a project in computer vision designed to basically create a good open-source alternative to a bunch of bio-imaging tools behind paywalls. I maintain the code base but work with my old lab to keep expanding and developing (at this point, I mostly do software work/troubleshoot methods from old papers that have weird issues). Currently focusing on anthropology, but long term I want to push into computational neuroscience, because I have a bunch of problems with that field, both as an adjacent researcher and as family of a neurology patient...

Biggest problem is by far the lack of existing tools outside of MATLAB, mostly for visualization and signal processing. Python is okay but not quite there in my experience. MATLAB alternatives are more or less in the same boat. The main challenge is that "performance evaluation" is inevitably qualitative. The benchmarks used to publish results aren't quite meaningless, but they're mostly for show/to make ML people happy. Practitioners don't really seem to care. That leads to a situation where I need really good interactive 3D visualization tools, and Python hasn't been good for that. The current "free" solution is to go into JavaScript, but other issues arise because things just aren't configured properly for the space I work in.

The other big challenge is a really high bar for data quality prior to subsequent analysis. Most of my actual work here is spent filling in niches of computer vision where almost nothing in top conferences/journals applies, which creates another challenge, since I have to do things from scratch. Say I'm registering 3D objects, because that's a lot of what I care about. Textures look mostly okay but there's a problem with a state-of-the-art method? Rejection, because prior work indicates that major findings in the past have been heavily biased by tiny issues... very little margin for error.

Then there's compute. To get meaningful attention, I have to assume the users will have mediocre compute resources, most certainly no GPUs. Also a big limitation. Can't pre-train much of anything either.

If I had a genie, I'd mostly wish for interactivity in Python to be better, because that would take care of a lot of headache... or for Julia or something to be more mature. Lots of interesting research could be done in short order, but it's bottlenecked by the lack of existing tools for intuitive interface design, plus the lack of time to build them, even if I ended up getting paid.

2