Chrysomite t1_iu5kcxr wrote

I've done this to some extent with PCA-KNN.

You can reduce the number of dimensions prior to clustering using principal component analysis. You'll select the first n components based on how much of the variance you want explained (I usually stop at 95%).
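A minimal sketch of that variance-cutoff step, assuming scikit-learn (the synthetic dataset and the 5-dimensional latent structure are purely illustrative):

```python
# Sketch: reduce dimensionality with PCA before clustering, keeping just
# enough components to explain 95% of the variance (the cutoff mentioned
# above). scikit-learn accepts a float n_components for exactly this.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 50 observed features driven by 5 latent factors plus noise,
# so most of the variance lives in a low-dimensional subspace.
latent = rng.normal(size=(500, 5))
W = rng.normal(size=(5, 50))
X = latent @ W + 0.1 * rng.normal(size=(500, 50))

pca = PCA(n_components=0.95)        # keep the first n components that
X_reduced = pca.fit_transform(X)    # cumulatively explain >= 95% variance

print(X_reduced.shape)              # far fewer than 50 columns
```

The reduced matrix is what you'd then feed into KNN or a clustering algorithm, with distance computations now running in the smaller space.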

You can borrow from computer graphics and use hierarchical spatial partitioning to speed up clustering and searching: binary space partitioning or k-d trees with splitting hyperplanes, where data points live in the leaf nodes and a leaf splits once it reaches a certain density. I haven't tried it, and I'm a little unsure of the geometry, but simple spatial hashing techniques might also work. Then keep track of neighboring cells and only apply clustering to a subset of the space.
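A quick sketch of the leaf-based partitioning idea, using SciPy's k-d tree rather than a hand-rolled BSP (the `leafsize` cap plays the role of the density threshold for splitting a leaf; the point cloud is synthetic):

```python
# Sketch: k-d tree partitioning for fast nearest-neighbor queries.
# SciPy's cKDTree recursively splits space with axis-aligned hyperplanes
# and stores points in leaf nodes of at most `leafsize` points.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
points = rng.uniform(size=(10_000, 3))   # 10k points in the unit cube

tree = cKDTree(points, leafsize=32)      # split any leaf beyond 32 points

query = rng.uniform(size=(1, 3))
dist, idx = tree.query(query, k=5)       # 5 nearest neighbors, O(log n)-ish
```

Each query only descends into the handful of leaf cells near the query point instead of scanning all 10,000 points, which is the same locality win the neighboring-spaces bookkeeping is after.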

It's admittedly imperfect, but I expect it's a decent approximation at scale.

2