sapnupuasop t1_iu0pqof wrote

Does clustering on 100s of features make sense? Maybe Spark could solve your problem, but you would have to check whether there are Spark implementations of the other algorithms.


jesusfbes OP t1_iu0vao5 wrote

It does make sense if your data points are high-dimensional, say vector embeddings for example. I believe Spark is an option; for algorithms that are not implemented, such as spectral clustering, you have the primitives to build them yourself. Thank you for your response.
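
To sketch what I mean by building it from the primitives (PySpark, with an RBF affinity and placeholder sigma/k; I'm skipping the Laplacian normalization that a full spectral clustering would use):

```python
# Rough spectral-clustering sketch on Spark: affinity matrix -> truncated SVD
# -> k-means on the spectral embedding. Sizes here are toy; sigma and k are
# tuning choices, and the dense n x n affinity only works for moderate n.
import numpy as np
from pyspark.sql import SparkSession
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix
from pyspark.ml.linalg import Vectors as MLVectors
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("spectral-sketch").getOrCreate()
sc = spark.sparkContext

points = np.random.rand(1000, 128)   # stand-in for your embeddings
sigma, k = 1.0, 5
bpoints = sc.broadcast(points)

def affinity_row(i):
    # RBF affinities from point i to every point.
    d2 = ((bpoints.value - bpoints.value[i]) ** 2).sum(axis=1)
    return Vectors.dense(np.exp(-d2 / (2 * sigma ** 2)))

A = RowMatrix(sc.parallelize(range(len(points))).map(affinity_row))

# Top-k singular vectors give the spectral embedding of each point.
svd = A.computeSVD(k, computeU=True)
embedding = svd.U.rows.map(lambda v: (MLVectors.dense(v.toArray()),)).toDF(["features"])

# Finish with Spark's built-in k-means on the embedded points.
labels = KMeans(k=k, seed=1).fit(embedding).transform(embedding)
```

For large n the dense n x n affinity is the bottleneck, so in practice you would sparsify it (e.g. keep only k-nearest-neighbor affinities) before the SVD.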


sapnupuasop t1_iu0zjy5 wrote

Yeah, I was thinking of the curse of dimensionality: with the standard Euclidean distance, for example, distances in high dimensions lose their meaning, but there are surely other distance metrics that could work there. Btw, I have successfully used sklearn to cluster a couple million rows.
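
The concentration effect is easy to check numerically; a quick sketch (plain numpy, illustrative sizes):

```python
# Relative contrast between the nearest and farthest point shrinks as
# dimension grows, which is what makes raw Euclidean distance uninformative.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((1000, d))      # uniform random points in d dimensions
    q = rng.random(d)              # a random query point
    dists = np.linalg.norm(X - q, axis=1)
    print(f"d={d:4d}  (max-min)/min = {(dists.max() - dists.min()) / dists.min():.3f}")
```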


PassionatePossum t1_iu3gete wrote

I have done it before and it works well. But I guess it depends on the use case. It is a classic technique in computer vision to cluster SIFT vectors (128 dimensions) on a training dataset. You then describe any image as a set of "visual words" (i.e. the IDs of the clusters its SIFT vectors fall into).
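
Roughly, the pipeline looks like this (a sketch using OpenCV's SIFT and sklearn's MiniBatchKMeans as stand-ins; the vocabulary size is just a typical choice):

```python
# Bag-of-visual-words sketch: cluster SIFT descriptors into a vocabulary,
# then describe each image as a histogram over its descriptors' cluster IDs.
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def sift_descriptors(image_paths):
    sift = cv2.SIFT_create()
    out = []
    for p in image_paths:
        img = cv2.imread(p, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(img, None)   # (n, 128) float32 or None
        out.append(desc if desc is not None else np.empty((0, 128), np.float32))
    return out

def build_vocabulary(per_image_desc, n_words=1000):
    all_desc = np.vstack([d for d in per_image_desc if len(d)])
    return MiniBatchKMeans(n_clusters=n_words).fit(all_desc)

def bovw_histogram(desc, vocab):
    # The "visual words" of one image: cluster IDs of its SIFT vectors.
    words = vocab.predict(desc)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```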

A colleague of mine wrote the clustering algorithm himself. It was just a normal k-means with the nearest neighbor search replaced by an approximate nearest neighbor search to speed things up.
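
A minimal sketch of that idea, with FAISS's HNSW index standing in for the approximate nearest-neighbor search (the library and its parameters are my assumptions, not the original code):

```python
# k-means where the exact nearest-centroid search is replaced by an
# approximate index rebuilt over the centroids each iteration.
import numpy as np
import faiss  # assumed ANN library; any ANN index would do

def ann_kmeans(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    X = np.ascontiguousarray(X, dtype=np.float32)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        index = faiss.IndexHNSWFlat(X.shape[1], 32)  # 32 = HNSW graph degree
        index.add(centroids)
        _, labels = index.search(X, 1)               # approximate assignment
        labels = labels.ravel()
        for j in range(k):                            # standard update step
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, labels
```

The win is in the assignment step: with a large vocabulary (hundreds of thousands of centroids), the exact nearest-centroid search dominates the runtime.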
