ockham_blade t1_j6onxem wrote on January 31, 2023 at 8:33 PM

Hi! I am working on a clustering project on a dataset that has some numerical variables and one categorical variable with very high cardinality (~150 values). I am wondering whether it is possible to create an embedding for that feature after one-hot encoding (OHE) it. My first idea was to run an autoencoder on the 150 dummy features that result from the OHE, but then I thought it may not make sense, since the dummies are mutually exclusive (exactly one is active per row), so there is no co-occurrence structure among them for the autoencoder to learn. What do you think about this?
Along the same lines, I think that applying PCA is likely wrong. What would you suggest for finding a latent representation of that variable? One other idea: use the 150 dummy OHE columns to train a NN on some classification task, including an embedding layer, and then use that layer as a low-dimensional representation... does that make any sense? Thank you in advance!
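
To make that last idea concrete, here is a minimal sketch of what I have in mind, written in plain NumPy so the moving parts are visible. Everything in it is an assumption for illustration: the data is synthetic, the 3-class target `y` is made up (my real dataset has no obvious label yet), and names such as `E`, `W` and `emb_dim` are hypothetical. The "embedding layer" is just a 150 x 8 matrix `E`; multiplying a one-hot row by `E` is the same as looking up row `E[cat]`, so the dummies never need to be materialised.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_categories, emb_dim, n_classes = 3000, 150, 8, 3

# Synthetic stand-in for the real data: one integer category id per row,
# and a made-up target that depends on the category (assumption).
cat = rng.integers(0, n_categories, size=n_samples)
y = cat % n_classes
Y = np.eye(n_classes)[y]  # one-hot targets, shape (n_samples, n_classes)

# Embedding matrix (one row per category) and a linear softmax head.
E = 0.5 * rng.standard_normal((n_categories, emb_dim))
W = 0.5 * rng.standard_normal((emb_dim, n_classes))


def forward(E, W, cat):
    """Return class probabilities for each row."""
    logits = E[cat] @ W  # embedding lookup, then linear layer
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)


def cross_entropy(p, y):
    """Mean negative log-likelihood of the true classes."""
    return -np.mean(np.log(p[np.arange(len(y)), y]))


lr = 1.0
losses = []
for epoch in range(300):
    p = forward(E, W, cat)
    losses.append(cross_entropy(p, y))

    G = (p - Y) / n_samples       # gradient of the loss w.r.t. the logits
    grad_W = E[cat].T @ G         # shape (emb_dim, n_classes)
    grad_h = G @ W.T              # gradient w.r.t. each row's embedding
    grad_E = np.zeros_like(E)
    np.add.at(grad_E, cat, grad_h)  # accumulate per-category gradients

    W -= lr * grad_W
    E -= lr * grad_E

# Low-dimensional representation of the categorical variable, one row per
# sample, ready to be concatenated with the (scaled) numerical features.
cat_embedded = E[cat]  # shape (n_samples, emb_dim)
```

The catch I can already see is that the embedding only captures whatever is useful for the chosen classification target, so the resulting clusters would depend on that choice.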