Submitted by External_Oven_6379 t3_yd0549 in MachineLearning
LastVariation t1_itpa0b2 wrote
Maybe the distance between two similar images is on a different scale to the distance between two different categorical labels. Using one-hot for the categoricals means two different labels are always at cosine distance 1 from each other. It could be worth looking at the cosine distances between all image embeddings with a given label, and at some average of those embeddings, to get a sense of the scale.
Also one-hot might not be best if the categorical labels aren't actually orthogonal - e.g. you'd expect there to be correlations between images of "cats" and "kittens".
Have you thought about just using something like CLIP for embedding both image and label?
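One way to read the scale check suggested above is to compute the mean pairwise cosine distance among the image embeddings that share a label, together with their average (centroid). The sketch below is a minimal pure-Python illustration, not a definitive recipe: the function names and the toy three-dimensional vectors are made up for the example, and real image embeddings would take their place.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def label_scale(embeddings):
    """Return (mean pairwise cosine distance, centroid) for one label."""
    pairs = list(combinations(embeddings, 2))
    mean_dist = sum(1.0 - cosine_similarity(a, b) for a, b in pairs) / len(pairs)
    dim = len(embeddings[0])
    centroid = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    return mean_dist, centroid


# Toy, made-up embeddings standing in for the image features of one label.
floral = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
mean_dist, centroid = label_scale(floral)

# For comparison: two different one-hot labels are always at cosine distance 1.
one_hot_dist = 1.0 - cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

If `mean_dist` for real data turns out much smaller than `one_hot_dist`, the label part of a concatenated vector would likely dominate the comparison, which is the scale mismatch described above.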
External_Oven_6379 OP t1_itpny3q wrote
Thank you for your input!
I checked the scale of the VGG19 feature embedding. All values are in the range [0, 9.7]. So in that case, should the values of the one-hot vector be either 0 or 9.7?
The labels are textures like floral or leopard. So you are right, they are not necessarily orthogonal, but it's difficult to estimate the correlation among these classes. So one-hot vectors were the most accessible to me.
I have read about CLIP when starting this. My thoughts were that CLIP input consists of images and a text input like an image description, e.g. "Flowers in the middle of a blue floor" (which is not categorical). Could categorical text be used?
LastVariation t1_itps1fq wrote
R.e. the scale of the one-hot vectors, it's a little hard to say; it probably depends on your data and task. Essentially you could scale the one-hot vectors up by sqrt(K), where K is the average similarity of two images with the same label. That way, sharing a label adds about as much to the similarity as two images being averagely similar for that label. In practice you'd probably want to fit K as a hyperparameter with some training data.
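As a rough sketch of the sqrt(K) idea, assuming K is measured as an average dot product between same-label image embeddings and that the label vector is simply concatenated onto the image embedding: the helper names, the value of `k`, and the toy vectors below are all hypothetical.

```python
import math


def one_hot(index, num_classes, scale=1.0):
    """One-hot vector with `scale` in place of the usual 1."""
    vec = [0.0] * num_classes
    vec[index] = scale
    return vec


def combine(image_embedding, label_index, num_classes, k):
    """Concatenate an image embedding with a one-hot label scaled by sqrt(k)."""
    return list(image_embedding) + one_hot(label_index, num_classes, math.sqrt(k))


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


k = 4.0  # hypothetical value; in practice tuned as a hyperparameter
img_a = [1.0, 2.0]  # made-up image embeddings
img_b = [2.0, 0.0]

same_label = dot(combine(img_a, 0, 3, k), combine(img_b, 0, 3, k))
diff_label = dot(combine(img_a, 0, 3, k), combine(img_b, 1, 3, k))
label_bonus = same_label - diff_label  # contribution of the shared label
```

Here the shared label adds exactly `k` to the unnormalised dot product. With cosine similarity the vectors are also divided by their norms, so the effect is only approximately `k`, which is one more reason to treat it as a tunable parameter.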
R.e. CLIP, you can input categorical text labels as raw text and the model is decent at interpreting it. I believe it's common practice to make the text a bit more natural language in that case, so "a photo of a <object>" rather than just "<object>".
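A small sketch of the prompt wrapping described above, using the texture labels mentioned earlier in the thread. The template wording and the function name are assumptions for illustration; the resulting strings would then be passed to whichever CLIP text encoder is in use, which is not shown here.

```python
def label_to_prompt(label, template="a photo of a {} texture"):
    """Wrap a bare categorical label in a short natural-language prompt."""
    return template.format(label)


labels = ["floral", "leopard"]
prompts = [label_to_prompt(label) for label in labels]
```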