Submitted by External_Oven_6379 t3_yd0549 in MachineLearning
I am currently working on a database retrieval framework that takes an image and categorical text data, creates an embedding of each, and computes the distance from this combined embedding to other known data points. However, my results seem to be off.
So I was wondering, what would be an appropriate way of combining these embeddings?
The details about the embedding:
- image features are extracted with a pretrained VGG19 model
- categorical text features are embedded by creating one-hot vectors
- both embeddings are combined by concatenating the vectors
So in the end, I get a vector that looks like this: [image embedding (1, 8192) + text embedding (1, 137)], i.e. a combined vector of shape (1, 8329).
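For reference, roughly what my pipeline looks like (a sketch, assuming Keras for VGG19 and scikit-learn for the one-hot part; `train_categories` and the paths are just placeholders):

```python
import numpy as np
from tensorflow.keras.applications.vgg19 import VGG19, preprocess_input
from tensorflow.keras.preprocessing import image as keras_image
from sklearn.preprocessing import OneHotEncoder

# VGG19 feature extractor without the classifier head; the exact output
# size depends on input size / pooling (mine ends up at 8192 dims)
vgg = VGG19(weights="imagenet", include_top=False, pooling="avg")

def embed_image(path):
    img = keras_image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(keras_image.img_to_array(img), axis=0))
    return vgg.predict(x).reshape(1, -1)           # (1, n_image_dims)

# one-hot encode the categorical text columns (137 dims total in my case);
# use sparse=False instead of sparse_output=False on older scikit-learn
encoder = OneHotEncoder(sparse_output=False)
encoder.fit(train_categories)                      # (n_samples, n_cat_columns), placeholder

def embed_sample(img_path, categories):
    img_emb = embed_image(img_path)                # (1, n_image_dims)
    cat_emb = encoder.transform([categories])      # (1, 137)
    return np.concatenate([img_emb, cat_emb], axis=1)
```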
Use of the embeddings:
The combined embeddings are then used to find the nearest neighbors by computing cosine distances.
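Roughly like this (a sketch; `combined_embeddings` and `query_embedding` are placeholders for the vectors built above):

```python
from sklearn.neighbors import NearestNeighbors

# combined_embeddings: (n_samples, 8192 + 137) array of the concatenated vectors
nn = NearestNeighbors(n_neighbors=5, metric="cosine")
nn.fit(combined_embeddings)

# query with a new combined embedding of shape (1, 8192 + 137)
distances, indices = nn.kneighbors(query_embedding)
```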
Question/Issue:
My question is, would that be an appropriate way of combining features of a sample in n-dimensional space? Are there any other/preferred ways?
LastVariation t1_itpa0b2 wrote
Maybe the distance between two similar images is on a different scale than the distance between two different categorical labels. Using one-hot vectors for the categoricals means two different labels are always a cosine distance of 1 apart. It could be worth looking at the cosine distances between all image embeddings that share a given label, and at some average of those embeddings, to get a sense of the scale.
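Something along these lines (just a sketch, assuming your image embeddings are in a NumPy array with one label per row):

```python
import numpy as np
from scipy.spatial.distance import pdist

def label_scales(image_embeddings, labels):
    """Rough per-label spread of image embeddings under cosine distance."""
    scales = {}
    for label in np.unique(labels):
        group = image_embeddings[labels == label]
        if len(group) > 1:
            # mean pairwise cosine distance within this label
            scales[label] = pdist(group, metric="cosine").mean()
    return scales
```

If those within-label image distances come out much smaller (or larger) than the fixed distance of 1 between two different one-hot labels, one part of the concatenated vector will dominate the combined cosine distance, which could explain the odd results.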
Also one-hot might not be best if the categorical labels aren't actually orthogonal - e.g. you'd expect there to be correlations between images of "cats" and "kittens".
Have you thought about just using something like CLIP for embedding both image and label?
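e.g. with the HuggingFace transformers CLIP model (a sketch; the checkpoint name, image path, and label prompts are whatever you pick):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")
labels = ["a photo of a cat", "a photo of a kitten"]   # your categorical labels as text

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# both embeddings live in the same space, so cosine similarity is meaningful
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarity = image_emb @ text_emb.T
```

Since the image and text embeddings land in a shared space, you can concatenate or average them and the distances stay on a comparable scale, rather than bolting a one-hot block onto VGG features.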