External_Oven_6379 OP t1_itysio2 wrote on October 27, 2022 at 8:47 AM

Reply to comment by Appropriate_Ant_4629 in Combining image and text embedding [P] by External_Oven_6379

That's interesting! It would be great to share some experiences and knowledge.

External_Oven_6379 OP t1_ityshd9 wrote on October 27, 2022 at 8:46 AM

Reply to comment by londons_explorer in Combining image and text embedding [P] by External_Oven_6379

thanks for this input! I will try this out

External_Oven_6379 OP t1_itysdcc wrote on October 27, 2022 at 8:45 AM

Reply to comment by DigThatData in Combining image and text embedding [P] by External_Oven_6379

Thank you for your input. Since I am doing the project by myself, I have no one to bounce ideas off. This is the first time I am getting input from an experienced audience. I don't remember exactly when I decided on the architecture, but I do remember that I also had OpenAI's CLIP on the table and must have concluded that the approach I described would work better... how wrong I was!

External_Oven_6379 OP t1_itpo4t5 wrote on October 25, 2022 at 12:08 PM

Reply to comment by Dear-Acanthisitta698 in Combining image and text embedding [P] by External_Oven_6379

I used the pretrained VGG19 for the image. Regarding CLIP, I had the doubts above. I thought the categories were already the densest form of information representation. Can you recommend a model apart from CLIP?

External_Oven_6379 OP t1_itpny3q wrote on October 25, 2022 at 12:07 PM

Reply to comment by LastVariation in Combining image and text embedding [P] by External_Oven_6379

Thank you for your input!

I checked the scale of the VGG19 feature embedding. All values lie in the range [0, 9.7]. So in that case, should the values of the one-hot vector be either 0 or 9.7?

The labels are textures like floral or leopard. So you are right, they are not necessarily orthogonal, but it's difficult to estimate the correlation among these classes. So one-hot vectors were the most accessible to me.
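To make the question concrete, here is a minimal sketch of what I mean by matching the scale: the one-hot vector uses 9.7 (the observed maximum of the image features) instead of 1 as its "hot" value and is then concatenated with the image embedding. The label list, the helper names `one_hot` and `combine`, and the random stand-in for the VGG19 features are all assumptions for illustration, not my actual pipeline.

```python
import numpy as np

# Hypothetical label set; only "floral" and "leopard" are named in the post.
LABELS = ["floral", "leopard"]


def one_hot(label, labels=LABELS, hot_value=1.0):
    """Return a one-hot vector whose active entry is hot_value instead of 1."""
    vec = np.zeros(len(labels), dtype=np.float32)
    vec[labels.index(label)] = hot_value
    return vec


def combine(image_emb, label, hot_value=1.0):
    """Concatenate an image embedding with the scaled one-hot label vector."""
    image_emb = np.asarray(image_emb, dtype=np.float32)
    return np.concatenate([image_emb, one_hot(label, hot_value=hot_value)])


# Stand-in for a VGG19 feature vector (assumed 8-dimensional here for brevity).
image_emb = np.random.default_rng(0).uniform(0.0, 9.7, size=8)

# Use the observed maximum of the image features as the one-hot "hot" value.
combined = combine(image_emb, "leopard", hot_value=9.7)
```

Whether 9.7 is the right value is exactly what I am unsure about; `hot_value` is left as a parameter so other weightings can be tried.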

I read about CLIP when I started this. My understanding was that CLIP takes an image plus a text input such as an image description, e.g. "Flowers in the middle of a blue floor" (which is not categorical). Could categorical text be used instead?
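For what it's worth, one way categorical labels are commonly fed to CLIP-style text encoders is to wrap each label in a short sentence template, similar to the "a photo of a {label}" prompts used for zero-shot classification. A rough sketch of that idea, where the template wording and the helper name `label_to_prompt` are my own assumptions:

```python
# Hypothetical label set; only "floral" and "leopard" are named in the post.
LABELS = ["floral", "leopard"]


def label_to_prompt(label, template="a photo of a {} texture"):
    """Turn a categorical label into a short sentence for a text encoder."""
    return template.format(label)


# These strings would then be tokenized and passed to the text encoder
# in place of a free-form image description.
prompts = [label_to_prompt(label) for label in LABELS]
```

The resulting strings would stand in for the free-form description, so the categories still end up as text input.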