_Arsenie_Boca_ t1_irw1ti7 wrote

No, I believe you are right to think that an arbitrary image captioning model cannot accurately generate prompts that actually lead to a very similar image. After all, the prompts are very model-dependent.

Maybe you could use something similar to prompt tuning. Start with a fixed number of randomly initialized prompt embeddings, generate an image, and backprop the distance between the generated image and your target image into the embeddings. After convergence, you can perform a nearest-neighbor search to find the vocabulary words closest to each embedding.
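To make the loop concrete, here is a toy sketch of it in NumPy. Everything in it is a stand-in I made up for illustration: the six-word `vocab`, the random `vocab_emb` table, and the frozen linear map `W` playing the role of the image generator. A real text-to-image model is non-linear and would need an autograd framework, so treat this as a picture of the idea, not a working inversion method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: a tiny vocabulary with random token embeddings and a
# frozen linear map standing in for the text-to-image model.
vocab = ["cat", "dog", "sunset", "beach", "robot", "forest"]
emb_dim, img_dim, prompt_len = 8, 64, 3
vocab_emb = rng.normal(size=(len(vocab), emb_dim))
W = rng.normal(size=(img_dim, prompt_len * emb_dim))

def generate(prompt_emb):
    """Toy 'generator': flatten the prompt embeddings and project to pixels."""
    return W @ prompt_emb.reshape(-1)

# Target image made from a known prompt, so the recovered words can be checked.
target = generate(vocab_emb[[2, 3, 0]])  # "sunset beach cat"

# 1) Randomly initialized prompt embeddings of fixed length.
prompt = rng.normal(size=(prompt_len, emb_dim))

# 2) Gradient descent on the squared distance between generated and target image.
#    The gradient is written by hand here; a real model would need autograd.
lr = 0.5 / np.linalg.norm(W, 2) ** 2  # step size below 1 / largest eigenvalue of W^T W
for _ in range(2000):
    residual = generate(prompt) - target
    grad = 2.0 * (W.T @ residual).reshape(prompt_len, emb_dim)
    prompt -= lr * grad

# 3) Nearest-neighbor search: closest vocabulary embedding for each position.
dists = np.linalg.norm(prompt[:, None, :] - vocab_emb[None, :, :], axis=-1)
recovered = [vocab[i] for i in dists.argmin(axis=1)]
print(recovered)
```

In this toy case the problem is convex and the image has more dimensions than the prompt, so the optimized embeddings land on the original tokens. With a real generator the optimized embeddings may end up far from any actual word, and the nearest-neighbor step could then lose a lot.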

Not sure if this has been done, but I think it should work reasonably well


MohamedRashad OP t1_irw4soy wrote

This is actually the first idea that came to me when thinking about this problem ... Backpropagating from the output image until I reach the text representation that produced it, then using a distance function to find the words closest to that representation.

My biggest problem with this idea was the variable length of input words ... The search space for the best words to describe the image will be huge if there is no limit on the number of words that I can use to describe the image.


What are your thoughts about this (I would love to hear them)?


_Arsenie_Boca_ t1_irwat46 wrote

That's a fair point. You would have to fix the length of the prompt.

Not sure if this makes sense, but you could use an LSTM with an arbitrary constant input to generate a variable-length sequence of embeddings, and optimize the LSTM rather than the embeddings directly.
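
Roughly what I have in mind, as a forward-pass-only sketch in NumPy. The class name `PromptLSTM`, the layer sizes, and especially the stop head (`w_stop`, which ends generation once its sigmoid exceeds 0.5) are my own assumptions for illustration. Training is not shown: you would backprop the image distance into these weights with an autograd framework, and a hard stop like this one is not differentiable, so it would need some relaxation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class PromptLSTM:
    """Hypothetical single-layer LSTM that emits prompt embeddings step by step."""

    def __init__(self, input_dim, hidden_dim, emb_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.hidden_dim = hidden_dim
        # Stacked weights for the input, forget, cell and output gates.
        self.W = rng.normal(scale=0.1, size=(4 * hidden_dim, input_dim + hidden_dim))
        self.b = np.zeros(4 * hidden_dim)
        # Projection from hidden state to the prompt-embedding space.
        self.W_emb = rng.normal(scale=0.1, size=(emb_dim, hidden_dim))
        # Assumed stop head: decides after each step whether to end the prompt.
        self.w_stop = rng.normal(scale=0.1, size=hidden_dim)
        self.b_stop = 0.0

    def generate(self, const_input, max_len=16):
        h = np.zeros(self.hidden_dim)
        c = np.zeros(self.hidden_dim)
        embeddings = []
        for _ in range(max_len):
            # The same constant vector is fed in at every time step.
            z = self.W @ np.concatenate([const_input, h]) + self.b
            i, f, g, o = np.split(z, 4)
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
            h = sigmoid(o) * np.tanh(c)
            embeddings.append(self.W_emb @ h)
            if sigmoid(self.w_stop @ h + self.b_stop) > 0.5:
                break
        return np.stack(embeddings)

model = PromptLSTM(input_dim=4, hidden_dim=32, emb_dim=8)
embeddings = model.generate(np.ones(4), max_len=16)
print(embeddings.shape)  # (sequence length, embedding size)
```

The search is then over the LSTM weights, whose number does not depend on how long the prompt is, and the sequence length comes out of the model instead of being chosen up front.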