MohamedRashad OP t1_irw4soy wrote

This is actually the first idea that came to me when I thought about this problem: backpropagate from the output image until I reach the text representation that produced it, then use a distance function to find the words closest to that representation.
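As a rough illustration of the last step (mapping a recovered representation back to words), here is a minimal sketch. The vocabulary, the 3-d vectors, and the names `VOCAB`, `cosine_distance`, and `closest_words` are all hypothetical; a real text encoder would supply the embeddings, and cosine distance is just one possible choice of distance function.

```python
import math

# Hypothetical 3-d word embeddings; a real text encoder would supply these.
VOCAB = {
    "cat": (0.9, 0.1, 0.0),
    "dog": (0.8, 0.3, 0.1),
    "car": (0.0, 0.2, 0.9),
    "sunset": (0.1, 0.9, 0.2),
}

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def closest_words(recovered, k=2):
    """Return the k vocabulary words whose embeddings are nearest to `recovered`."""
    ranked = sorted(VOCAB, key=lambda w: cosine_distance(recovered, VOCAB[w]))
    return ranked[:k]

# Pretend this vector came out of backpropagating from the image.
recovered = (0.85, 0.15, 0.05)
nearest = closest_words(recovered)
print(nearest)
```

With these made-up vectors the lookup ranks "cat" first and "dog" second, since both point in nearly the same direction as the recovered vector.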

My biggest problem with this idea was the variable length of the input text: the search space for the best words to describe the image becomes huge if there is no limit on the number of words I can use.
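To put a rough number on "huge": with a vocabulary of V tokens and prompts of up to L tokens, there are V + V^2 + ... + V^L candidate prompts. The sketch below just does that arithmetic. The helper name `count_prompts` is hypothetical, and the figures in the second call (49408 tokens, length 77) are an assumption borrowed from the CLIP tokenizer used by Stable Diffusion, not something stated above.

```python
def count_prompts(vocab_size, max_len):
    """Number of distinct token sequences of length 1..max_len."""
    return sum(vocab_size ** n for n in range(1, max_len + 1))

# Tiny case that is easy to check by hand: 10 + 100 + 1000.
print(count_prompts(10, 3))

# Assumed CLIP-like figures; print only the digit count of the result.
print(len(str(count_prompts(49408, 77))))
```

Even with a hard cap on length, the count under those assumed figures has more than 360 digits, so exhaustive search is out of the question and the cap only helps if the search is guided (for example by gradients).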


What are your thoughts about this (I would love to hear them)?


MohamedRashad OP t1_irvolp8 wrote

I thought about self-supervision for this task: feed the image whose prompt I want into an image-to-text model, then feed the resulting text to a diffusion model (DALL-E, Stable Diffusion) whose weights are frozen so they don't change.

The output image is then compared with the original image, and the loss is backpropagated to the image-to-text model so that it learns. In my humble opinion, this approach has two problems:

  1. Training such a system won't be easy, and I will need a lot of resources that I currently don't have.
  2. Even if I succeed, the resulting model won't generalize well enough.

This is, of course, assuming I manage to overcome the non-differentiable parts.
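A toy sketch of that training loop, under heavy assumptions: the "image" is a 2-d vector, the captioner is a single linear layer over three hypothetical tokens, and the frozen generator is a fixed matrix (`FROZEN_GENERATOR`, a made-up name). To sidestep the non-differentiable step of picking discrete words, the captioner outputs softmax probabilities and the generator consumes them directly; that relaxation is an assumption of this sketch, not part of the setup described above.

```python
import math

# Frozen toy "text-to-image" model: column k is the image produced by token k.
# Stored as tuples so it cannot be updated by accident.
FROZEN_GENERATOR = ((1.0, 0.0, 0.5),
                    (0.0, 1.0, 0.5))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def caption(weights, image):
    """Trainable image-to-text model: soft distribution over tokens."""
    logits = [sum(w * x for w, x in zip(row, image)) for row in weights]
    return softmax(logits)

def generate(token_probs):
    """Frozen generator: image reconstructed from the soft tokens."""
    return [sum(w * p for w, p in zip(row, token_probs)) for row in FROZEN_GENERATOR]

def reconstruction_loss(weights, image):
    recon = generate(caption(weights, image))
    return sum((r - x) ** 2 for r, x in zip(recon, image))

def train_step(weights, image, lr=0.2):
    """One gradient step on the captioner only; the generator stays fixed."""
    probs = caption(weights, image)
    recon = generate(probs)
    grad_recon = [2.0 * (r - x) for r, x in zip(recon, image)]
    # Backpropagate through the frozen generator (weights read, never written).
    grad_probs = [sum(FROZEN_GENERATOR[i][k] * grad_recon[i] for i in range(len(image)))
                  for k in range(len(probs))]
    # Backpropagate through the softmax.
    mean_grad = sum(p * g for p, g in zip(probs, grad_probs))
    grad_logits = [p * (g - mean_grad) for p, g in zip(probs, grad_probs)]
    # Gradient w.r.t. the captioner weights is outer(grad_logits, image).
    return [[w - lr * gl * x for w, x in zip(row, image)]
            for row, gl in zip(weights, grad_logits)]

image = [1.0, 0.0]
weights = [[0.0, 0.0] for _ in range(3)]
initial_loss = reconstruction_loss(weights, image)
for _ in range(200):
    weights = train_step(weights, image)
final_loss = reconstruction_loss(weights, image)
print(initial_loss, final_loss)
```

The reconstruction loss drops as the captioner learns to put most of its probability on the token whose generated image matches the input. A real system would still have to turn the soft distribution into actual words at inference time, which is exactly where the non-differentiable part comes back.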


MohamedRashad OP t1_irvmz2q wrote

But in this case, I will need to train an image captioning model on text-to-image data and hope that it gives me the correct prompt to recreate the image with the text-to-image model.

I think a better solution is to use backpropagation through the text-to-image model to recover the prompt that produced the image (an inverse state or something like it).
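A minimal sketch of that inversion idea, assuming a toy frozen generator that is just a fixed linear map `W` from a 2-d prompt embedding to a 3-d "image". Everything here (`W`, `generate`, `invert`, the step count and learning rate) is hypothetical; a real text-to-image model is far from linear, so this only shows the mechanics of optimizing the input while the weights stay fixed.

```python
# Frozen toy generator: image = W @ embedding. W is never updated.
W = ((1.0, 0.0),
     (0.0, 1.0),
     (1.0, 1.0))

def generate(embedding):
    return [sum(w * e for w, e in zip(row, embedding)) for row in W]

def invert(target_image, steps=500, lr=0.1):
    """Gradient descent on the embedding so that generate(embedding) matches the target."""
    embedding = [0.0, 0.0]
    for _ in range(steps):
        residual = [g - t for g, t in zip(generate(embedding), target_image)]
        # Gradient of the squared error w.r.t. the embedding: 2 * W^T @ residual.
        grad = [2.0 * sum(W[i][j] * residual[i] for i in range(len(W)))
                for j in range(len(embedding))]
        embedding = [e - lr * g for e, g in zip(embedding, grad)]
    return embedding

# Make a target image from a known embedding, then try to recover it.
true_embedding = [0.5, -0.25]
target = generate(true_embedding)
recovered = invert(target)
print(recovered)
```

In this linear toy the recovered embedding converges to the one that produced the target. With a real model the result would be a continuous embedding rather than words, so it would still need to be mapped back to the nearest tokens.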