MohamedRashad OP t1_irvj29w wrote on October 11, 2022 at 11:06 AM

I thought about image captioning when I started my search, but all I found were models that summarize the image rather than give me the correct prompt to recreate it.

Blutorangensaft t1_irvmhb1 wrote on October 11, 2022 at 11:44 AM

What do you mean by "prompt"? How is it different from a caption?

m1st3r_c t1_irvmqv8 wrote on October 11, 2022 at 11:47 AM

The terms entered in an image generation AI's prompt field, which it uses to generate the art.

MohamedRashad OP t1_irvng87 wrote on October 11, 2022 at 11:54 AM

The prompt of Stable Diffusion (for example) is the text that will result in the image I want.

The text that I get from an image captioning model doesn't have to be the correct prompt to get the same image from Stable Diffusion (I hope I am explaining my thinking clearly).

Blutorangensaft t1_irvo011 wrote on October 11, 2022 at 11:59 AM

Wikipedia: "Stable Diffusion is a deep learning, text-to-image model released by startup StabilityAI in 2022. It is primarily used to generate detailed images conditioned on text descriptions"

If we take the example prompt "a photograph of an astronaut riding a horse" (see Wiki), I don't see how that is much different from an image caption. I guess the only difference is that it specifies the visual medium, e.g. a photo, a painting, or the like.

I don't think there is a difference between prompt and caption and you might be overthinking this. However, you could always make captions sound more like prompts (if the specified medium is the only difference) by looking for specific datasets with a certain wording or manually adapting the data yourself.

MohamedRashad OP t1_irvovxd wrote on October 11, 2022 at 12:08 PM

Maybe you are right (maybe I am overthinking the problem). I will give image captioning another try and see if it works.

_Arsenie_Boca_ t1_irw1ti7 wrote on October 11, 2022 at 1:57 PM

No, I believe you are right to think that an arbitrary image captioning model cannot accurately generate prompts that actually lead to a very similar image. After all, the prompts are very model-dependent.

Maybe you could use something similar to prompt tuning. Start from a number of randomly initialized prompt embeddings, generate an image, and backpropagate the distance between your target image and the generated image into the embeddings. After convergence, you can perform a nearest-neighbor search to find the words closest to the embeddings.

Not sure if this has been done, but I think it should work reasonably well.
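To make the idea concrete, here is a minimal sketch under heavy assumptions: the "generator" is a frozen toy linear map standing in for the real text-to-image model, the vocabulary and its 2-D embeddings are made up, and the gradient is written out by hand instead of using an autodiff framework. It only shows the loop of optimizing soft prompt embeddings and then snapping them to the nearest words; a real diffusion model would be far harder to optimize through.

```python
import random

# Hypothetical toy vocabulary with hand-picked 2-D embeddings (not from any real model).
VOCAB = {
    "astronaut": [1.0, 0.0],
    "horse": [0.0, 1.0],
    "photo": [1.0, 1.0],
    "painting": [-1.0, 0.5],
}

# Frozen stand-in "generator": a fixed linear map applied to each token embedding.
W = [[1.0, 0.5], [0.0, 1.0], [0.5, 0.0]]


def generate(embs):
    """Map a list of token embeddings to one flat 'image' vector."""
    return [sum(w * x for w, x in zip(row, e)) for e in embs for row in W]


def invert(target, n_tokens, steps=500, lr=0.1, seed=0):
    """Gradient descent on the squared distance between generate(embs) and target."""
    rng = random.Random(seed)
    embs = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(n_tokens)]
    for _ in range(steps):
        out = generate(embs)
        residual = [o - t for o, t in zip(out, target)]
        for k, e in enumerate(embs):
            r = residual[k * len(W):(k + 1) * len(W)]
            # Gradient of the squared error w.r.t. this embedding: 2 * W^T r
            grad = [2 * sum(W[i][j] * r[i] for i in range(len(W))) for j in range(2)]
            for j in range(2):
                e[j] -= lr * grad[j]
    return embs


def nearest_word(e):
    """Nearest-neighbor search over the toy vocabulary (squared Euclidean distance)."""
    return min(VOCAB, key=lambda w: sum((a - b) ** 2 for a, b in zip(VOCAB[w], e)))


target = generate([VOCAB["astronaut"], VOCAB["horse"]])
recovered = [nearest_word(e) for e in invert(target, n_tokens=2)]
print(recovered)  # ['astronaut', 'horse']
```

Note that the prompt length (`n_tokens`) is fixed up front here, and the toy generator has no noise or seed, which is exactly what makes the recovery this clean.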

MohamedRashad OP t1_irw4soy wrote on October 11, 2022 at 2:18 PM

This is actually the first idea that came to me when thinking about this problem ... Backpropagating from the output image until I reach the text representation that produced it, then using a distance function to get the words closest to this representation.

My biggest problem with this idea was the variable length of input words ... The search space for the best words to describe the image will be huge if there is no limit on the number of words that I can use to describe the image.

What are your thoughts about this (I would love to hear them)?

_Arsenie_Boca_ t1_irwat46 wrote on October 11, 2022 at 3:00 PM

That's a fair point. With this approach the prompt would have a fixed length.

Not sure if this makes sense, but you could use an LSTM with an arbitrary constant input to generate a variable-length sequence of embeddings, and optimize the LSTM rather than the embeddings directly.
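A rough sketch of what the forward pass of such a generator could look like, in plain Python. Everything here is an assumption for illustration: the class name, the sizes, the random weights, and especially the stop signal (a sigmoid readout of the hidden state) that decides when the sequence ends. Training it would additionally need an autodiff framework and a differentiable way to handle the stopping decision, which this sketch does not cover.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class TinyLSTMGenerator:
    """Hypothetical generator: an LSTM cell fed the same constant input at every step."""

    def __init__(self, hidden=8, emb_dim=4, seed=0):
        rng = random.Random(seed)
        self.hidden = hidden
        # One weight matrix per gate (input, forget, output, candidate),
        # each acting on the concatenation [constant_input, h].
        self.gates = [make_matrix(hidden, 1 + hidden, rng) for _ in range(4)]
        self.to_emb = make_matrix(emb_dim, hidden, rng)  # h -> token embedding
        self.to_stop = make_matrix(1, hidden, rng)       # h -> stop logit (assumed)

    def generate(self, max_len=10, stop_threshold=0.5):
        h = [0.0] * self.hidden
        c = [0.0] * self.hidden
        embeddings = []
        for _ in range(max_len):
            x = [1.0] + h  # constant input concatenated with the previous hidden state
            i, f, o, g = (matvec(w, x) for w in self.gates)
            c = [sigmoid(fk) * ck + sigmoid(ik) * math.tanh(gk)
                 for fk, ck, ik, gk in zip(f, c, i, g)]
            h = [sigmoid(ok) * math.tanh(ck) for ok, ck in zip(o, c)]
            embeddings.append(matvec(self.to_emb, h))
            # Stop once the (assumed) stop readout crosses the threshold.
            if sigmoid(matvec(self.to_stop, h)[0]) > stop_threshold:
                break
        return embeddings


prompt_embeddings = TinyLSTMGenerator().generate()
print(len(prompt_embeddings), "embeddings of size", len(prompt_embeddings[0]))
```

The sequence length now comes out of the network instead of being chosen by hand, with `max_len` as a safety cap.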

LetterRip t1_iryz0a5 wrote on October 12, 2022 at 2:05 AM

Unless an image was generated by a specific seed and denoiser, you likely can't actually find a prompt that will generate it, since there isn't a one-to-one mapping. You can only find "close" images.

CremeEmotional6561 t1_irz67j8 wrote on October 12, 2022 at 3:04 AM

>get me the correct prompts to recreate the image

AFAIK, diffusion-generated images depend on both the prompt/condition and the random generator seed for the noise. The prompt may be invertible by backpropagation with respect to the network activations, but the random generator seed?
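A tiny illustration of the seed dependence, with a made-up stand-in sampler (`toy_sample` is hypothetical and just adds seeded Gaussian noise to a made-up conditioning vector; it is not how a diffusion sampler works internally). The point is only that the output is a function of the pair (prompt, seed), so the image alone does not pin down the prompt.

```python
import random


def toy_sample(prompt_vector, seed, noise_scale=0.5):
    """Hypothetical stand-in sampler: output = prompt signal + seeded Gaussian noise."""
    rng = random.Random(seed)
    return [p + noise_scale * rng.gauss(0.0, 1.0) for p in prompt_vector]


prompt = [1.0, 0.0, -1.0]  # made-up conditioning vector for one prompt
same_a = toy_sample(prompt, seed=42)
same_b = toy_sample(prompt, seed=42)
other = toy_sample(prompt, seed=7)

print(same_a == same_b)  # same prompt and same seed reproduce the output
print(same_a == other)   # same prompt with a different seed does not
```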

franciscrot t1_irx4mwf wrote on October 11, 2022 at 6:16 PM

You'd think, but I'm pretty sure no. Different models. Also different types of models, I think? Isn't most image captioning GAN?

One thing that's interesting about this question is that diffusion models, as I understand them (not too well), do already involve a kind of "reversal" in their training: noise is added to an image step by step until the image vanishes, and the model then tries to create an image from "pure" noise.

Just in a really non-mathy way, I wonder how OP imagines this accommodating rerolling? Would it provide an image seed?

Related: Can the model produce the exact same image from two slightly different prompts?
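On the "reversal" point above, the noising half can be sketched in a few lines. This uses the standard DDPM-style closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps with a linear beta schedule; the schedule values and the four-number "image" are illustrative assumptions and not necessarily what Stable Diffusion uses.

```python
import math
import random


def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) over the first t steps of a linear schedule."""
    prod = 1.0
    for s in range(t):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def noised(x0, t, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    a = alpha_bar(t)
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in x0]


rng = random.Random(0)
image = [0.8, -0.3, 0.5, 0.1]  # made-up "pixels"
for t in (0, 250, 1000):
    # The signal weight sqrt(alpha_bar_t) shrinks toward 0 as t grows.
    print(t, round(math.sqrt(alpha_bar(t)), 3), noised(image, t, rng))
```

At t = 0 the image is untouched, and by the last step almost none of the original signal is left, which is the "vanishing" described above. The model is trained to undo this one step at a time.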

ReasonablyBadass t1_irzfwml wrote on October 12, 2022 at 4:37 AM

If stochastic noise is added in the process, "reverse engineering" the prompt shouldn't be possible, right?

Since, as per your last question, the same prompt would generate different images.

Actually, come to think of it, don't the systems spit out multiple images for a prompt for the user to choose one?

[D] Reversing Image-to-text models to get the prompt

ReasonablyBadass t1_irvdh0j wrote on October 11, 2022 at 9:52 AM