ReasonablyBadass t1_irvdh0j wrote

That would be Caption Generation, I believe. And has been around for a while.


MohamedRashad OP t1_irvj29w wrote

I thought about Image Captioning when I started my search, but what I always found were models that summarize the image, not ones that give me the correct prompt to recreate it.


Blutorangensaft t1_irvmhb1 wrote

What do you mean with "prompt"? How is it different from a caption?


m1st3r_c t1_irvmqv8 wrote

The text entered in an image-generation AI's prompt field, which it uses to generate the art.


MohamedRashad OP t1_irvng87 wrote

The prompt for Stable Diffusion (for example) is the text that will produce the image I want.

The text I get from an Image Captioning model isn't necessarily the correct prompt to get the same image out of Stable Diffusion (I hope I am explaining what I am thinking right).


Blutorangensaft t1_irvo011 wrote

Wikipedia: "Stable Diffusion is a deep learning, text-to-image model released by startup StabilityAI in 2022. It is primarily used to generate detailed images conditioned on text descriptions"

If we take the example prompt "a photograph of an astronaut riding a horse" (see Wiki), I don't see how that is much different from an image caption. I guess the only difference is that it specifies the visual medium, e.g. a photo, a painting, or the like.

I don't think there is a difference between a prompt and a caption, and you might be overthinking this. However, you could always make captions sound more like prompts (if the specified medium is the only difference) by looking for specific datasets with a certain wording, or by manually adapting the data yourself.


MohamedRashad OP t1_irvovxd wrote

Maybe you are right (maybe I am overthinking the problem). I will give Image Captioning another try and see if it works.


_Arsenie_Boca_ t1_irw1ti7 wrote

No, I believe you are right to think that an arbitrary image-captioning model cannot accurately generate prompts that actually lead to a very similar image. After all, prompts are very model-dependent.

Maybe you could use something similar to prompt tuning: use a number of randomly initialized prompt embeddings, generate an image, and backpropagate the distance between your target image and the generated image. After convergence, you can perform a nearest-neighbor search to find the words closest to the embeddings.

Not sure if this has been done, but I think it should work reasonably well.
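The scheme above can be sketched end to end. This is a toy stand-in, not a real Stable Diffusion setup: the "generator" here is just a small frozen random network mapping prompt embeddings to a fake "image" vector, and the vocabulary is random embeddings, so all sizes and names are illustrative assumptions. The structure is the actual idea: optimize free prompt embeddings against an image-space loss, then snap each one to its nearest vocabulary word.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = rng.normal(size=(100, 16))       # pretend token-embedding table (100 words, dim 16)
W = rng.normal(size=(16, 64)) / 4.0      # frozen "generator" weights (stand-in for the diffusion model)

def generate(P):
    # Frozen, differentiable generator stub: 3 prompt embeddings -> 64-dim "image"
    return np.tanh(P @ W).mean(axis=0)

target = generate(vocab[[3, 41, 7]])     # "image" made from a known prompt

P = rng.normal(size=(3, 16)) * 0.1       # free prompt embeddings to optimize
lr = 0.5
losses = []
for _ in range(2000):
    A = np.tanh(P @ W)                   # forward pass through the frozen generator
    y = A.mean(axis=0)
    losses.append(float(((y - target) ** 2).mean()))
    dy = 2 * (y - target) / y.size       # backprop the image-space MSE...
    dA = np.broadcast_to(dy / P.shape[0], A.shape)
    P -= lr * ((dA * (1 - A ** 2)) @ W.T)  # ...down to the prompt embeddings only

# After convergence, snap each learned embedding to its nearest vocabulary word.
tokens = np.argmin(((P[:, None, :] - vocab[None, :, :]) ** 2).sum(-1), axis=1)
print(losses[0], losses[-1], tokens)
```

In a real setup the stub would be the frozen diffusion model (making each backward pass expensive) and `vocab` would be the text encoder's token-embedding table; note the mapping is many-to-one, so the recovered tokens need not match the original prompt even when the image loss is low.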


MohamedRashad OP t1_irw4soy wrote

This is actually the first idea that came to me when thinking about this problem ... backpropagating from the output image until I reach the text representation that produced it, then using a distance function to get the closest words to this representation.

My biggest problem with this idea was the variable length of the input ... the search space for the best words to describe the image becomes huge if there is no limit on the number of words I can use to describe it.


What are your thoughts about this (I would love to hear them)?


_Arsenie_Boca_ t1_irwat46 wrote

That's a fair point. You would have a fixed length for the prompt.

Not sure if this makes sense, but you could use an LSTM with an arbitrary constant input to generate a variable-length sequence of embeddings, and optimize the LSTM rather than the embeddings directly.
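A minimal sketch of that generation mechanism, using a hand-rolled vanilla RNN cell in place of an LSTM to stay dependency-free. The weights here are random and untrained, purely to show how a constant input plus a learned stop gate yields a variable-length embedding sequence; in the real scheme these weights, not the embeddings, would receive the backpropagated image loss.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, hid = 16, 32
# Random "decoder" weights; these are what the image loss would optimize.
Wxh = rng.normal(size=(dim, hid)) * 0.3
Whh = rng.normal(size=(hid, hid)) * 0.3
Whe = rng.normal(size=(hid, dim)) * 0.3   # hidden state -> prompt embedding
Whs = rng.normal(size=(hid,)) * 0.3       # hidden state -> stop logit

def generate_prompt(max_len=10):
    """Unroll the RNN from a constant input, emitting one prompt embedding
    per step; a sigmoid stop gate decides when the sequence ends."""
    x = np.ones(dim)                      # arbitrary constant input
    h = np.zeros(hid)
    embeddings = []
    for _ in range(max_len):
        h = np.tanh(x @ Wxh + h @ Whh)    # vanilla RNN cell (LSTM in the real version)
        embeddings.append(h @ Whe)
        if 1 / (1 + np.exp(-(h @ Whs))) > 0.5:   # learned stop gate
            break
    return np.stack(embeddings)

prompt = generate_prompt()
print(prompt.shape)                       # (sequence length, embedding dim)
```

The stop gate makes the sequence length an output of the network rather than a hyperparameter, which is what lifts the fixed-length restriction of plain prompt tuning.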


LetterRip t1_iryz0a5 wrote

Unless the image was generated by a specific seed and denoiser, you likely can't actually find a prompt that will regenerate it, since there isn't a one-to-one mapping. You can only find 'close' images.


CremeEmotional6561 t1_irz67j8 wrote

>get me the correct prompts to recreate the image

AFAIK, diffusion-generated images depend on both the prompt/condition and the random generator seed for the noise. The prompt may be invertible by backpropagation through the network activations, but the random generator seed?


franciscrot t1_irx4mwf wrote

You'd think so, but I'm pretty sure not. Different models. Also different types of models, I think? Isn't most image captioning GAN-based?

One thing that's interesting about this question is that diffusion models, as I understand them (not too well), already involve a kind of "reversal" in their training: adding more and more noise to an image until it vanishes, then learning to create an image from "pure" noise.
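That forward "noising" process can be illustrated with a toy numpy sketch of a DDPM-style schedule (the linear beta range is the commonly quoted DDPM default; the 8×8 "image" is random data, so everything here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 8))            # stand-in "image"

# Linear beta schedule; alpha_bar_t is the cumulative product of (1 - beta).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def noised(x0, t, eps):
    # q(x_t | x_0): scale the image down and mix in Gaussian noise
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

eps = rng.normal(size=x0.shape)
early, late = noised(x0, 10, eps), noised(x0, T - 1, eps)

# Early steps stay close to the image; by the last step it is almost pure noise.
corr_early = np.corrcoef(x0.ravel(), early.ravel())[0, 1]
corr_late = np.corrcoef(x0.ravel(), late.ravel())[0, 1]
print(round(corr_early, 3), round(corr_late, 3))
```

The trained model learns the reverse of this, which is a different kind of "reversal" than recovering the prompt from a finished image.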

Just in a really non-mathy way, I wonder how OP imagines this accommodating rerolling. Would it also provide an image seed?

Related: Can the model produce the exact same image from two slightly different prompts?


ReasonablyBadass t1_irzfwml wrote

If stochastic noise is added in the process, "reverse engineering" the prompt shouldn't be possible, right?

Since, as per your last question, the same prompt would generate different images.

Actually, come to think of it, don't these systems spit out multiple images per prompt for the user to choose from?