Submitted by MohamedRashad t3_y14lvd in MachineLearning

I am looking for research papers in this area and I am unable to find anything.

The idea is that I give the model an image and it spits out the text that would create it with high confidence. I think prompt engineering is the closest thing to what I want, but when I searched the latest papers on it I found nothing useful.


What keywords should I use? Are there any good papers or tools I should know about?

Any help will be appreciated, Thanks in advance.




ReasonablyBadass t1_irvdh0j wrote

That would be Caption Generation, I believe. And it has been around for a while.


MohamedRashad OP t1_irvj29w wrote

I thought about Image Captioning when I started my search, but what I always found were models that summarize the image, not ones that give me the correct prompts to recreate it.


Blutorangensaft t1_irvmhb1 wrote

What do you mean with "prompt"? How is it different from a caption?


m1st3r_c t1_irvmqv8 wrote

The terms used in an image-generation AI's prompt field, which it uses to generate the art.


MohamedRashad OP t1_irvng87 wrote

A prompt for Stable Diffusion (for example) is the text that will result in the image I want.

The text that I get from an Image Captioning model doesn't have to be the correct prompt to get the same image from Stable Diffusion (I hope I am explaining what I am thinking right).


Blutorangensaft t1_irvo011 wrote

Wikipedia: " Stable Diffusion is a deep learning, text-to-image model released by startup StabilityAI in 2022. It is primarily used to generate detailed images conditioned on text descriptions"

If we take the example prompt "a photograph of an astronaut riding a horse" (see Wiki), I don't see how that is much different from an image caption. I guess the only difference is that it specifies the visual medium, e.g. a photo, a painting, or the like.

I don't think there is a difference between prompt and caption, and you might be overthinking this. However, you could always make captions sound more like prompts (if the specified medium is the only difference) by looking for specific datasets with a certain wording, or by manually adapting the data yourself.


MohamedRashad OP t1_irvovxd wrote

Maybe you are right (maybe I am overthinking the problem). I will give Image Captioning another try and see if it works.


_Arsenie_Boca_ t1_irw1ti7 wrote

No, I believe you are right to think that an arbitrary image captioning model cannot accurately generate prompts that actually lead to a very similar image. After all, the prompts are very model-dependent.

Maybe you could use something similar to prompt tuning. Use a number of randomly initialized prompt embeddings, generate an image and backprop the distance between your target image and the generated image. After convergence, you can perform a nearest neighbor search to find the words closest to the embeddings.

Not sure if this has been done, but I think it should work reasonably well
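A toy sketch of what this prompt-tuning loop could look like. Everything here is a made-up stand-in: a random linear map plays the frozen "generator", and a random matrix plays the token-embedding vocabulary. A real version would backprop through the frozen diffusion model instead of using an analytic gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a frozen "generator" G maps prompt embeddings to an "image",
# and a small vocabulary of word embeddings. Both are random here; in practice
# G would be the frozen text-to-image model and vocab its token-embedding table.
d_embed, d_image, n_vocab, n_tokens = 8, 32, 50, 3
G = rng.normal(size=(d_embed * n_tokens, d_image))
vocab = rng.normal(size=(n_vocab, d_embed))

def generate(prompt_emb):
    # Frozen, differentiable "text-to-image" model.
    return prompt_emb.reshape(-1) @ G

# Target image we want to invert back to a prompt.
true_tokens = rng.integers(0, n_vocab, size=n_tokens)
target = generate(vocab[true_tokens])

# Optimize n_tokens continuous embeddings by gradient descent on the
# image-space L2 distance, keeping the generator fixed.
emb = rng.normal(size=(n_tokens, d_embed))
lr = 0.005
for _ in range(5000):
    diff = generate(emb) - target                    # image-space residual
    grad = (G @ diff).reshape(n_tokens, d_embed)     # gradient of 0.5*||G e - t||^2
    emb -= lr * grad

# After convergence, snap each embedding to its nearest vocabulary word.
dists = ((emb[:, None, :] - vocab[None, :, :]) ** 2).sum(-1)
recovered = dists.argmin(axis=1)
print(true_tokens, recovered)
```

In this linear toy the optimization is convex, so the nearest-neighbor snap recovers the original tokens; with a real diffusion model the loss surface is far uglier, which is exactly the difficulty discussed below.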


MohamedRashad OP t1_irw4soy wrote

This is actually the first idea that came to me when thinking about this problem ... backpropagating from the output image until I reach the text representation that made it happen, then using a distance function to get the closest words to this representation.

My biggest problem with this idea was the variable length of the input words ... the search space for the best words to describe the image will be huge if there is no limit on the number of words I can use to describe the image.


What are your thoughts about this (I would love to hear them)?


_Arsenie_Boca_ t1_irwat46 wrote

That's a fair point. You would be stuck with a fixed length for the prompt.

Not sure if this makes sense but you could use an LSTM with arbitrary constant input to generate a variable-length sequence of embeddings and optimize the LSTM rather than the embeddings directly.
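A rough sketch of the shape of that idea, forward pass only. All sizes and weights here are made up for illustration; a real version would use an actual LSTM (e.g. torch.nn.LSTM) and backprop the image loss into these weights rather than leaving them random.

```python
import numpy as np

rng = np.random.default_rng(1)

# A tiny recurrent net that unrolls a constant input into a variable-length
# sequence of prompt embeddings. The weights (not the embeddings) would be
# what gets optimized against the image-reconstruction loss.
d_hidden, d_embed, max_len = 16, 8, 10

W_h = rng.normal(scale=0.5, size=(d_hidden, d_hidden))  # recurrence weights
W_o = rng.normal(scale=0.5, size=(d_hidden, d_embed))   # per-step embedding head
w_stop = rng.normal(scale=0.5, size=d_hidden)           # per-step stop logit
x_const = rng.normal(size=d_hidden)                     # arbitrary constant input

def generate_prompt_embeddings():
    h = np.zeros(d_hidden)
    seq = []
    for _ in range(max_len):
        h = np.tanh(W_h @ h + x_const)   # recurrent update on the constant input
        seq.append(W_o.T @ h)            # emit one prompt embedding
        if 1 / (1 + np.exp(-w_stop @ h)) > 0.5:
            break                        # learned stop decision ends the sequence
    return np.stack(seq)

prompt = generate_prompt_embeddings()
print(prompt.shape)  # (sequence_length, d_embed), length decided by the net
```

The point is that sequence length becomes an output of the model rather than a fixed hyperparameter, so the search space concern above is pushed into the recurrent weights.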


LetterRip t1_iryz0a5 wrote

Unless an image was generated by a specific seed and denoiser, you likely can't actually find a prompt that will generate it, since there isn't a one-to-one mapping. You can only find 'close' images.


CremeEmotional6561 t1_irz67j8 wrote

>get me the correct prompts to recreate the image

AFAIK, diffusion-generated images depend on both the prompt/condition and the random generator seed for the noise. The prompt may be invertible by backpropagation w.r.t. network activations, but the random generator seed?


franciscrot t1_irx4mwf wrote

You'd think, but I'm pretty sure not. Different models. Also different types of models, I think? Isn't most image captioning GAN-based?

One thing that's interesting about this q is that the diffusion models, as I understand them (not too well) do already involve a kind of "reversal" in their training - adding more and more noise to an image till it vanishes, then trying to create an image from "pure" noise.

Just in a really non-mathy way, I wonder how OP imagines this accommodating rerolling? Would it provide an image seed?

Related: Can the model produce the exact same image from two slightly different prompts?


ReasonablyBadass t1_irzfwml wrote

If stochastic noise is added in the process, "reverse engineering" the prompt shouldn't be possible, right?

Since, as per your last question, the same prompt would generate different images.

Actually, come to think of it, don't the systems spit out multiple images for a prompt for the user to choose from?


milleniumsentry t1_irwa0j0 wrote

Reverse prompt tutorial. (CLIP Interrogator)

Keep in mind that there is no metadata/stored data in the image, so it cannot tell you the exact prompts used. It will, however, tell you how the model views the image, and how to generate something similar.


MohamedRashad OP t1_irwcx4x wrote

This is the closest thing to what I want.



JoeySalmons t1_irwbazh wrote

I am really surprised it took this long for this to be mentioned/suggested; I was just about to comment about it too. Anyone who has used automatic1111's webui for Stable Diffusion will also know about its built-in CLIP interrogate feature, which works somewhat well for Stable Diffusion. It might also work for other txt2img models.


nmkd t1_irx0805 wrote

Feeding the CLIP interrogator result back into Stable Diffusion results in completely different images though.

It's not good.


milleniumsentry t1_irxa3sg wrote

No no. It only tells you what prompts it would use to generate a similar image. There is no actual prompt data accessible in the image/metadata. With millions of seeds and billions of word combinations, you wouldn't be able to reverse engineer it.

I think having an embed for those interested would be a great step. Then you could just read the file and go from there.


visarga t1_irziac5 wrote

Now is the time to convince everyone to embed the prompt data in the generated images, since the trend is just starting. Could be also useful later when we crawl the web, to separate real from generated images.


milleniumsentry t1_is13giv wrote

I honestly think this will be a step in the right direction. Not actually for prompt sharing, but for refinement. These networks will start off great at telling you.. that's a hippo.... that's a potato.. but what happens when someone wants to create a hippotato...

I think without some sort of tagging/self-reference, the data runs the risk of self-reinforcement... as the main function of the task is to bash a few things together into something else. At what point will it need extra information so that it knows, yes.. this is what they wanted... this is a good representation of the task...

A tag-back loop would be phenomenal. Imagine you ask for a robotic cow with an astronaut friend. Some of those images will be lacking robot features, some won't look like cows... etc. Ideally, your finished piece would be tagged as well... but perhaps missing the astronaut... or another part of the initial prompt request. By removing tags that were not generated by the prompt, the two can be compared for a soft 'success' rate.


vman512 t1_irw4qjv wrote

I think the most straightforward way to solve this is to generate a dataset of text->image with the diffusion model, and then learn the inverse function with a new model. But you'd need a gigantic dataset for this to work.

Diffusion models have quite diverse outputs, even given the same prompt. Maybe what you're asking for is: given an image and a random seed, design a prompt that replicates the image as closely as possible?

In that case, you can treat each image->text inference as an optimization problem and use a deep-dream-style loss to optimize for the best prompt. It may be helpful to first use this method to select the best latent encoding of the text, and then figure out how to learn the inverse function for the text embedding.
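A toy version of the dataset idea, with a random linear map standing in for the text-to-image model. Everything here is made up for illustration; a real setup would sample prompts through the diffusion model and train a neural image-to-text network rather than a least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(2)

# Treat the text-to-image model as a fixed forward map F, sample
# (text, image) pairs from it, then fit the inverse map image -> text
# on those pairs.
d_text, d_image, n_samples = 8, 32, 1000

F = rng.normal(size=(d_text, d_image))     # stand-in "text-to-image" model

texts = rng.normal(size=(n_samples, d_text))  # sampled "prompts" (as embeddings)
images = texts @ F                            # the "generated images"

# Learn the inverse map by least squares on the synthetic pairs.
W_inv, *_ = np.linalg.lstsq(images, texts, rcond=None)

# Invert a held-out image back to its text embedding.
t_new = rng.normal(size=d_text)
recovered = (t_new @ F) @ W_inv
print(np.allclose(recovered, t_new, atol=1e-6))
```

In this linear toy the inverse is exact; the "gigantic dataset" caveat above is precisely because a real diffusion model is nonlinear, stochastic, and far higher-dimensional.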


visarga t1_irzidqj wrote

> you'd need a gigantic dataset for this to work

If that's the problem, then OP can use their search tool to query the huge database with a picture (they use CLIP), then lift the prompts from the top results. I think they even have an API. But the matching images can be quite different.


KingsmanVince t1_irvl5k5 wrote

That's called Image Captioning


MohamedRashad OP t1_irvlt15 wrote

Image Captioning doesn't have to provide the prompt that makes the image.


KingsmanVince t1_irvmgnx wrote

In Image Captioning, to train the model, you provide any text that describes the images. By this definition, "the prompt that makes the image" does FALL IN. One text can produce many images; one image can be described by many texts. Images and texts have a many-to-many relationship.

For example, to caption a picture of a running dog, people can describe the whole process. That's still a caption.

For example, I prompt "running dog" and DALL-E 2 draws me a running dog. Yes, that's a freaking caption.


m1st3r_c t1_irvn1d5 wrote

OP is looking for a way to take a piece of AI generated art and reverse engineer the model that created it, to find out what prompt terms and weightings etc were used to create it.


MohamedRashad OP t1_irvmz2q wrote

But in this case, I would need to train an image captioning model on text-to-image data and hope that it provides me with the correct prompt to recreate the image using the text-to-image model.

I think a better solution is to use the backward propagation in text-to-image models to get the prompt that made the image (an inverse state or something like it).


KlutzyLeadership3652 t1_irwt908 wrote

Don't know how feasible this would be for you, but you could create a surrogate model that learns image-to-text. Use your original text-to-image model to generate images given text (open caption-generation datasets can give you good examples of captions), and train the surrogate model to generate the text/caption back. This would be model-centric, so you don't need to worry about the many-to-many issue mentioned above.

This can be made more robust than a backward propagation approach.


BaconRaven t1_irwowua wrote

If you can do this, you can invent glasses for blind people that describe the world by taking an image and describing it in text, or even reading it to them.


HoLeeFaak t1_irvnlrf wrote

That's a pretty hard problem, because text generation involves argmax/sampling, which is not differentiable, so it's hard to optimize a model to generate text that will then be fed as input to a text2img model to generate a given image. I guess you could do something similar to replacing CLIP with Stable Diffusion, changing the objective a bit, but I think it will be hard to optimize.


MohamedRashad OP t1_irvolp8 wrote

I thought about self-supervision for this task. I feed the image whose prompt I want into an image-to-text model, and feed the resulting text into a diffusion model (DALL-E, Stable Diffusion) whose weights are frozen so they don't change.

The output image would be compared to the original image I entered, and the loss backpropagated to the image-to-text model so it learns. The problems with this approach (in my humble opinion) are two:

  1. Training such a system won't be easy, and I will need a lot of resources I currently don't have.
  2. Even if I succeed, the resulting model won't be good enough for generalization.

This is, of course, assuming I manage to overcome the non-differentiable parts.


HoLeeFaak t1_irvoxe5 wrote

What you propose is a cycle-loss. It's valid, but the biggest problem is the non-differentiable parts, and this is a big problem that I didn't find a solution to.


samb-t t1_irvsicm wrote

If you have enough resources to train an autoregressive model then you could take advantage of knowing that these big text-to-image models are conditioned on CLIP embeddings and instead train an autoregressive model to predict prompts conditioned on CLIP image embeddings. That way there's no non-differentiable parts to bypass and the CLIP embeddings should be a pretty great descriptor of both the input image and the prompt.

If you don't have enough resources, then (just thinking out loud; there's probably a better way, but it might give some ideas) you could again use a pretrained CLIP model: 1. Embed the input image. 2. Using the CLIP text-embedding network, optimise the input text to get an embedding close to the image embedding. The problem there is again that text is discrete, so you can't backprop. You could use Gumbel-softmax to approximate the discrete text values, though (annealing down how continuous it is). Alternatively, you could treat the embedding-distance loss as an energy function and use discrete MCMC, something like Gibbs-with-gradients. But both of those options still probably aren't great; it's a horrible optimisation space.
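A minimal sketch of the Gumbel-softmax trick on its own (numpy, sizes made up): instead of a hard token index, keep a differentiable soft one-hot over the vocabulary and anneal the temperature so it sharpens toward a discrete choice. In a real setup the soft one-hot would multiply CLIP's token-embedding table and the logits would be what backprop optimizes.

```python
import numpy as np

rng = np.random.default_rng(3)

n_vocab = 10
logits = rng.normal(size=n_vocab)  # unnormalized token scores (would be learned)

def gumbel_softmax(logits, tau):
    # Perturb logits with Gumbel(0,1) noise, then take a tempered softmax.
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + g) / tau
    y = np.exp(y - y.max())            # stable softmax
    return y / y.sum()                 # soft one-hot; differentiable w.r.t. logits

soft = gumbel_softmax(logits, tau=1.0)    # smooth mixture over tokens
hard = gumbel_softmax(logits, tau=0.01)   # approaches a one-hot as tau -> 0
print(soft.max(), hard.max())
```

At high temperature the sample is a smooth, optimizable mixture; as tau is annealed toward 0 it collapses toward a single token, which is the "anneal down how continuous it is" step.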


trutheality t1_irwzq5f wrote

There are a few names for that including image captioning, (automatic) scene recognition, scene classification, scene analysis. It's a much older task than image generation from text so there are quite a lot of papers about it.


Infinitesima t1_irxbmdo wrote

Soon there will be 'Anti-reverse obfuscation' tool/technique I guess


Apprehensive-Grade81 t1_irws8q8 wrote

Just out of curiosity, have you tried training a model on images to get text prompts? I can’t imagine this would be too difficult to try out.


freezelikeastatue t1_irx5s21 wrote

Have you looked at the image's metadata? I know it sounds stupid, but I know some of the image generators attach that info either in the watermark data or the metadata. I could be wrong tho...


AnOnlineHandle t1_irxicvd wrote

You might also want to look into textual inversion, to create a new pseudo word for a concept you want to describe given a few reference images.


SnowyNW t1_irxnew9 wrote

Could you get a non generated image and figure out exactly what to input as a prompt to receive it as a result?


MohamedRashad OP t1_is2bjp2 wrote

This is my core question actually and it's a very hard one.


SnowyNW t1_is4809e wrote

Please keep me updated as I am searching for the same solution


windowpanez t1_iry3lbr wrote

For Stable Diffusion's webui, there is the "interrogate" feature, which will try to get the prompt from an image; under the hood it uses an algorithm called BLIP [demo] [paper].


-zharai t1_irzsfgh wrote

I don't think it's possible to have a high degree of certainty, at least with diffusion models. Too much information is lost, and too much noise injected.

E.g. how can you know, with any decent certainty, which features of the image were described in the prompt? And further, even if you manage to figure out which information in the image was specified and which was improvised by the model, there are so many ways to describe the same information.