Submitted by MohamedRashad t3_y14lvd in MachineLearning

I am looking for research papers in this area and I am unable to find anything.

The idea is that I give the model an image and it spits out, with high confidence, the text that would create it. Prompt engineering seems like the closest thing to what I want, but when I searched the latest papers on it I found nothing useful.

What keywords should I use? Are there any good papers or tools I should know about?

Any help would be appreciated. Thanks in advance.

105

Comments


ReasonablyBadass t1_irvdh0j wrote

That would be caption generation, I believe. It has been around for a while.

55

MohamedRashad OP t1_irvj29w wrote

I thought about image captioning when I started my search, but what I always found were models that summarize the image, not ones that give me the correct prompt to recreate it.

14

Blutorangensaft t1_irvmhb1 wrote

What do you mean by "prompt"? How is it different from a caption?

9

m1st3r_c t1_irvmqv8 wrote

The text entered in an image-generation AI's prompt field, which the model uses to generate the art.

16

MohamedRashad OP t1_irvng87 wrote

The prompt for Stable Diffusion (for example) is the text that will result in the image I want.

The text I get from an image captioning model doesn't have to be the correct prompt to get the same image out of Stable Diffusion (I hope I am explaining my thinking right).

11

Blutorangensaft t1_irvo011 wrote

Wikipedia: " Stable Diffusion is a deep learning, text-to-image model released by startup StabilityAI in 2022. It is primarily used to generate detailed images conditioned on text descriptions"

If we take the example prompt "a photograph of an astronaut riding a horse" (see Wiki), I don't see how that is much different from an image caption. I guess the only difference is that it specifies the visual medium, e.g. a photo, a painting, or the like.

I don't think there is a difference between a prompt and a caption, and you might be overthinking this. However, you could always make captions sound more like prompts (if the specified medium is the only difference) by looking for specific datasets with a certain wording or manually adapting the data yourself.

5

MohamedRashad OP t1_irvovxd wrote

Maybe you are right (maybe I am overthinking the problem). I will give image captioning another try and see if it works.

6

_Arsenie_Boca_ t1_irw1ti7 wrote

No, I believe you are right to think that an arbitrary image captioning model cannot accurately generate prompts that actually lead to a very similar image. After all, prompts are very model-dependent.

Maybe you could use something similar to prompt tuning. Use a number of randomly initialized prompt embeddings, generate an image and backprop the distance between your target image and the generated image. After convergence, you can perform a nearest neighbor search to find the words closest to the embeddings.

Not sure if this has been done, but I think it should work reasonably well.
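A minimal PyTorch sketch of this idea, assuming a differentiable image generator is available; the `generate_image` stub, the dimensions, and the toy embedding table below are all placeholders, not a real diffusion call:

```python
import torch
import torch.nn.functional as F

vocab_size, embed_dim, prompt_len = 1000, 64, 8
token_embeddings = torch.randn(vocab_size, embed_dim)  # stand-in for the model's token embedding table

def generate_image(prompt_embeds):
    # Placeholder for a differentiable text-to-image generator conditioned on prompt embeddings.
    return prompt_embeds.mean(dim=0).repeat(3, 1)  # dummy "image"

target_image = torch.randn(3, embed_dim)  # the image whose prompt we want to recover

# 1. Randomly initialized soft prompt embeddings, optimized directly.
soft_prompt = torch.randn(prompt_len, embed_dim, requires_grad=True)
optimizer = torch.optim.Adam([soft_prompt], lr=0.1)

for step in range(200):
    optimizer.zero_grad()
    image = generate_image(soft_prompt)
    loss = F.mse_loss(image, target_image)  # distance between generated and target image
    loss.backward()                         # backprop through the (frozen) generator into the prompt
    optimizer.step()

# 2. After convergence, map each soft embedding to its nearest real token.
sims = F.normalize(soft_prompt, dim=-1) @ F.normalize(token_embeddings, dim=-1).T
nearest_tokens = sims.argmax(dim=-1)  # indices of the closest vocabulary words
print(nearest_tokens.tolist())
```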

19

MohamedRashad OP t1_irw4soy wrote

This is actually the first idea that came to me when thinking about this problem ... backpropagating from the output image until I reach the text representation that produced it, then using a distance function to find the closest words to that representation.

My biggest problem with this idea was the variable length of the input ... the search space for the best words will be huge if there is no limit on the number of words I can use to describe the image.

What are your thoughts about this (I would love to hear them)?

8

_Arsenie_Boca_ t1_irwat46 wrote

That's a fair point. You would have a fixed length for the prompt.

Not sure if this makes sense, but you could use an LSTM with an arbitrary constant input to generate a variable-length sequence of embeddings and optimize the LSTM rather than the embeddings directly.
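A rough sketch of that variant, again with a placeholder generator; the soft "stop head" used to make the sequence length effectively variable is my own illustrative addition, not something from the comment:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim, hidden_dim, max_len = 64, 128, 16

class PromptLSTM(nn.Module):
    """Generates prompt embeddings from a constant input; a soft stop head
    (a hypothetical addition) lets the model down-weight unused positions."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.to_embed = nn.Linear(hidden_dim, embed_dim)
        self.stop_head = nn.Linear(hidden_dim, 1)

    def forward(self):
        constant_input = torch.ones(1, max_len, embed_dim)  # arbitrary constant input
        hidden, _ = self.lstm(constant_input)
        embeds = self.to_embed(hidden)                      # (1, max_len, embed_dim)
        keep = torch.sigmoid(self.stop_head(hidden))        # soft "is this position used?" mask
        return embeds * keep                                # unused positions are driven toward zero

def generate_image(prompt_embeds):
    # Placeholder for a differentiable text-to-image generator.
    return prompt_embeds.mean(dim=1)

target_image = torch.randn(1, embed_dim)
model = PromptLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    optimizer.zero_grad()
    loss = F.mse_loss(generate_image(model()), target_image)
    loss.backward()  # optimize the LSTM weights, not the embeddings directly
    optimizer.step()
```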

6

LetterRip t1_iryz0a5 wrote

Unless an image was generated by a specific seed and denoiser, you likely can't find a prompt that will reproduce it exactly, since there isn't a one-to-one mapping. You can only find "close" images.

2

CremeEmotional6561 t1_irz67j8 wrote

>get me the correct prompts to recreate the image

AFAIK, diffusion-generated images depend on both the prompt/condition and the random generator seed for the noise. The prompt may be invertible by backpropagation w.r.t. the network activations, but what about the random generator seed?

1

franciscrot t1_irx4mwf wrote

You'd think, but I'm pretty sure no. Different models. Also different types of models, I think? Isn't most image captioning GAN?

One thing that's interesting about this question is that diffusion models, as I understand them (not too well), already involve a kind of "reversal" in their training: adding more and more noise to an image until it vanishes, then trying to create an image from "pure" noise.

Just in a really non-mathy way, I wonder how OP imagines this accommodating rerolling. Would it provide an image seed?

Related: Can the model produce the exact same image from two slightly different prompts?

3

ReasonablyBadass t1_irzfwml wrote

If stochastic noise is added in the process, "reverse engineering" the prompt shouldn't be possible, right?

Since, as per your last question, the same prompt would generate different images.

Actually, come to think of it, don't these systems spit out multiple images per prompt for the user to choose from?

2

milleniumsentry t1_irwa0j0 wrote

Reverse prompt tutorial. (CLIP Interrogator)

https://www.youtube.com/watch?v=JPBtaAQ2H2Y

Keep in mind that there is no metadata or stored prompt data in the image, so it cannot tell you the exact prompt used. It will, however, tell you how the model views the image, and how to generate something similar.
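For reference, the "how the model views the image" idea can be approximated directly with CLIP by scoring candidate prompt fragments against the image; CLIP Interrogator does essentially this with much larger candidate lists. A minimal sketch using OpenAI's `clip` package (the file name and candidate phrases below are made up):

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("mystery.png")).unsqueeze(0).to(device)  # hypothetical input image
candidates = ["a photograph", "an oil painting", "a 3d render",
              "an astronaut riding a horse", "a city at night"]        # candidate prompt fragments
text = clip.tokenize(candidates).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    scores = (image_features @ text_features.T).squeeze(0)

# Highest-scoring fragments are the ones CLIP associates most strongly with the image.
for score, phrase in sorted(zip(scores.tolist(), candidates), reverse=True):
    print(f"{score:.3f}  {phrase}")
```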

45

MohamedRashad OP t1_irwcx4x wrote

This is the closest thing to what I want.

Thanks

15

JoeySalmons t1_irwbazh wrote

I am really surprised it took this long for this to be mentioned. I was just about to comment about it too. Anyone who has used automatic1111's webui for Stable Diffusion would also know about its built-in CLIP Interrogate feature, which works somewhat well for Stable Diffusion. It might also work for other txt2img models.

11

nmkd t1_irx0805 wrote

Feeding the CLIP Interrogator output back into Stable Diffusion produces completely different images, though.

It's not good.

−5

milleniumsentry t1_irxa3sg wrote

No, no. It only tells you what prompt it would use to generate a similar image. There is no actual prompt data accessible in the image or its metadata. With millions of seeds and billions of word combinations, you wouldn't be able to reverse-engineer it.

I think embedding the prompt in the file, for those interested, would be a great step. Then you could just read the file and go from there.

9

visarga t1_irziac5 wrote

Now is the time to convince everyone to embed the prompt data in the generated images, since the trend is just starting. It could also be useful later, when we crawl the web, for separating real from generated images.

4

milleniumsentry t1_is13giv wrote

I honestly think this would be a step in the right direction, not so much for prompt sharing as for refinement. These networks will start off great at telling you "that's a hippo" or "that's a potato", but what happens when someone wants to create a hippotato?

I think that without some sort of tagging/self-reference, the data runs the risk of self-reinforcement, since the main function of the task is to bash a few things together into something else. At what point will it need extra information so that it knows: yes, this is what they wanted, this is a good representation of the task?

A tag-back loop would be phenomenal. Imagine you ask for a robotic cow with an astronaut friend. Some of those images will be lacking robot features, some won't look like cows, etc. Ideally, your finished piece would be tagged as well, but it might be missing the astronaut or another part of the initial prompt request. By checking which parts of the prompt did not show up as tags on the output, the two can be compared for a soft "success" rate.

1

vman512 t1_irw4qjv wrote

I think the most straightforward way to solve this is to generate a dataset of text->image with the diffusion model, and then learn the inverse function with a new model. But you'd need a gigantic dataset for this to work.

Diffusion models have quite diverse outputs, even given the same prompt. Maybe what you're asking for is: given an image and a random seed, design a prompt that replicates the image as closely as possible?

In that case, you can treat each image->text inference as an optimization problem and use a deep-dream-style loss to optimize for the best prompt. It may be helpful to first use this method to select the best latent encoding of the text, and then figure out how to learn the inverse function for the text embedding.
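A sketch of the dataset-generation step mentioned in the first paragraph above, using the `diffusers` library; the prompt list, output paths, and CSV layout are placeholders, and in practice you would need many thousands of prompts:

```python
import csv
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompts = ["a photograph of an astronaut riding a horse",
           "an oil painting of a lighthouse at dusk"]  # toy list; scale this up massively

with open("pairs.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for i, prompt in enumerate(prompts):
        generator = torch.Generator("cuda").manual_seed(i)  # record the seed too
        image = pipe(prompt, generator=generator).images[0]
        path = f"img_{i:06d}.png"
        image.save(path)
        writer.writerow([path, i, prompt])  # (image, seed, prompt) triples for the inverse model
```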

5

visarga t1_irzidqj wrote

> you'd need a gigantic dataset for this to work

If that's the problem then OP can use Lexica.art to search their huge database with a picture (they use CLIP), then lift the prompts from the top results. I think they even have an API. But the matching images can be quite different.

1

KingsmanVince t1_irvl5k5 wrote

That's called Image Captioning

4

MohamedRashad OP t1_irvlt15 wrote

Image Captioning doesn't have to provide the prompt that makes the image.

1

KingsmanVince t1_irvmgnx wrote

In image captioning, to train the model you provide any text that describes the images. By this definition, "the prompt that makes the image" does fall in. One text can produce many images; one image can be described by many texts. Images and texts have a many-to-many relationship.

For example, to caption a picture of a running dog, people can describe the whole process. That's still a caption.

For example, I prompt "running dog" and DALL-E 2 draws a running dog for me. Yes, that's a freaking caption.

3

m1st3r_c t1_irvn1d5 wrote

OP is looking for a way to take a piece of AI-generated art and reverse-engineer it, to find out what prompt terms, weightings, etc. were used in the model that created it.

7

MohamedRashad OP t1_irvmz2q wrote

But in this case, I would need to train an image captioning model on text-to-image data and hope that it provides the correct prompt to recreate the image with the text-to-image model.

I think a better solution is to use backpropagation through the text-to-image model to get the prompt that made the image (an inverse pass, or something like it).

1

KlutzyLeadership3652 t1_irwt908 wrote

Don't know how feasible this would be for you, but you could create a surrogate model that learns image-to-text (sketched below). Use your original text-to-image model to generate images given text (open caption-generation datasets can give you good examples of captions), and train the surrogate model to generate the text/caption back. This would be model-centric, so you don't need to worry about the many-to-many issue mentioned above.

This can be made more robust than a backward propagation approach.
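One way to realize such a surrogate is to fine-tune a pretrained captioner (here BLIP via `transformers`) on the generated (image, prompt) pairs. This is a minimal sketch: the toy `pairs` list stands in for the generated dataset, and I believe the BLIP model returns a language-modeling loss when `labels` are supplied, but treat that detail as an assumption to verify:

```python
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Stand-in for the (image, prompt) pairs generated by the frozen text-to-image model.
pairs = [(Image.new("RGB", (384, 384), "white"),
          "a photograph of an astronaut riding a horse")]

model.train()
for image, prompt in pairs:
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    outputs = model(**inputs, labels=inputs["input_ids"])  # assumed: LM loss when labels are given
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```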

3

BaconRaven t1_irwowua wrote

If you can do this, you could invent glasses for blind people that describe the world by taking an image and describing it in text, or even reading it to them.

4

HoLeeFaak t1_irvnlrf wrote

That's a pretty hard problem, because text generation involves argmax/sampling, which is not differentiable, so it's hard to optimize a model to generate text that is then fed as input to a text2img model to reproduce a given image. I guess you could do something similar to https://arxiv.org/abs/2111.14447, replacing CLIP with Stable Diffusion and changing the objective a bit, but I think it will be hard to optimize.

3

MohamedRashad OP t1_irvolp8 wrote

I thought about self-supervision for this task: feed the image whose prompt I want into an image-to-text model, and feed the resulting text to a diffusion model (DALL-E, Stable Diffusion) whose weights I freeze so they don't change.

The output image would be compared to the original image I entered, and the loss backpropagated to the image-to-text model so it can learn. The problems with this approach (in my humble opinion) are two:

  1. Training such a system won't be easy, and I will need a lot of resources I currently don't have.
  2. Even if I succeed, the resulting model won't be good enough to generalize.

This is, of course, assuming I manage to overcome the non-differentiable parts.

3

HoLeeFaak t1_irvoxe5 wrote

What you propose is a cycle loss. It's valid, but the biggest issue is the non-differentiable parts, and that's a problem I haven't found a solution to.

4

samb-t t1_irvsicm wrote

If you have enough resources to train an autoregressive model, you could take advantage of the fact that these big text-to-image models are conditioned on CLIP embeddings and train an autoregressive model to predict prompts conditioned on CLIP image embeddings. That way there are no non-differentiable parts to bypass, and the CLIP embeddings should be a pretty good descriptor of both the input image and the prompt.

If you don't have enough resources, then (just thinking out loud; there's probably a better way, but it might give some ideas) you could again use a pretrained CLIP model: 1. Embed the input image. 2. Using the CLIP text embedding network, optimise the input text to get an embedding close to the image embedding. The problem there is again that text is discrete, so you can't backprop. You could use Gumbel-softmax to approximate the discrete text values, though (annealing down how continuous it is). Alternatively, you could treat the embedding distance loss as an energy function and use discrete MCMC, something like Gibbs-with-gradients. But both of those options still probably aren't great; it's a horrible optimisation space.
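A toy illustration of the Gumbel-softmax route. The "text encoder" and embedding table below are stand-ins, since real CLIP text encoders take hard token ids and wiring soft tokens into them takes more surgery; only the optimization pattern is the point here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embed_dim, prompt_len = 500, 64, 6
token_embeddings = nn.Embedding(vocab_size, embed_dim)

# Stand-ins for CLIP's text tower and the CLIP embedding of the target image.
text_encoder = nn.Sequential(nn.Flatten(), nn.Linear(prompt_len * embed_dim, embed_dim))
target_embedding = torch.randn(1, embed_dim)

logits = torch.randn(1, prompt_len, vocab_size, requires_grad=True)  # a distribution over tokens per position
optimizer = torch.optim.Adam([logits], lr=0.1)

for step in range(300):
    tau = max(0.1, 1.0 - step / 300)                         # anneal from soft toward nearly discrete
    one_hot = F.gumbel_softmax(logits, tau=tau, hard=False)  # (1, prompt_len, vocab_size)
    embeds = one_hot @ token_embeddings.weight               # soft mixture of token embeddings
    loss = 1 - F.cosine_similarity(text_encoder(embeds), target_embedding).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

tokens = logits.argmax(dim=-1)  # read off the (approximately) discrete prompt
print(tokens.tolist())
```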

4

trutheality t1_irwzq5f wrote

There are a few names for that, including image captioning, (automatic) scene recognition, scene classification, and scene analysis. It's a much older task than image generation from text, so there are quite a lot of papers about it.

2

Infinitesima t1_irxbmdo wrote

Soon there will be "anti-reverse" obfuscation tools/techniques, I guess.

2

Apprehensive-Grade81 t1_irws8q8 wrote

Just out of curiosity, have you tried training a model on images to get text prompts? I can’t imagine this would be too difficult to try out.

1

freezelikeastatue t1_irx5s21 wrote

Have you looked at the image's metadata? I know it sounds stupid, but some of the image generators attach that info either in the watermark data or the metadata. I could be wrong though…

1

AnOnlineHandle t1_irxicvd wrote

You might also want to look into textual inversion, to create a new pseudo-word for a concept you want to describe, given a few reference images.
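For context, recent versions of `diffusers` can load a learned textual-inversion embedding and expose its pseudo-word for use in prompts; a small sketch (the concept repo and its `<cat-toy>` token are taken from the public sd-concepts-library and used here only as an example):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load a textual-inversion embedding trained from a few reference images;
# it registers the pseudo-token "<cat-toy>" in the tokenizer.
pipe.load_textual_inversion("sd-concepts-library/cat-toy")

image = pipe("a photograph of a <cat-toy> on a beach").images[0]
image.save("out.png")
```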

1

SnowyNW t1_irxnew9 wrote

Could you take a non-generated image and figure out exactly what to input as a prompt to get it as a result?

1

MohamedRashad OP t1_is2bjp2 wrote

This is my core question actually and it's a very hard one.

1

SnowyNW t1_is4809e wrote

Please keep me updated as I am searching for the same solution

2

windowpanez t1_iry3lbr wrote

For Stable Diffusion's webui, there is the "Interrogate" feature, which tries to recover a prompt from an image; under the hood it uses a model called BLIP [demo] [paper].
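A minimal BLIP captioning call via `transformers` (the file name is a placeholder); note that this gives a caption describing the image, not the exact original prompt:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("mystery.png").convert("RGB")  # hypothetical generated image
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(out[0], skip_special_tokens=True))
```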

1

-zharai t1_irzsfgh wrote

I don't think it's possible to have a high degree of certainty, at least with diffusion models. There is too much information lost, and too much noise injected.

E.g., how can you know, with any decent certainty, which features of the image were described in the prompt? And even if you manage to figure out which information in the image was specified and which was improvised by the model, there are still many ways to describe the same information.

1