
vman512 t1_irw4qjv wrote

I think the most straightforward way to solve this is to generate a dataset of text->image pairs with the diffusion model, and then learn the inverse (image->text) function with a new model (see the sketch below). But you'd need a gigantic dataset for this to work.
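
A minimal sketch of the dataset-generation step, assuming the `diffusers` library and some existing prompt corpus; the model id, prompt list, and sampling settings are just illustrative placeholders, and training the actual inverse model is left out:

```python
import torch
from diffusers import StableDiffusionPipeline

# Illustrative checkpoint; any text-to-image diffusion model would do.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Placeholder prompt corpus; in practice you'd want millions of diverse prompts.
prompts = [
    "a red bicycle leaning on a brick wall",
    "a watercolor fox in the snow",
]

pairs = []
for prompt in prompts:
    image = pipe(prompt, num_inference_steps=30).images[0]  # PIL image
    pairs.append((image, prompt))

# `pairs` is the (image, prompt) training set; an image->text model
# (e.g. an encoder-decoder captioner) would then be trained on it.
```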

Diffusion models have quite diverse outputs, even given the same prompt. Maybe what you're asking for is: given an image and a random seed, design a prompt that replicates the image as closely as possible?

In that case, you can treat each image->text inference as an optimization problem and use a deep-dream-style loss to optimize for the best prompt. It may be helpful to first use this method to select the best latent encoding of the text, and then figure out how to learn the inverse function for that text embedding (a rough sketch follows).
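
A minimal sketch of the "optimize the latent text encoding" step, in the spirit of textual inversion: instead of a deep-dream loss through the full sampling loop, it optimizes the text conditioning tensor directly against the standard denoising loss on the target image. The model id, learning rate, step count, and the placeholder target image are all assumptions, not tuned values:

```python
import torch
from diffusers import StableDiffusionPipeline

device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
vae, unet, scheduler = pipe.vae, pipe.unet, pipe.scheduler
vae.requires_grad_(False)
unet.requires_grad_(False)

# Placeholder target: a [1, 3, 512, 512] tensor scaled to [-1, 1] (real loading omitted).
target_image = torch.randn(1, 3, 512, 512, device=device)

with torch.no_grad():
    latents = vae.encode(target_image).latent_dist.sample() * vae.config.scaling_factor

# Learnable stand-in for the CLIP text-encoder output (77 tokens x 768 dims for SD 1.x).
cond = torch.randn(1, 77, 768, device=device, requires_grad=True)
opt = torch.optim.Adam([cond], lr=1e-2)

for step in range(500):
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (1,), device=device)
    noisy = scheduler.add_noise(latents, noise, t)
    pred = unet(noisy, t, encoder_hidden_states=cond).sample
    loss = torch.nn.functional.mse_loss(pred, noise)  # standard diffusion training loss
    opt.zero_grad()
    loss.backward()
    opt.step()

# `cond` now approximates a conditioning that reproduces the target image;
# recovering a discrete prompt from it is the remaining (hard) inverse step.
```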

5

visarga t1_irzidqj wrote

> you'd need a gigantic dataset for this to work

If that's the problem, then OP can use Lexica.art to search their huge database with a picture (they use CLIP for the matching), then lift the prompts from the top results; a rough sketch of that retrieval idea is below. I think they even have an API. But the matching images can be quite different.
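
A rough sketch of the CLIP nearest-neighbour retrieval idea, run against a small local bank of (embedding, prompt) pairs rather than Lexica's actual database or API, which is not shown here; the model id, file path, and bank contents are placeholders:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_image(img: Image.Image) -> torch.Tensor:
    # L2-normalized CLIP image embedding for cosine-similarity search.
    inputs = processor(images=img, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# Placeholder prompt bank; real embeddings would be precomputed from the database images.
bank_prompts = ["a castle at sunset, oil painting", "portrait of a cat, studio lighting"]
bank_embeds = torch.nn.functional.normalize(torch.randn(len(bank_prompts), 512), dim=-1)

query = embed_image(Image.open("query.png").convert("RGB"))  # hypothetical query image
scores = (bank_embeds @ query.T).squeeze(-1)  # cosine similarity to each stored image
best = scores.argmax().item()
print("closest stored prompt:", bank_prompts[best])
```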

1