vman512 t1_irw4qjv wrote

I think the most straightforward way to solve this is to use the diffusion model to generate a dataset of text->image pairs, and then learn the inverse function (image->text) with a new model. But you'd need a gigantic dataset for this to work.
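Something like this minimal sketch, assuming the Hugging Face diffusers library and a public Stable Diffusion checkpoint (the prompt list and output paths are just placeholders):

```python
# A minimal sketch, assuming the Hugging Face diffusers library and a public
# Stable Diffusion checkpoint; the prompt list and paths are placeholders.
import json
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# In practice you would sample millions of diverse prompts, e.g. from a
# caption corpus; two literals here just to keep the sketch runnable.
prompts = [
    "a red bicycle leaning against a brick wall",
    "an oil painting of a lighthouse at dusk",
]

with open("pairs.jsonl", "w") as f:
    for i, prompt in enumerate(prompts):
        image = pipe(prompt, num_inference_steps=30).images[0]
        path = f"img_{i:06d}.png"
        image.save(path)
        f.write(json.dumps({"image": path, "text": prompt}) + "\n")

# The resulting (image, text) pairs can then be used to fine-tune any
# off-the-shelf captioning model as the approximate inverse function.
```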

Diffusion models have quite diverse outputs, even given the same prompt. Maybe what you're asking for is: given an image and a random seed, design a prompt that replicates the image as closely as possible?

In that case, you can treat each image->text inference as an optimization problem and use a DeepDream-style loss to optimize for the best prompt. It may be helpful to first use this method to find the best continuous latent encoding of the text, and then figure out how to invert that embedding back into discrete text.
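A sketch of that optimization, in the spirit of textual inversion: rather than backpropagating through the whole sampler, optimize a free continuous text embedding against the standard denoising loss on the target image. The model id, shapes, and hyperparameters here are assumptions, not a definitive recipe:

```python
# A sketch of prompt optimization against a single target image, in the spirit
# of textual inversion; model id, shapes, and hyperparameters are assumptions.
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel

model_id = "runwayml/stable-diffusion-v1-5"
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae").to("cuda").eval()
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet").to("cuda").eval()
scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")
for p in list(vae.parameters()) + list(unet.parameters()):
    p.requires_grad_(False)  # only the prompt embedding is trained

# Target image as a [1, 3, 512, 512] tensor in [-1, 1]; loading is omitted,
# a random tensor stands in so the sketch runs end to end.
target = torch.rand(1, 3, 512, 512, device="cuda") * 2 - 1

with torch.no_grad():
    latents = vae.encode(target).latent_dist.sample() * 0.18215

# The "prompt" is a free continuous embedding with the text encoder's output
# shape (77 tokens x 768 dims for SD 1.5): the latent encoding of the text.
emb = torch.randn(1, 77, 768, device="cuda", requires_grad=True)
opt = torch.optim.Adam([emb], lr=1e-2)

for step in range(500):
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (1,), device="cuda")
    noisy = scheduler.add_noise(latents, noise, t)
    pred = unet(noisy, t, encoder_hidden_states=emb).sample
    loss = F.mse_loss(pred, noise)  # standard denoising objective
    opt.zero_grad()
    loss.backward()
    opt.step()

# emb now conditions the model toward the target image; recovering discrete
# text from it (e.g. nearest tokens in the embedding table) is the open part.
```

The fixed-seed replication version of the problem is harder, since you'd have to differentiate through the full sampling loop; this per-timestep loss is the usual cheap approximation.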
