Viewing a single comment thread. View all comments

samb-t t1_irvsicm wrote

If you have enough resources to train an autoregressive model then you could take advantage of knowing that these big text-to-image models are conditioned on CLIP embeddings and instead train an autoregressive model to predict prompts conditioned on CLIP image embeddings. That way there's no non-differentiable parts to bypass and the CLIP embeddings should be a pretty great descriptor of both the input image and the prompt.

If you don't have enough resources then (just thinking out loud, probably be a better way but might give some ideas) you could again use a pretrained CLIP model. 1. Embed the input image. 2. Using the CLIP text embedding network optimise the input text to get an embedding close to the image embedding. Problem there is again that text is discrete so you can't backprop. You could use gumbel softmax to approximate the discrete text values though (anneal down how continuous it is). Alternatively you could treat the embedding distance loss as an energy function, and use discrete MCMC, something like gibbs-with-gradients. But both of those options still probably aren't great, it's a horrible optimisation space