HoLeeFaak t1_irvnlrf wrote

That's a pretty hard problem, because text generation involves argmax or sampling, which is not differentiable. That makes it hard to optimize a model to generate text that is then fed as input to a text2img model to produce a given image. I guess you could do something similar to https://arxiv.org/abs/2111.14447, replacing CLIP with Stable Diffusion and changing the objective a bit, but I think it will be hard to optimize.
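
To make the non-differentiability point concrete, here is a minimal toy sketch in plain Python. Everything in it is made up for illustration: the 3-token vocabulary, the scalar "embeddings" in `EMBED`, the `TARGET` value, and the squared-error loss standing in for a text2img model. It is not the method from the linked paper. It only shows that a loss computed through argmax gets zero gradient with respect to the logits, while a softmax-weighted relaxation (one common workaround, swapped in here as an assumption) gets a non-zero one.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy setup: each token has a scalar "embedding", and the
# downstream loss is the squared distance to a target value.
EMBED = [0.0, 1.0, 2.0]
TARGET = 1.5

def loss_hard(logits):
    # Discrete decoding: pick the argmax token, feed its embedding downstream.
    token = max(range(len(logits)), key=lambda i: logits[i])
    return (EMBED[token] - TARGET) ** 2

def loss_soft(logits):
    # Continuous relaxation: feed the softmax-weighted average embedding.
    probs = softmax(logits)
    mixed = sum(p * e for p, e in zip(probs, EMBED))
    return (mixed - TARGET) ** 2

def finite_diff_grad(fn, logits, eps=1e-4):
    # Central finite differences as a stand-in for backprop.
    grad = []
    for i in range(len(logits)):
        up = list(logits)
        up[i] += eps
        down = list(logits)
        down[i] -= eps
        grad.append((fn(up) - fn(down)) / (2 * eps))
    return grad

logits = [0.2, 1.0, 0.5]
grad_hard = finite_diff_grad(loss_hard, logits)  # all zeros: argmax is piecewise constant
grad_soft = finite_diff_grad(loss_soft, logits)  # non-zero: there is a signal to follow
print("hard:", grad_hard)
print("soft:", grad_soft)
```

The hard version is flat almost everywhere, so the optimizer has nothing to follow. The soft version does give a gradient, but the averaged embedding is not something the downstream model ever sees for real text, which is one reason this kind of setup can still be hard to optimize.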