MohamedRashad OP

But in this case, I will need to train an image captioning model on text-to-image data and hope that it gives me the correct prompt to recreate the image with the text-to-image model.

I think a better solution is to use backpropagation through the text-to-image model to recover the prompt that produced the image (an inverse state or something like it).
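
To make the idea concrete, here is a rough sketch of what I mean, under heavy assumptions: the "text-to-image model" is just a frozen random linear map (`generate` is a made-up stand-in, not a real model), and the "prompt" is a continuous embedding vector rather than text. The point is only to show the gradient flowing back to the input while the model weights stay fixed.

```python
import random

def generate(weights, embedding):
    """Toy stand-in for a frozen text-to-image model: image = W @ embedding."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in weights]

def reconstruction_loss(weights, embedding, target_image):
    """Half the squared distance between the generated and target images."""
    image = generate(weights, embedding)
    return 0.5 * sum((p - t) ** 2 for p, t in zip(image, target_image))

def invert(weights, target_image, steps=2000):
    """Gradient descent on the input embedding; the model weights never change."""
    n_pixels, dim = len(weights), len(weights[0])
    # Step size 1 / ||W||_F^2 keeps gradient descent stable for this quadratic loss.
    lr = 1.0 / sum(w * w for row in weights for w in row)
    embedding = [0.0] * dim
    for _ in range(steps):
        image = generate(weights, embedding)
        residual = [p - t for p, t in zip(image, target_image)]
        # Gradient of the loss with respect to the embedding is W^T @ residual.
        grad = [sum(weights[i][j] * residual[i] for i in range(n_pixels))
                for j in range(dim)]
        embedding = [e - lr * g for e, g in zip(embedding, grad)]
    return embedding

rng = random.Random(0)
weights = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]  # hypothetical frozen model
true_embedding = [0.5, -1.0, 0.25, 2.0]  # the "prompt" we pretend not to know
target_image = generate(weights, true_embedding)

initial_loss = reconstruction_loss(weights, [0.0] * 4, target_image)
recovered = invert(weights, target_image)
final_loss = reconstruction_loss(weights, recovered, target_image)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.6f}")
```

With a real model the map is nonlinear and prompts are discrete tokens, so the recovered embedding may not correspond to any readable prompt, and several different embeddings could reproduce the same image.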


KlutzyLeadership3652

I don't know how feasible this would be for you, but you could create a surrogate model that learns image-to-text. Use your original text-to-image model to generate images from text (open caption-generation datasets can give you good example captions), and train the surrogate model to generate the text/caption back. This would be model-centric, so you don't need to worry about the many-to-many issue mentioned above.

This can be made more robust than a backpropagation approach.
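
A minimal sketch of the surrogate setup, with everything toy-sized and assumed: `generate` is a hypothetical frozen linear stand-in for your text-to-image model, captions are represented as embedding vectors instead of text, and the surrogate is a single linear layer trained with a normalized least-mean-squares update. A real version would use a proper captioning network and a text decoder.

```python
import random

def generate(weights, embedding):
    """Toy stand-in for the frozen text-to-image model: image = W @ embedding."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in weights]

def predict(surrogate, image):
    """Surrogate image-to-text model: caption embedding = S @ image."""
    return [sum(s * p for s, p in zip(row, image)) for row in surrogate]

def train_surrogate(pairs, embed_dim, n_pixels, epochs=50, lr=0.5):
    """Fit S on (caption embedding, generated image) pairs."""
    surrogate = [[0.0] * n_pixels for _ in range(embed_dim)]
    for _ in range(epochs):
        for embedding, image in pairs:
            norm = sum(p * p for p in image) + 1e-12
            prediction = predict(surrogate, image)
            for j in range(embed_dim):
                # Normalized LMS step on the squared error of output j.
                scale = lr * (embedding[j] - prediction[j]) / norm
                surrogate[j] = [s + scale * p for s, p in zip(surrogate[j], image)]
    return surrogate

rng = random.Random(0)
weights = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]  # hypothetical frozen model

# Step 1: take captions (random embeddings here) and generate an image for each.
captions = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(200)]
pairs = [(caption, generate(weights, caption)) for caption in captions]

# Step 2: train the surrogate to map each generated image back to its caption.
surrogate = train_surrogate(pairs, embed_dim=4, n_pixels=8)

# Step 3: check it on a caption that was not in the training pairs.
held_out = [1.0, -0.5, 0.75, 0.2]
guess = predict(surrogate, generate(weights, held_out))
max_error = max(abs(g - h) for g, h in zip(guess, held_out))
print(f"max error on held-out caption: {max_error:.6f}")
```

Because the training images come from the same model you want to invert, the surrogate only has to learn that model's own mapping, which is what makes this model-centric.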