farmingvillein t1_ixncbjn wrote

Check out https://twitter.com/cxbln/status/1595652302123454464:

> 🎉Introducing RoentGen, a generative vision-language foundation model based on #StableDiffusion, fine-tuned on a large chest x-ray and radiology report dataset, and controllable through text prompts!

(Not your full problem, but you may find it helpful!)

More generally, you could probably use the same image-to-text techniques that are used to validate a Stable Diffusion model.

Or, for a really quick-and-dirty solution, you could use a model like theirs to generate training data ((image, text) pairs) and then train an image-to-text model on that data (they do a variant of this in the paper).
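
A rough sketch of the data-generation half of that idea, in Python. Everything here is an assumption for illustration: `build_synthetic_pairs` and `generate_image` are hypothetical names, and `generate_image` stands in for whatever text-to-image model you end up using (this is not RoentGen's actual API).

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

Image = TypeVar("Image")


def build_synthetic_pairs(
    prompts: Iterable[str],
    generate_image: Callable[[str], Image],  # hypothetical: wraps your text-to-image model
    images_per_prompt: int = 1,
) -> List[Tuple[Image, str]]:
    """Run each prompt through the generator and keep (image, text) pairs."""
    pairs: List[Tuple[Image, str]] = []
    for prompt in prompts:
        for _ in range(images_per_prompt):
            # The prompt that produced the image becomes its caption/label.
            pairs.append((generate_image(prompt), prompt))
    return pairs
```

The resulting list could then be fed to whatever image-to-text training loop you prefer. Generating several images per prompt is one simple way to get more variety out of a limited set of reports.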
