Yes, it is possible to train a small diffusion model conditioned on text captions from scratch on 64x64 images or even smaller. Depending on the complexity of the model and the number of GPUs you are using, it could take anywhere from a few hours to several days. If you are
Viewing a single comment thread. View all comments