Submitted by crappr t3_11qynbp in MachineLearning

Does anyone have experience training a small diffusion model conditioned on text captions from scratch on 64x64 images or possibly even smaller?

I would like to run it only on images of text to see if it is able to render text. How long would this potentially take if I ran it on 1-2 GPUs? Is this something that’s even possible?

5

Comments


rpnewc t1_jc5u6xd wrote

Check out lucidrains' great GitHub repo. It works beautifully.
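
For the 64x64 text-conditioned case, a minimal sketch along the lines of the imagen-pytorch README might look like this (the class and argument names here are from memory of that repo and may have changed between versions, so treat them as assumptions):

```python
# Sketch: training one small 64x64 text-conditioned base U-Net with
# lucidrains' imagen-pytorch. API details (class/argument names) are assumed
# from the repo's README and may differ by version.
import torch
from imagen_pytorch import Unet, Imagen

# small base U-Net; the cross-attention layers consume the text conditioning
unet = Unet(
    dim = 32,
    cond_dim = 512,
    dim_mults = (1, 2, 4),
    num_resnet_blocks = 2,
    layer_attns = (False, True, True),
    layer_cross_attns = (False, True, True),
)

# single-stage model at 64x64, no super-resolution unets
imagen = Imagen(
    unets = (unet,),
    image_sizes = (64,),
    timesteps = 1000,
    cond_drop_prob = 0.1,  # dropout for classifier-free guidance
).cuda()

# stand-in batch: in practice you would loop over (image, caption) pairs,
# e.g. rendered images of text plus the string that was rendered
images = torch.randn(4, 3, 64, 64).cuda()
texts = ['hello world', 'the quick brown fox', 'lorem ipsum', 'diffusion']

loss = imagen(images, texts = texts, unet_number = 1)
loss.backward()
# ...then an optimizer step, repeated for many iterations,
# followed by imagen.sample(texts = [...]) to check text rendering
```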

5

PM_ME_JOB_OFFER t1_jc5z6f0 wrote

Yo who IS this guy? He's got implementations for everything! How is anyone that productive?

3

femboyxx98 t1_jc601pw wrote

The actual implementation of most models is quite simple, and he often reuses the same building blocks. The challenge is obtaining the dataset and actually training the models (plus the hyperparameter search), and he doesn’t provide any trained weights himself, so it’s hard to know if his implementations even work out of the box.
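
To illustrate, here is the kind of small reusable block those repos are typically built from: a standard sinusoidal timestep embedding (a generic version, not copied from any particular implementation):

```python
# Standard sinusoidal timestep embedding used by most DDPM-style repos.
# Generic illustration of a reusable building block, not from any specific codebase.
import math
import torch
from torch import nn

class SinusoidalTimeEmbedding(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.dim = dim

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (batch,) diffusion timesteps
        half = self.dim // 2
        freqs = torch.exp(
            -math.log(10000) * torch.arange(half, device=t.device) / (half - 1)
        )
        args = t[:, None].float() * freqs[None, :]
        return torch.cat([args.sin(), args.cos()], dim=-1)  # (batch, dim)
```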

9

therentedmule t1_jc86u50 wrote

Many of the repos are not usable and have click-bait names (e.g., PaLM-RLHF).

1

bhagy7 t1_jc8rdbj wrote

Yes, it is possible to train a small diffusion model conditioned on text captions from scratch on 64x64 images or even smaller. Depending on the complexity of the model and the number of GPUs you are using, it could take anywhere from a few hours to several days.

1