Submitted by HPCAI-Tech t3_ysfimk in MachineLearning

Hey folks. We just released a complete open-source solution for accelerating Stable Diffusion pretraining and fine-tuning. It reduces the pretraining cost by 6.5x and the hardware cost of fine-tuning by 7x, while simultaneously speeding up both processes.

Open source address: https://github.com/hpcaitech/ColossalAI/tree/main/examples/images/diffusion

Our codebase for the diffusion models builds heavily on OpenAI's ADM codebase, lucidrains, Stable Diffusion, Lightning, and Hugging Face. Thanks for open-sourcing!

We'd be glad to hear your thoughts about our work!

40

Comments


Flag_Red t1_iw1lntd wrote

It's mentioned a few times in the articles/README for this tool that it enables fine-tuning on consumer hardware. Are there any examples of doing something like this? How long does fine-tuning on a 3080 (or similar) take to teach the model a new concept? What sort of dataset is needed? How does it compare to something like DreamBooth?

I'd love to try fine-tuning on some of the datasets I have lying around, but I'm not sure where to start, or even whether it's really viable on consumer hardware.

5

enryu42 t1_iw2m1nt wrote

Even without any optimizations, it is possible to fine-tune Stable Diffusion on an RTX 3090, even in fp32, with some effort: batch size 2 is achievable by precomputing the latent embeddings and saving some VRAM by not keeping the autoencoder params in memory during training (see the sketch at the end of this comment).

But this is definitely not a "one-button" solution, and requires more effort than using the existing tools like textual inversion/DreamBooth (which are more appropriate for the "teach the model a new concept" use-case).
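For reference, the latent-precomputation trick from the first paragraph looks roughly like this. This is only a minimal sketch using Hugging Face diffusers; the model id is a placeholder and the actual dataset loop is omitted:

```python
# Rough sketch: precompute VAE latents once, so the autoencoder never
# occupies VRAM during the fine-tuning loop itself. Illustrative only.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae"  # placeholder model id
).to("cuda").eval()

@torch.no_grad()
def encode_to_latents(pixel_values):
    """pixel_values: float tensor in [-1, 1], shape (B, 3, 512, 512)."""
    latents = vae.encode(pixel_values.to("cuda")).latent_dist.sample()
    return (latents * 0.18215).cpu()  # 0.18215 is SD v1's latent scaling factor

# One offline pass over the dataset, caching latents to disk, e.g.
#   torch.save(encode_to_latents(batch), f"latents_{i}.pt")
# then drop the VAE so it takes up no VRAM while training the UNet.
demo = encode_to_latents(torch.rand(1, 3, 512, 512) * 2 - 1)
print(demo.shape)  # torch.Size([1, 4, 64, 64])
del vae
torch.cuda.empty_cache()
```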

3

Flag_Red t1_iw2nxte wrote

If I'm not mistaken, full fine-tuning on one 3090 isn't really feasible because of training times. I haven't tried it, but I was under the impression that matching the results of a DreamBooth run would take an unreasonably long time.

DreamBooth gets around this by bootstrapping from a very small number of training examples to learn a single concept. But if I have a few thousand well-labelled images, I should be able to do a fine-tune on them (maybe with some regularisation?) and get better results.
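For what it's worth, the core of such a full fine-tune is just the standard noise-prediction objective on captioned images. A rough sketch using diffusers/transformers, assuming precomputed latents; the model id and hyperparameters are placeholders and any regularisation is left out:

```python
# Minimal sketch of one fine-tuning step on precomputed latents + captions.
import torch
import torch.nn.functional as F
from diffusers import UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"  # placeholder model id
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet").cuda()
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder").cuda().eval()
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
noise_scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)

def train_step(latents, captions):
    # latents: precomputed, already scaled VAE latents on GPU, shape (B, 4, 64, 64)
    tokens = tokenizer(captions, padding="max_length", truncation=True,
                       max_length=tokenizer.model_max_length, return_tensors="pt")
    with torch.no_grad():  # keep the text encoder frozen
        text_emb = text_encoder(tokens.input_ids.cuda())[0]
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy = noise_scheduler.add_noise(latents, noise, t)
    # Predict the added noise and regress against it (epsilon objective).
    loss = F.mse_loss(unet(noisy, t, encoder_hidden_states=text_emb).sample, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```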

2

enryu42 t1_iw2vlwf wrote

Oh, it is totally feasible: I'm getting something around 2.5 training examples/second with vanilla SD without any optimizations (which translates to more than 200k examples per day), which is more than enough for fine-tuning.

I'd still not recommend it for teaching the model new concepts, though; it is more appropriate for transferring the model to new domains (e.g., people have adapted it to anime images this way).

1

EmbarrassedHelp t1_iw0ebyl wrote

Yay! Now all we need are better automatic labeling tools to create datasets for the model.
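Something like BLIP captioning already gets part of the way there. A rough sketch of what bootstrapping labels could look like (illustrative only; the checkpoint name and file path are just examples, not anything from the post above):

```python
# Rough sketch: auto-captioning images with BLIP to bootstrap a labelled dataset.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).cuda()

def caption(path):
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to("cuda")
    out = model.generate(**inputs, max_new_tokens=40)
    return processor.decode(out[0], skip_special_tokens=True)

print(caption("example.jpg"))  # e.g. "a dog sitting on a couch"
```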

3

enryu42 t1_iw2ky7u wrote

Nice! Is there a summary of the optimizations compared to vanilla Stable Diffusion? Looking at the code, I see it uses this instead of AdamW, and FlashAttention instead of standard attention. Did I miss anything?
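For context, by the FlashAttention swap I mean replacing naive softmax attention with a fused kernel along these lines. This is just an illustration using the flash-attn package, not code from the repo:

```python
# Illustrative comparison of naive attention vs. a fused FlashAttention kernel.
import torch
from flash_attn import flash_attn_func

def naive_attention(q, k, v):
    # q, k, v: (batch, seqlen, heads, head_dim); materialises the full attention matrix
    scale = q.shape[-1] ** -0.5
    attn = torch.einsum("bqhd,bkhd->bhqk", q, k) * scale
    return torch.einsum("bhqk,bkhd->bqhd", attn.softmax(dim=-1), v)

def flash_attention(q, k, v):
    # Same contract, but the fused kernel never stores the (seqlen x seqlen) matrix.
    return flash_attn_func(q, k, v, dropout_p=0.0, causal=False)

q = torch.randn(2, 4096, 8, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)
print((naive_attention(q, k, v) - flash_attention(q, k, v)).abs().max())
```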

1

LetterRip t1_iw3rucf wrote

Could you provide details on the comparison with DeepSpeed? What parameters were used, etc.?

Also, does it provide any benefit for single-GPU inference?

1