Submitted by cloneofsimo t3_zfkqjh in MachineLearning


TLDR: People use DreamBooth or textual inversion to fine-tune their own Stable Diffusion models. There is a better way: use LoRA to fine-tune more than twice as fast, with the end result being less than 4MB. A dedicated CLI, package, and pre-trained models are available at https://github.com/cloneofsimo/lora


Fine-tuned LoRA on Pixar footage. Inspired by modern-disney-diffusion.


Fine-tuned LoRA on pop-art style.

Thanks to the generous work of Stability AI and Hugging Face, many people have enjoyed fine-tuning Stable Diffusion models to fit their needs and generate higher-fidelity images. However, the fine-tuning process is very slow, and it is not easy to find a good balance between the number of steps and the quality of the results.

Also, the final result (a fully fine-tuned model) is very large. Consequently, merging checkpoints to find a user's best fit is a painstakingly SSD-consuming process. Some people work with textual inversion as an alternative, but this is clearly suboptimal: textual inversion only learns a small word embedding, and the final images are not as good as those from a fully fine-tuned model.

I've managed to make an alternative work out pretty well with Stable Diffusion: adapters. Parameter-efficient adaptation has been a thing for quite a long time now, and LoRA in particular seems to work robustly in many scenarios according to multiple studies (https://arxiv.org/abs/2112.06825, https://arxiv.org/abs/2203.16329).

LoRA was originally proposed for fine-tuning LLMs, but it is a model-agnostic method, as long as there is room for a low-rank decomposition (which literally every linear layer has). No one seems to have tried it on Stable Diffusion, other than perhaps NovelAI with their hypernetworks (and I'm not sure they count, because they used a different form of adapter).
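To give a feel for the idea, here is a minimal sketch of a LoRA-style adapter wrapped around a frozen linear layer. This is illustrative only, not the exact code in the repo (class and parameter names here are hypothetical):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal low-rank adapter around a frozen nn.Linear (illustrative sketch)."""
    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)              # freeze the original weights
        self.lora_down = nn.Linear(base.in_features, rank, bias=False)
        self.lora_up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.normal_(self.lora_down.weight, std=1.0 / rank)
        nn.init.zeros_(self.lora_up.weight)      # zero init: no change at the start of training
        self.scale = scale

    def forward(self, x):
        # original output plus a low-rank correction; only lora_down/lora_up are trained
        return self.base(x) + self.scale * self.lora_up(self.lora_down(x))
```

In practice you would swap something like this into the attention projections of the UNet and train only the tiny `lora_down`/`lora_up` matrices, which is where the few-MB checkpoint size comes from.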

# But is it really good though?

I've tried my best to validate my answer: yes, it's sometimes even better than full fine-tuning. Note that even though we are only fine-tuning ~3MB of parameters, beating full fine-tuning is not surprising: the original paper's benchmarks showed similar results.

What do I mean by better? Well, I could have used a zero-shot FID score on some shifted dataset, but that would literally take ages, since generating 50,000 images on a single 3090 takes forever.

Instead, I've used Kernel Inception Distance (https://arxiv.org/abs/1801.01401), which has a small standard deviation and can be used reliably as a metric. For the shifted dataset, I gathered 2,358 icon images and trained for 12,000 steps with both full fine-tuning and LoRA fine-tuning. The end result is as follows:
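For reference, KID can be computed with an off-the-shelf implementation. This is a sketch assuming torchmetrics' KernelInceptionDistance, which is not necessarily what produced the numbers above; the random tensors are stand-ins for real and generated image batches:

```python
import torch
from torchmetrics.image.kid import KernelInceptionDistance

# subset_size must be no larger than the smaller of the two image sets
kid = KernelInceptionDistance(subset_size=500)

real_images = torch.randint(0, 255, (1000, 3, 299, 299), dtype=torch.uint8)  # stand-in for the icon dataset
fake_images = torch.randint(0, 255, (1000, 3, 299, 299), dtype=torch.uint8)  # stand-in for generated samples

kid.update(real_images, real=True)
kid.update(fake_images, real=False)
kid_mean, kid_std = kid.compute()  # KID reports a standard deviation, unlike a single FID number
print(kid_mean, kid_std)
```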


LoRA 0.5 means only half of the LoRA update is merged into the original model. All runs were initialized from Stable Diffusion version 2.0.
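Conceptually, "merging half" means folding the low-rank update into the base weights with a 0.5 multiplier. A hypothetical sketch (names and shapes are mine, not the repo's merge code):

```python
import torch

@torch.no_grad()
def merge_lora(base_weight: torch.Tensor, lora_up: torch.Tensor,
               lora_down: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Fold a fraction `alpha` of the low-rank update into a frozen weight matrix.

    Shapes (hypothetical): base_weight (out, in), lora_up (out, r), lora_down (r, in).
    alpha=1.0 applies the full LoRA; alpha=0.5 corresponds to the 'LoRA 0.5' run above.
    """
    return base_weight + alpha * (lora_up @ lora_down)
```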

LoRA clearly beats full fine-tuning in terms of KID. But in the end, perceptual results are all that matter, and I think end users will prove its effectiveness. I haven't had enough time to play with these to say anything conclusive about superiority, but I did train LoRA on three different datasets (vector illustrations, Disney style, pop-art style), all available in my repo. The end results seem pleasing enough to validate the perceptual quality.

# How fast is it?

Tested on a 3090 with a 5950X CPU: LoRA takes 36 minutes for 12,000 steps, while full fine-tuning takes 1 hour 20 minutes. That is more than twice the speed. You also save most of Adam's optimizer memory, and since most parameters don't require gradients, that's extra VRAM saved as well.
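The memory saving follows from only handing the tiny LoRA matrices to the optimizer, so gradients and Adam's two moment buffers are never allocated for the frozen base weights. A toy, self-contained sketch of the pattern (a stand-in module, not the repo's training script):

```python
import torch
import torch.nn as nn

# Toy stand-in for the UNet: a frozen base layer plus a LoRA-style pair of small layers.
model = nn.ModuleDict({
    "base": nn.Linear(320, 320),
    "lora_down": nn.Linear(320, 4, bias=False),
    "lora_up": nn.Linear(4, 320, bias=False),
})

# Freeze everything, then re-enable grads only for the LoRA parameters. Adam keeps two
# extra buffers per trainable parameter, so optimizer state shrinks along with the grads.
model.requires_grad_(False)
lora_params = [p for name, p in model.named_parameters() if "lora_" in name]
for p in lora_params:
    p.requires_grad_(True)

optimizer = torch.optim.AdamW(lora_params, lr=1e-4)
```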

Contributions are welcome! This repo has only been tested on a Linux machine, so if something doesn't work, please open an Issue/PR. If you've managed to train your own LoRA model, please share it!

114

Comments


LetterRip t1_izdam40 wrote

Just tried this and it ran great on a 6GB VRAM card on a laptop with only 16GB of RAM (barely fit into VRAM - using bitsandbytes and xformers I think). I've only tried the corgi example but it seemed to work fine. Trying it with a person now.

9

cloneofsimo OP t1_izdlve0 wrote

Glad it worked for you with such small memory constraints!

2

LetterRip t1_izdm55i wrote

> Glad it worked for you with such small memory constraints!

Currently training image size 768, and accumulation steps=2.

If steps is set to 2000, will it be going to 4000? It didn't stop at 2000 as expected and is currently over 3500; figured I'd wait till over 4000 to kill it in case the accumulation steps act as a multiplier. (It went to 3718 and quit, right after I wrote the above.)

2

Teotz t1_izjzdve wrote

Don't leave us hanging!!! :)

How did the training go with a person?

1

LetterRip t1_izksf4k wrote

It is working, but I need to use prior preservation loss, otherwise all of the words in the phrase have the concept bleed into them. So I'm generating photos for the preservation loss now.

1

LetterRip t1_izm8rkq wrote

It did work, but now I can no longer launch LoRA training even at 768 or 512 (CUDA VRAM exceeded), only 256. No idea what changed.

1

JanssonsFrestelse t1_j0l89ve wrote

Same here with 8GB VRAM, although it looks like I can't use mixed_precision=fp16 with my RTX 2070, so that might be why.

1

hentieDesu t1_izebuz3 wrote

Can you train the model with pics of people's faces like the original Dreambooth?

I will give it a try regardless. Thx! I'll update you guys with the results.

3

Why_Soooo_Serious t1_izhwqbu wrote

any guide on how to use this locally or colab?

1

sam__izdat t1_izi6p0a wrote

The repo's README is literally the first link in the post.

1

Why_Soooo_Serious t1_izia31z wrote

I did check the repo, but was hoping for an easier-to-follow, less technical method, or a Colab notebook like the DreamBooth ones.

2

sam__izdat t1_iziau6e wrote

What are you having trouble following? I'm not trying to be rude, but it's already a less technical method, because HF's diffusers and accelerate stuff will download everything for you and set it all up. I'd rather it were a little more technical, because it's a bit of a black box.

I was having problems with unhelpful error messages until I updated transformers. I'm still having CUDA illegal memory access errors at the start of training, but I think that's because support for old Tesla GPUs is just fading -- had the same issue with new pytorch trying to run any SD in full precision.

1

[deleted] t1_j0e1sfq wrote

Oh fuck you man some people need more help than others, what a pathetic answer.

Remember you had to start somewhere too

1

sam__izdat t1_j0e20q4 wrote

I started by reading the documentation.

1

[deleted] t1_j0e29x1 wrote

Yeah so did I but there's a fuckton of knowledge out there and it gets overwhelming and confusing for new people trying to figure it out, what a fucking dick answer to just be like "go look for it yourself"

1

sam__izdat t1_j0e2kcd wrote

You don't have to look for it. The documentation is right there.

1

LetterRip t1_izlburh wrote

Yes you can. I haven't got great results yet, but haven't done a custom model before this.

1

ThatInternetGuy t1_izenxjo wrote

This could be a great middle ground between textual inversion and full-blown DreamBooth. I think it could benefit from saving the text encoder too (about 250MB at half precision).

1

johnslegers t1_izexocv wrote

End result being less than 4MB?

So this means the finetuned content is saved separately?

What if I don't want that? What if I want it to be merged with the model, as is the case for Dreambooth training?

Is there a way to merge the trained concept with the model itself?

1

PrimaCora t1_izgzw9a wrote

It's in the repo, but yeah, there's a way to merge it into models, and to merge multiple DreamBooth trainings into one.

1

johnslegers t1_izh1tue wrote

Oh, wow, that changes things.

Thanks for the info.

Definitely will need to check out LoRA, then...

1

yupignome t1_izfd87n wrote

this looks great, but needs more documentation, as running it as it is doesn't work

1

Desuka15 t1_izoh1qo wrote

Can you help me with this? I’m a bit lost on it. Please pm me.

1