Submitted by Just0by t3_z9q0pq in MachineLearning

Hi everyone, we just released what is probably the fastest Stable Diffusion implementation to date. The two charts below show that on an A100 GPU, whether PCIe 40GB or SXM 80GB, OneFlow Stable Diffusion leads the performance results compared to other deep learning frameworks/compilers.

GitHub URL: https://github.com/Oneflow-Inc/diffusers/wiki/How-to-Run-OneFlow-Stable-Diffusion

OneFlow URL: https://github.com/Oneflow-Inc/oneflow/


Benchmark charts (A100 PCIe 40GB and SXM 80GB):

https://preview.redd.it/z0r7tgioua3a1.png?width=612&format=png&auto=webp&s=ed1cf29d62adec7082a4cabfe35f0c0012a4a7a7

https://preview.redd.it/9nntibfpua3a1.png?width=612&format=png&auto=webp&s=b7cd03cebca7133b84d6d33bf0ac9e6cae8df4ee

Before that, on November 7th, OneFlow brought Stable Diffusion into the era of "one-second generation" for the first time. On an A100 SXM 80GB, OneFlow Stable Diffusion reached a groundbreaking inference speed of 50 it/s, which means the 50 sampling steps required to generate an image finish in about one second (50 steps / 50 it/s = 1 s). Now, OneFlow has refreshed that SOTA record again.

You might wonder how OneFlow Stable Diffusion achieves this result. OneFlow's compiler plays the pivotal role in accelerating the model: it lets models built against a PyTorch-style frontend run faster on NVIDIA GPUs.
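
To give a feel for how this fits together (a minimal sketch for illustration, not the actual Stable Diffusion acceleration code; TinyNet and TinyGraph are made-up names), OneFlow exposes a PyTorch-aligned eager API, and wrapping a module in oneflow.nn.Graph is the entry point to the compiler:

    import oneflow as flow
    import oneflow.nn as nn

    # An eager module, written the same way it would be in PyTorch.
    class TinyNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = nn.Linear(8, 4)

        def forward(self, x):
            return flow.relu(self.linear(x))

    # nn.Graph wraps the eager module so the compiler can trace it,
    # optimize the computation, and reuse a static execution plan.
    class TinyGraph(nn.Graph):
        def __init__(self, model):
            super().__init__()
            self.model = model

        def build(self, x):
            return self.model(x)

    model = TinyNet().to("cuda")
    graph = TinyGraph(model)
    x = flow.randn(1, 8, device="cuda")
    y = graph(x)  # the first call compiles; later calls reuse the cached plan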

You are welcome to try OneFlow Stable Diffusion and make your own masterpiece using Docker. All you need to do is run the following snippet (it assumes HF_HOME and HUGGING_FACE_HUB_TOKEN are already set in your shell, so the container can reuse your local Hugging Face cache and token):

 docker run --rm -it \
  --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -v ${HF_HOME}:${HF_HOME} \
  -v ${PWD}:${PWD} \
  -w ${PWD} \
  -e HF_HOME=${HF_HOME} \
  -e HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN} \
  oneflowinc/oneflow-sd:cu112 \
  python3 /demos/oneflow-t2i.py # --prompt "a photo of an astronaut riding a horse on mars"
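
If you would rather run it outside Docker, the wiki linked above documents the Python entry point. Usage follows the familiar diffusers pattern; here is a sketch assuming the fork exposes an OneFlowStableDiffusionPipeline mirroring diffusers' StableDiffusionPipeline (check the wiki for the exact, current API):

    import oneflow as flow
    from diffusers import OneFlowStableDiffusionPipeline  # OneFlow's fork of diffusers

    # Downloads weights via your Hugging Face token (hence HUGGING_FACE_HUB_TOKEN above).
    pipe = OneFlowStableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4",
        use_auth_token=True,
        torch_dtype=flow.float16,
    )
    pipe = pipe.to("cuda")

    prompt = "a photo of an astronaut riding a horse on mars"
    with flow.autocast("cuda"):
        image = pipe(prompt).images[0]
        image.save("astronaut.png")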

Check out OneFlow on GitHub. We'd love to hear your feedback!


Comments


Deep-Station-1746 t1_iyi0y4e wrote

> whether it is PCIe 40GB or SXM 80GB

Oh thank god SXM 80GB is supported! I have way too many A100 80GBs just lying around the house, this will help me find some use for them. /s

Also, I might be stretching this a bit, but uh, do you guys happen to also have an under-8GB VRAM model lying around? :)


SnooWalruses3638 t1_iykhjwn wrote

The optimization approach used by OneFlow Stable Diffusion does indeed work on low-end consumer cards.


plocco-tocco t1_iylno87 wrote

I thought it was possible to load SD using around 1 GB of VRAM, right?
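
(For reference, low-VRAM operation comes from stock diffusers features rather than anything OneFlow-specific; a minimal sketch using diffusers' documented memory-saving switches:)

    # Sketch: memory-saving options in stock diffusers (not OneFlow).
    # Actual VRAM use depends on resolution, dtype, and scheduler.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    )
    pipe.enable_attention_slicing()       # compute attention in smaller chunks
    pipe.enable_sequential_cpu_offload()  # keep weights on CPU, stream to GPU (needs accelerate)

    image = pipe("a photo of an astronaut riding a horse on mars").images[0]
    image.save("astronaut.png")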


Evoke_App t1_iykmn0i wrote

Amazing! This will be perfect for the Stable Diffusion API I'm currently developing.

If you're interested, check out my Discord


Accomplished_Sir4770 t1_iym5eio wrote

Just tested this on my 4090 FE under Windows 11 22H2 and WSL2 Ubuntu 22:

Got 43 it/s, compared to 63 it/s with AITemplate. :)
For a single 512x512 img, ofc.


Just0by OP t1_iymiqp4 wrote

Thanks for your feedback. Are you running SD2 with AITemplate?
