Submitted by Just0by t3_z9q0pq in MachineLearning

Hi everyone, we just released what is probably the fastest Stable Diffusion implementation to date. The two charts below show that on an A100 GPU, whether PCIe 40GB or SXM 80GB, OneFlow Stable Diffusion leads the performance results compared to other deep learning frameworks/compilers.

GitHub URL: https://github.com/Oneflow-Inc/diffusers/wiki/How-to-Run-OneFlow-Stable-Diffusion

OneFlow URL: https://github.com/Oneflow-Inc/oneflow/


Benchmark charts (A100 PCIe 40GB and SXM 80GB):

https://preview.redd.it/z0r7tgioua3a1.png?width=612&format=png&auto=webp&s=ed1cf29d62adec7082a4cabfe35f0c0012a4a7a7

https://preview.redd.it/9nntibfpua3a1.png?width=612&format=png&auto=webp&s=b7cd03cebca7133b84d6d33bf0ac9e6cae8df4ee

Before that, on November 7th, OneFlow brought Stable Diffusion into the era of "one-second generation" for the first time. On an A100 SXM 80GB, OneFlow Stable Diffusion reached a groundbreaking inference speed of 50 it/s, which means the 50 sampling steps required to generate an image finish in about one second (50 steps / 50 it/s = 1 s). Now, OneFlow has refreshed that SOTA record again.

You might wonder how OneFlow Stable Diffusion achieves this result. OneFlow's compiler plays the pivotal role in accelerating the model: it lets models built against a PyTorch-style frontend run faster on NVIDIA GPUs.
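
To give a feel for how this fits together (a minimal sketch for illustration, not the actual Stable Diffusion acceleration code; TinyNet and TinyGraph are made-up names), OneFlow exposes a PyTorch-aligned eager API, and wrapping a module in oneflow.nn.Graph is the entry point to the compiler:

    import oneflow as flow
    import oneflow.nn as nn

    # An eager module, written the same way it would be in PyTorch.
    class TinyNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = nn.Linear(8, 4)

        def forward(self, x):
            return flow.relu(self.linear(x))

    # nn.Graph wraps the eager module so the compiler can trace it,
    # optimize the computation, and reuse a static execution plan.
    class TinyGraph(nn.Graph):
        def __init__(self, model):
            super().__init__()
            self.model = model

        def build(self, x):
            return self.model(x)

    model = TinyNet().to("cuda")
    graph = TinyGraph(model)
    x = flow.randn(1, 8, device="cuda")
    y = graph(x)  # the first call compiles; later calls reuse the cached plan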

You are welcome to try OneFlow Stable Diffusion and make your own masterpiece using Docker. All you need to do is run the following snippet (it assumes HF_HOME and HUGGING_FACE_HUB_TOKEN are already set in your shell, so the container can reuse your local Hugging Face cache and token):

 docker run --rm -it \
  --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -v ${HF_HOME}:${HF_HOME} \
  -v ${PWD}:${PWD} \
  -w ${PWD} \
  -e HF_HOME=${HF_HOME} \
  -e HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN} \
  oneflowinc/oneflow-sd:cu112 \
  python3 /demos/oneflow-t2i.py # --prompt "a photo of an astronaut riding a horse on mars"
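
If you would rather run it outside Docker, the wiki linked above documents the Python entry point. Usage follows the familiar diffusers pattern; here is a sketch assuming the fork exposes an OneFlowStableDiffusionPipeline mirroring diffusers' StableDiffusionPipeline (check the wiki for the exact, current API):

    import oneflow as flow
    from diffusers import OneFlowStableDiffusionPipeline  # OneFlow's fork of diffusers

    # Downloads weights via your Hugging Face token (hence HUGGING_FACE_HUB_TOKEN above).
    pipe = OneFlowStableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4",
        use_auth_token=True,
        torch_dtype=flow.float16,
    )
    pipe = pipe.to("cuda")

    prompt = "a photo of an astronaut riding a horse on mars"
    with flow.autocast("cuda"):
        image = pipe(prompt).images[0]
        image.save("astronaut.png")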

Check out OneFlow on GitHub. We'd love to hear your feedback!


Comments


Deep-Station-1746 t1_iyi0y4e wrote

> whether it is PCIe 40GB or SXM 80GB

Oh thank god SXM 80GB is supported! I have way too many A100 80GBs just lying around the house, this will help me find some use for them. /s

Also, I might be stretching this a bit, but uh, do you guys happen to also have an under-8GB VRAM model lying around? :)


SnooWalruses3638 t1_iykhjwn wrote

The optimization approach used by OneFlow Stable Diffusion does indeed work on low-end consumer cards.


plocco-tocco t1_iylno87 wrote

I thought it was possible to load SD using around 1 GB of VRAM, right?
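
(For reference, low-VRAM operation comes from stock diffusers features rather than anything OneFlow-specific; a minimal sketch using diffusers' documented memory-saving switches:)

    # Sketch: memory-saving options in stock diffusers (not OneFlow).
    # Actual VRAM use depends on resolution, dtype, and scheduler.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    )
    pipe.enable_attention_slicing()       # compute attention in smaller chunks
    pipe.enable_sequential_cpu_offload()  # keep weights on CPU, stream to GPU (needs accelerate)

    image = pipe("a photo of an astronaut riding a horse on mars").images[0]
    image.save("astronaut.png")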


Evoke_App t1_iykmn0i wrote

Amazing! This will be perfect for the Stable Diffusion API I'm currently developing.

If you're interested, check out my Discord


Accomplished_Sir4770 t1_iym5eio wrote

Just tested this on my 4090 FE under Windows 11 22H2 and WSL2 Ubuntu 22:

Got 43 it/s, compared to 63 it/s with AITemplate. :)
For a single 512x512 img, ofc.


Just0by OP t1_iymiqp4 wrote

Thanks for your feedback. Are you running SD2 with AITemplate?
