Submitted by pommedeterresautee t3_ydqmjp in MachineLearning

We are releasing Kernl under the Apache 2 license, a library that makes PyTorch model inference significantly faster. With one line of code we applied the optimizations and made Bert up to 12X faster than the Hugging Face baseline. T5 is also covered in this first release (> 6X speedup in generation, and we are still only halfway through the optimizations!). This has been possible because we wrote custom GPU kernels in Triton, OpenAI's new GPU programming language, and leveraged TorchDynamo.
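
For reference, the "one line" looks something like this (a sketch only: `optimize_model` is the entry point mentioned further down this thread, but the exact import path and call pattern here are assumptions, and an Ampere-or-newer GPU is required):

```python
# Sketch of Kernl usage -- `optimize_model` is named later in this thread,
# but the import path shown here is an assumption, not verbatim from the docs.
# Requires an Ampere-or-newer GPU.
import torch
from transformers import AutoModel
from kernl.model_optimization import optimize_model

model = AutoModel.from_pretrained("bert-base-uncased").eval().cuda()
optimize_model(model)  # the "1 line": swaps supported ops for Triton kernels

inputs = {
    "input_ids": torch.ones((1, 128), dtype=torch.long, device="cuda"),
    "attention_mask": torch.ones((1, 128), dtype=torch.long, device="cuda"),
}
with torch.inference_mode(), torch.cuda.amp.autocast():
    outputs = model(**inputs)
```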

Project link:

E2E demo notebooks: XNLI classification, T5 generation

Benchmarks were run on an RTX 3090 GPU and a 12-core Intel CPU; more info below

On long input sequences, Kernl is most of the time the fastest inference engine, and it is close to Nvidia TensorRT on the shortest ones. Keep in mind that Bert is one of the most optimized models out there, and most of the tools listed above are very mature.

What is interesting is not whether Kernl is the fastest engine, but that the kernel code is short, easy to understand, and easy to modify. We have even added a Triton debugger and a tool (based on Fx) to ease kernel replacement, so there is no need to modify the PyTorch model source code.
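
The replacement idea can be illustrated with a toy graph rewriter (plain Python with hypothetical op names; the real tool operates on torch.fx graphs, not flat lists):

```python
# Toy illustration of pattern-based kernel replacement. The real Kernl tool
# works on torch.fx graphs; the flat list and op names here are hypothetical.
def replace_pattern(graph, pattern, replacement):
    """Replace every occurrence of `pattern` (a list of op names) in
    `graph` (a flat list of op names) with `replacement`."""
    out, i = [], 0
    while i < len(graph):
        if graph[i:i + len(pattern)] == pattern:
            out.extend(replacement)  # swap in the fused kernel
            i += len(pattern)
        else:
            out.append(graph[i])
            i += 1
    return out

graph = ["embedding", "matmul", "softmax", "matmul", "add", "layer_norm"]
fused = replace_pattern(graph, ["matmul", "softmax", "matmul"],
                        ["triton_fused_attention"])
print(fused)  # ['embedding', 'triton_fused_attention', 'add', 'layer_norm']
```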

Staying in the comfort of PyTorch / Python preserves dynamic behavior, debugging, and iteration speed. Teams designing/training a transformer model (even a custom one) can take care of deployment themselves, without relying on advanced GPU knowledge (e.g. CUDA programming, dedicated inference engine APIs, etc.).

Recently released models relying on slightly modified transformer architectures are rarely accelerated by traditional inference engines; we have to wait months to years for someone (usually the inference engine maintainers) to write the required custom CUDA kernels. Because Kernl's custom kernels are written in OpenAI's Triton language, anyone without CUDA experience can easily modify them: the Triton API is simple and close to NumPy's. Kernel source code is significantly shorter than an equivalent CUDA implementation (< 200 LoC per kernel), and a basic understanding of how GPUs work is enough. We are also releasing a few tutorials we initially wrote to onboard colleagues on the project. We hope you will find them useful.
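
To give a flavor of that programming model, here is a pure-Python mimic of Triton's introductory vector-add kernel: on the GPU, a grid of "programs" runs in parallel, each handling one block of elements, with a mask guarding the partial block at the end.

```python
# Pure-Python mimic of Triton's vector-add tutorial: a grid of "programs",
# each handling one BLOCK of elements, with a mask for the tail block.
BLOCK = 4

def add_kernel(x, y, out, pid):
    for i in range(pid * BLOCK, (pid + 1) * BLOCK):
        if i < len(x):  # the "mask" guarding the last, partial block
            out[i] = x[i] + y[i]

def launch(x, y):
    out = [0] * len(x)
    grid = -(-len(x) // BLOCK)  # ceiling division, like triton.cdiv
    for pid in range(grid):     # on a GPU these programs run in parallel
        add_kernel(x, y, out, pid)
    return out

print(launch([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))  # [11, 22, 33, 44, 55]
```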

Best of all, because we stay in the PyTorch / Python ecosystem, our roadmap also includes enabling training with those custom kernels. In particular, the Flash Attention kernel should bring a 2-4X speedup and support for very long sequences on a single GPU (the paper's authors went as far as 16K tokens, instead of the traditional 512 or 2048 limits)! See below for more info.
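
A quick back-of-the-envelope calculation shows why the standard attention formula hits a wall on long sequences: materializing the full n x n attention matrix grows quadratically, which Flash Attention avoids by computing the softmax blockwise and never storing the whole matrix.

```python
# Memory needed for the full n x n attention score matrix in FP16
# (2 bytes per score), per head and per batch element -- the quantity
# Flash Attention never materializes.
def attn_matrix_mib(seq_len, bytes_per_el=2):
    return seq_len * seq_len * bytes_per_el / 2**20

for n in (512, 2048, 16384):
    print(f"{n:>6} tokens -> {attn_matrix_mib(n):8.1f} MiB per head")
# 512 -> 0.5 MiB, 2048 -> 8.0 MiB, 16384 -> 512.0 MiB per head
```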

IMPORTANT: Benchmarking is a difficult art; we tried to be as fair as possible. Please note that:

  • Timings are based on wall-clock time, and we show speedup over the baseline because it is easier to compare across input shapes;
  • When we had to choose between speed and output precision, we always chose precision;
  • The HF baseline, CUDA graphs, Inductor and Kernl run in mixed precision; AITemplate, ONNX Runtime, DeepSpeed and TensorRT have their weights converted to FP16;
  • Accumulation is done in FP32 for AITemplate and Kernl; TensorRT is likely doing it in FP16;
  • CUDA graphs are enabled for all engines except the baseline, nvFuser, and ONNX Runtime, which has limited support for them;
  • For Kernl and AITemplate, fast GELU has been manually disabled (TensorRT is likely using fast GELU);
  • AITemplate measurements should be taken with a grain of salt: it doesn't handle the attention mask, which means 1/ batched inference can't be used in most scenarios (no padding support), and 2/ it skips a few operations in a kernel that can be compute-bound (depending on sequence length). Put differently, supporting the attention mask may make it slower, in particular on long sequences. AITemplate attention mask support will come in a future release;
  • For best TensorRT performance, we built 3 models, one per batch size. AITemplate will support dynamic shapes in a future release, so for now we made one model per input shape;
  • Inductor is at the prototype stage; performance may improve when it is released. None of the optimizations that are disabled by default worked during our tests.
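
For clarity, the speedup figures above are simple wall-clock ratios against the baseline (the timings below are illustrative placeholders, not the real measurements):

```python
# How "speedup over baseline" is computed from wall-clock timings.
# The numbers here are illustrative placeholders, not the benchmark results.
def speedup(baseline_s, engine_s):
    return baseline_s / engine_s

timings_s = {"baseline": 0.120, "cuda_graphs": 0.060, "kernl": 0.010}
for name, t in timings_s.items():
    print(f"{name:12s} {speedup(timings_s['baseline'], t):5.1f}X")
```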

As you can see, CUDA graphs erase all CPU overhead (Python-related overhead, for instance): sometimes there is no need to rely on C++/Rust to be fast! Fused kernels (in CUDA or Triton) mostly matter on longer input sequence lengths. We are aware that there is still some low-hanging fruit to improve Kernl performance without sacrificing output precision; this is just the first release. More info about how it works here.


We work for Lefebvre Sarrut, a leading European legal publisher. Several of our products include transformer models in latency-sensitive scenarios (search, content recommendation). So far, ONNX Runtime and TensorRT have served us well, and we learned interesting patterns along the way that we shared with the community through an open-source library called transformer-deploy. However, recent changes in our environment made our needs evolve:

  • New teams in the group are deploying transformer models in prod directly with PyTorch. ONNX Runtime poses too many challenges for them (like debugging precision issues in FP16), and with its expert-oriented inference API, TensorRT was not even an option;
  • We are exploring applications of large generative language models in the legal industry, and we need easier support for dynamic behavior plus more efficient quantization; the creative approaches for that purpose we shared here on Reddit proved to be more fragile than we initially thought;
  • New business opportunities would open up if we were able to train models supporting large contexts (> 5K tokens).

On a more personal note, I enjoyed writing kernels and understanding the low-level computation of transformers much more than mastering multiple complicated tool APIs and their environments. It really changed my intuition and understanding of how these models work, scale, etc. It's not just OpenAI Triton; we also did some prototyping in C++ / CUDA / Cutlass, and the effect was the same: it's all about digging down to a lower level. And still, the effort is IMO quite limited relative to the benefits. If you have some interest in machine learning engineering, you should probably give those tools a try.


Our road map includes the following elements (in no particular order):

  • Faster warmup
  • Ragged inference (no computation lost to padding)
  • Training support (with long sequence support)
  • Multi-GPU (support for multiple parallelization schemes)
  • Quantization (PTQ)
  • A new batch of Cutlass kernel tests
  • Improved hardware support (>= Ampere for now)
  • More tutorials

Regarding training, if you want to help, we have written an issue with all the required pointers, it should be very doable:

On top of speed, one of the main benefits is support for very long sequences (16K tokens without changing the attention formula), as it's based on Flash Attention.

Also, note that a future version of PyTorch will include Inductor. It means that all PyTorch users will have the option to compile to Triton and get around 1.7X faster training.

A big thank you to Nvidia people who advised us during this project.




ptillet t1_ittr7rx wrote

As the creator/maintainer of Triton, I find this very exciting! Thanks for putting in all that work, and sorry for all the bugs you may have faced along the way -- we are working hard on re-designing the whole thing to make it more stable in the long run!

>On a more personal note, I enjoyed much more writing kernels and understanding low level computation of transformers than mastering multiple complicated tools API and their environments.

This is exactly why I started the project in the first place, and it is very rewarding to read this. Really glad that this project has helped people gain a deeper understanding of how neural networks computations get parallelized for execution on GPUs. :-)


pommedeterresautee OP t1_ittsubj wrote

Thank you a lot for *your* work and your message :-)

Regarding the bugs, so far they have been mostly workable. We are following the MLIR rewrite with lots of excitement and trying to prepare ourselves for it.

I am really wondering what will happen to the ML community when PyTorch releases TorchDynamo / Inductor and so many people start using Triton in their day-to-day work. Then tens of thousands of people or more, with different backgrounds, may start writing kernels...

As they say, what a time to be alive!


ginsunuva t1_ittv3ew wrote

So now we have TensorRT on the Triton inference server, and Triton on the Kernl inference server


pommedeterresautee OP t1_ittvxdq wrote

We are actively looking for new names that could make things even more confusing.

If you have some ideas, please share them with us 🙃


juliensalinas t1_ittw69q wrote

Congrats and thanks a lot u/pommedeterresautee for this amazing project. As usual, your in-depth explanations about low level machine learning are very insightful.

Transformer Deploy was already very exciting, and this new project seems even more promising!

Can't wait to try it for real and see if we can use it behind NLP Cloud somehow.


pommedeterresautee OP t1_ittwob9 wrote

Thank you Julien for your kind message.

We would be very happy to receive your feedback in the context of the NLP Cloud SaaS: does it cover some of your needs, what would you expect that is not here yet or in the roadmap, pesky bugs, etc.


ganzzahl t1_ittx10s wrote

Any work on using this with autoregressive encoder-decoder transformers? I always love reading and trying out your work!


pommedeterresautee OP t1_itty9y4 wrote

Yes we are!

In the post there is a link to the T5 notebook. We did a rapid test, and the speedup on T5 is really high (6X), and it's just the beginning. Existing kernels probably already work with most generative language models (GPT-2, etc.); we just need to write the replacement patterns (to search for the PyTorch part and replace it with our kernels).

T5 notebook:

We are currently working on RMSNorm, a kind of simplified LayerNorm used in T5 (the kernel is done and merged; we are focusing on the replacement pattern).

Quite surprisingly, RMSNorm brings a huge, unexpected speedup on top of what we already had! If you want to follow this work:
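
For those curious, the simplification is easy to see side by side (plain-Python reference versions, without the learned scale/bias parameters: LayerNorm centers by the mean and rescales, RMSNorm skips the centering and only divides by the root mean square):

```python
import math

# Reference implementations (plain Python, no learned scale/bias):
# LayerNorm centers and rescales; RMSNorm, used by T5, skips the mean
# subtraction entirely -- less work per row.
def layer_norm(x, eps=1e-6):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

x = [1.0, 2.0, 3.0, 4.0]
print([round(v, 3) for v in rms_norm(x)])    # rescaled, but not centered
print([round(v, 3) for v in layer_norm(x)])  # zero-mean output
```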

If you can't wait to use these kernels on your model, there is a section in the project README which explains how to write a replacement pattern; it should be quite easy.


ganzzahl t1_ittzcuq wrote

Ahh, I missed that when reading your post. What a time to be alive!

My quick question for you is just this: Why is it that we don't see any projects with similar speedups using custom CUDA kernels or custom ONNX operators? Is there an inherent speed advantage to using Triton, or is the high barrier to entry of writing CUDA kernels the reason that no one has "gotten around" to doing something like this in pure CUDA?


pommedeterresautee OP t1_itu4mr6 wrote

>Why is it that we don't see any projects with similar speedups using custom CUDA kernels or custom ONNX operators?

To be honest, we had the very same question :-)

CUDA is powerful... and verbose. To target several generations of hardware, you need deep knowledge of their characteristics. I have many times followed people from Microsoft on a PR implementing some new model; it often takes them a month or more. For TensorRT I suppose it's even harder, as it generates code, but hey, it's a black box. For best performance, CUDA code can be good, but you need nvcc to generate the right set of PTX instructions to reach peak perf, which is not always the case from what I saw.

Hopefully, the people at Nvidia working on Cutlass are trying to make those things easier by taking care of the lowest levels of the CUDA implementation. The lib is not, right now, what you would call easy to grasp, but you really learn a lot by working with it (much more than by starting from scratch, as you see the right way to implement things).

There are several reasons why you don't see more Triton:

- many people work with it, but not in OSS (Anthropic, OpenAI, etc.). You can guess through issues and repo stars that the language has been growing faster and faster for a few months

- the educational material... could be smoother. The first tutorial (adding 2 vectors) is boringly simple; in the matmul one there is a block you need to stare at for long minutes to understand what it does; and for fused attention, it took us days to understand each line... and to realize that it was not really the Flash Attention paper (one of us implemented the paper, the other worked from the Triton example, and we argued for days about everything until we realized the two were not parallelized at the same level...)

Things will change: PyTorch has chosen Triton as the default language to compile GPU models for a future PyTorch version (I guess version 1.14, not sure). More about it here ->

There are certainly other reasons (like big corps not being able to rely on another big corp's tech without some guarantees, etc.), but I think those above are the most important explanations.

To be honest, we have been very surprised by the speedups ourselves; beating TensorRT on long sequences was definitely far above our objectives. It's even crazier when you consider that we still have margin for more speedups... (e.g., we haven't yet tuned the block sizes on some kernels)

Let's see where it brings us...


ganzzahl t1_itwsfkh wrote

This is perhaps an entirely dumb question that I will be able to answer for myself after I read through the Triton docs, but I'll ask anyway: Could one implement custom ONNX operators using Triton, or can it only be used in a Python environment?



sam__izdat t1_ittxs83 wrote

> 8.0 (Ampere) or higher is required to install kernl

sobs softly in Kepler


pommedeterresautee OP t1_ittyyn3 wrote

Kepler is a bit old, but we may broaden hardware support in the future.

First, Triton is going through a big rewrite, and it's expected that some of the bugs that blocked us from supporting older devices will be fixed; of course, nothing is 100% sure.

Moreover, we plan to (re)explore Cutlass, which supports hardware at least as old as Tesla (but they said their new work will only target >= Ampere devices).


sam__izdat t1_ittzava wrote

Oh, I absolutely don't expect you folks to support a Tesla from 2014 -- I just kind of crossed my fingers and tried it anyway, on the remote chance that by some miracle it might work, like a magic spell to speed up this slowpoke.


Sylv__ t1_itu2oar wrote

Impressive work! Thank you for open-sourcing it.


gionnelles t1_itujc1u wrote

This is very exciting, my team will be checking this out ASAP. This is fantastic for R&D folks looking to move models towards production with much less effort.


gaetansnl t1_itunddm wrote

Hello! I'm one of the maintainers of Kernl. Thanks for your comment! Don't hesitate to give us feedback, and tell us if we can improve things for your use case :)


seek_it t1_iu1t7d4 wrote

Can I also use this project to improve inference time of projects like yolov5, etc.?


pommedeterresautee OP t1_iu2vg8y wrote

Right now the kernels cover the linear layer, attention, and LayerNorm / RMSNorm, so the effect would be limited outside a transformer or similar architecture. We will increase the number of kernels over time, but convolution is not our priority right now.


Lolologist t1_itux8cs wrote

This all looks very impressive!

I'm not terribly well-versed in the nitty-gritty of ML's underpinnings so forgive me if this is a dumb question but:

How might we apply your speedup to, say, spaCy? Is this something that is dragged and dropped in somewhere?


pommedeterresautee OP t1_ituya3z wrote

I have not used spaCy in years, but my understanding is that for large models it leverages the Hugging Face library, so I would say it should work out of the box; the only thing to do is catch the model instance and override it with the optimized version (it takes the very same input).

Maybe a redditor with more spaCy knowledge than I have can validate the approach...


reSAMpled t1_iuj2yrm wrote

I also haven't used spaCy in a while, but I am pretty sure there is no way to make this work with the -sm, -md or -lg models. What Michaël says should be true for -trf models, but I don't think it will be easy: spacy-transformers already has to wrap HF models so that they expose a thinc API, and you would have to dig deep in there to call Kernl's optimize_model.


programmerChilli t1_itv2ckn wrote

I'm somewhat surprised Inductor performs worse than cudagraphs given Inductor by default should be wrapped behind cudagraphs.


pommedeterresautee OP t1_itv3bu7 wrote

Yeah, it doesn't make sense to me either. Also, I was expecting a slightly better speedup (compared to the numbers shared on the PyTorch dev forum). I tried several combinations of params (enabling the disabled optimizations), but they were either broken (e.g. the matmul op templates) or made things slower.

Scripts are here:

Let me know if you find something suspicious.