pommedeterresautee OP t1_ittyyn3 wrote

Kepler gen is a bit old, but we may increase hardware support in the future.

First, Triton is going through a big rewrite, and some of the bugs that forced us to drop support for older devices are expected to be fixed; of course, nothing is 100% sure.

Moreover, we plan to (re)explore CUTLASS, which supports at least Tesla-generation hardware (though they said their *new* work will only target >= Ampere devices).


pommedeterresautee OP t1_itty9y4 wrote

Yes we are!

In the post there is a link to a T5 notebook; in a quick test, the speedup on T5 was really high (6X). And it's just the beginning. The existing kernels probably already work with most generative language models (GPT-2, etc.); we just need to write replacement patterns (to find the matching PyTorch subgraph and replace it with our kernels).

T5 notebook: https://github.com/ELS-RD/kernl/blob/main/tutorial/t5%20e2e.ipynb

We are currently working on RMSNorm, a kind of simplified LayerNorm used in T5 (the kernel is done and merged; we are now focusing on the replacement pattern).
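For context, RMSNorm drops LayerNorm's mean-subtraction and bias and just rescales each vector by its root mean square, which makes it cheaper. A minimal pure-Python sketch of the math (the actual Kernl kernel is written in Triton, this is only to show what it computes):

```python
import math

def rms_norm(x, weight, eps=1e-6):
    # Root mean square of the input vector (eps avoids division by zero).
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    # Unlike LayerNorm: no mean subtraction, no bias, just scale.
    return [w * v / rms for w, v in zip(weight, x)]

# With unit weights, only the normalization remains:
print(rms_norm([3.0, 4.0], [1.0, 1.0]))
```

Fewer reductions per row (one instead of LayerNorm's two) is part of why fusing it pays off so well.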

Quite surprisingly, RMSNorm brings a big, unexpected speedup on top of what we already had! If you want to follow this work: https://github.com/ELS-RD/kernl/pull/107

If you can't wait to try these kernels on your own model, there is a section in the project README which explains how to write a replacement pattern; it should be quite easy.
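To make "replacement pattern" concrete: the general mechanism is torch.fx graph rewriting, where a traced subgraph is searched for and swapped out. This is only an illustration of that mechanism with a toy pattern, not Kernl's actual patterns or API:

```python
import torch
import torch.fx
from torch.fx import subgraph_rewriter

class M(torch.nn.Module):
    def forward(self, x):
        # torch.add(x, x) is the subgraph we will search for
        return torch.add(x, x) + 1.0

def pattern(x):
    return torch.add(x, x)

def replacement(x):
    # Stand-in for a call into a fused custom kernel
    return x * 2

gm = torch.fx.symbolic_trace(M())
subgraph_rewriter.replace_pattern(gm, pattern, replacement)
print(gm.code)  # the add-with-itself is now a multiply
```

Kernl does the same kind of search-and-replace, except the replacement dispatches to a Triton kernel instead of another PyTorch op.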


pommedeterresautee OP t1_ittsubj wrote

Thanks a lot for *your* work and your message :-)

Regarding the bugs, so far we have mostly been able to work around them; we are following the MLIR rewrite with lots of excitement and trying to prepare ourselves.

I really wonder what will happen to the ML community when PyTorch releases TorchDynamo / Inductor and so many people start using Triton in their day-to-day work. Then tens of thousands of people or more, with different backgrounds, may start writing kernels...

As they say, what a time to be alive!