pommedeterresautee t1_j7uwa71 wrote

At start the weights will be moved on the GPU. Then during training, the tokenizer will convert your strings to a int64 tensors. They are quite light, and those are moved to GPU during training. What you need is not the fastest CPU but one which can feed your GPU faster that the data it will consume. In GPT2 case, CPU like 7700 won't be an issue. Image or sounds (TTS, ASR) may have more demanding preprocessing during training.


pommedeterresautee OP t1_j7uk761 wrote

I just discovered the project https://github.com/ggerganov/whisper.cpp

As written in another comment, there is no way for (recent) CPU (even ARM ones) to be as fast as (recent) GPU on such big model (the list no GPU support in limitations).


That being said, the project looks super cool, tks for the pointer (I ordered a M2 Max, lots of fun to come :-) )


pommedeterresautee OP t1_j7tk4fx wrote

On large DL models like Whisper large, CPU is never on par with GPUs because CPU is latency oriented hardware and GPU is throughput oriented. The only ways large models are run on CPUs is by reducing the number of operations to perform like by sparsification or pruning.

Moreover, PyTorch is mostly C++ with a Python layer over it (for now at least, PyTorch 2.0 may be a start of change in this architecture). The Python layer brings most of the PyTorch latency.

And then, even C++ engine launching operations on GPU can not be on par with CUDA graphs (most of the time at least), because you have still to send instruction at a time, and there is still some latency overhead associated in running things that way, just much less than Python. With CUDA graphs there is almost none at all.There is a second thing not discussed here, it's that the graph of instructions is optimized.

Main drawback of CG is the memory overhead, you need at least to double the space taken for input tensors. On generative models with K/V cache, it matters as explained in this post. Plus you need to copy input tensors, which offsets a -very-small part of the gains (at least that s what we saw in our tests on Whisper and Bert / Roberta).

That is why TensorRT (a big C++ piece) for instance supports CUDA graphs.

Still, TBH, as you pointed out, the most important thing is that ... it's easier to build and run :-)


pommedeterresautee t1_iuaodj2 wrote

Yes for Ampere.

For HF models, the Kernels will work for most of them out of the box but you need to have search replace patterns for your specific architecture. That's why we do not have our own implementations of X and Y.

Check https://github.com/ELS-RD/kernl/blob/main/src/kernl/optimizer/linear.py for an example.


pommedeterresautee t1_iuacchj wrote

To mitigate precision issues:

  • on ONNX related engines, we built a tool to check the output of each node and tag those that won't behave well in fp16 or bf16. Described here: https://www.reddit.com/r/MachineLearning/comments/uwkpmt/p_what_we_learned_by_making_t5large_2x_faster/
  • on Kernl, we "just" understand what happens as the code is simple (and we wrote it). We choose to not do terrible things to make the inference faster, basically no approx in our kernels, and accumulation is in fp32 (basically it's even better than vanilla mixed precision, and still much faster). IMO that's the most robust approach...

pommedeterresautee t1_iu9zg8x wrote

Hi, author of transformer deploy and Kernl here. Whatever option you choose, something to keep in mind next to speed is being able to maintain precision output. I can tell you it’s our number one pain point, on both tensorrt and onnx runtime. We have even built some tooling to help on that, it helped but it’s not yet perfect. Triton inference server is really a cool option with a good documentation.


pommedeterresautee t1_iu9vc6f wrote

Hi, I am one of the authors of transformer deploy. I have seen you have copied most of the files for the transformer part. That’s really cool, I really appreciate you kept the licenses, may I ask you to cite our work in the Readme ?

Moreover, if I may, why did you copied instead of just importing a dependency? You would get the maintenance for free :-)


pommedeterresautee OP t1_itv3bu7 wrote

Yeah, it doesn't make sense to me either. Also I was expecting a bit better speedup (regarding those shared on the PyTorch dev forum). I tried several combinations of params (enabling the disabled optimizations) but they were either broken (eg matmul ops template) or making things slower.

Scripts are here: https://github.com/ELS-RD/kernl/tree/main/experimental/benchmarks

Let me know if you find something suspicious.


pommedeterresautee OP t1_ituya3z wrote

I have not used Spacy since years but my understanding is that for large models they leverage Hugging Face library (https://spacy.io/universe/project/spacy-transformers), so I would say it should work out of the box, the only thing is to catch the model instance and override it with the optimized version (it will take the very same input).

Maybe a redditer with more Spacy knowledge than I have can validate the approach...


pommedeterresautee OP t1_itu4mr6 wrote

>Why is it that we don't see any projects with similar speedups using custom CUDA kernels or custom ONNX operators?

To be honest, we had the very same question :-)

CUDA is powerful... and verbose. To target several generations of hardware you need some deep knowledge of their characteristics. I have many times followed people from Microsoft on a PR implementing some new model, it takes them often 1 month or more. On TensorRT I suppose it's even harder as they generate code but hey, it's a black box. For best perf, CUDA code could be good, but you need nvcc to generate the right set of PTX instructions to reach peak perf which is not always the case from what I saw.

Hopefully, people of Nvidia working on Cutlass try to make those things easier by taking care of the lowest level of Cuda implementations. The lib is not, right now, what you would call, easy to grasp but you really learn a lot by working with it (much more than starting from scratch as you see what is the right way to implement stuff).

There are several reasons why you don't see more Triton:

- many people work with it but not in OSS (Anthropic, OpenAI, etc.). You can guess through issues and repo stars that the language is growing faster and faster since a few months

- educative material ... could be more smooth, it's a bit first tuto (add 2 vecs) is boringly simple, on matmul one there is a block you need to look during long minutes to understand what it does, and fused attention, it took us days to understand each line... and realize that it was not really the Flash Attention paper (like one of us implemented the paper, the other worked on Triton example and we were arguing during days about everything until we realized that it was not parallelized at the same level...).

Things will change, PyTorch has choose Triton language as their default one to compile GPU models for future PyTorch version (I guess version 1.14, not sure). More about it here -> https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747

There are certainly other reasons (like big corps can't rely on other big corps techno without some guarantees, etc.) but I think those above are very important explanations.

To be honest, we have been very surprised by the speedups ourselves, beating TensorRT on long sequences was definitely far above our objectives. Even more crazy when you think we have still margins for more speedups... (like we don't yet tuned blocks sizes on some kernels, etc.)

Let's see where it brings us...