pommedeterresautee t1_iu9zg8x wrote on October 29, 2022 at 6:37 PM

Hi, author of transformer deploy and Kernl here. Whatever option you choose, something to keep in mind next to speed is being able to maintain precision output. I can tell you it’s our number one pain point, on both tensorrt and onnx runtime. We have even built some tooling to help on that, it helped but it’s not yet perfect. Triton inference server is really a cool option with a good documentation.

big_dog_2k OP t1_iua8wgp wrote on October 29, 2022 at 7:45 PM

Thanks. I was aware of this and had some difficultly in the past. Evaluation criteria now compares precision loss across model outputs as well as the performance (accuracy or equivalent) measured on the full system. What methods have you found to mitigate this? I would love to know!

pommedeterresautee t1_iuacchj wrote on October 29, 2022 at 8:10 PM

To mitigate precision issues:

on ONNX related engines, we built a tool to check the output of each node and tag those that won't behave well in fp16 or bf16. Described here: https://www.reddit.com/r/MachineLearning/comments/uwkpmt/p_what_we_learned_by_making_t5large_2x_faster/
on Kernl, we "just" understand what happens as the code is simple (and we wrote it). We choose to not do terrible things to make the inference faster, basically no approx in our kernels, and accumulation is in fp32 (basically it's even better than vanilla mixed precision, and still much faster). IMO that's the most robust approach...

big_dog_2k OP t1_iuaexff wrote on October 29, 2022 at 8:28 PM

Thank you! I think I will try kernl today as well. If I understand correctly, only Ampere generation cards are supported? Also, does it work on any huggingface model or are there still exceptions?

pommedeterresautee t1_iuaodj2 wrote on October 29, 2022 at 9:36 PM

Yes for Ampere.

For HF models, the Kernels will work for most of them out of the box but you need to have search replace patterns for your specific architecture. That's why we do not have our own implementations of X and Y.

Check https://github.com/ELS-RD/kernl/blob/main/src/kernl/optimizer/linear.py for an example.

big_dog_2k OP t1_iuaw55q wrote on October 29, 2022 at 10:35 PM

Great. I might try this out as I like the direction this is going plus it seems like Pytorch is heading in a similar way. I'll let you know if I have questions or I will raise them on github. I appreciate all the information!