fxmarty t1_j8r8inv wrote on February 16, 2023 at 11:21 AM

Reply to comment by qalis in [D] HuggingFace considered harmful to the community. /rant by drinkingsomuchcoffee

Thank you for the feedback. I feel the same: it does not make much sense. My understanding is that the goal is to be compatible with transformers pipelines - but trying to mix ONNX Runtime and PyTorch makes things a bit illogical.

That said, Optimum is an open-source library, and you are very welcome to submit a PR or to make this kind of request in the GitHub issues!

fxmarty OP t1_ixi7sge wrote on November 23, 2022 at 5:09 PM

Reply to comment by killver in [P] BetterTransformer: PyTorch-native free-lunch speedups for Transformer-based models by fxmarty

Not that I know of (at least in the ONNX ecosystem). I would recommend tuning the available arguments: https://github.com/microsoft/onnxruntime/blob/9168e2573836099b841ab41121a6e91f48f45768/onnxruntime/python/tools/quantization/quantize.py#L414
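To make "tuning the available arguments" a bit more concrete, here is a minimal sketch of a small sweep. It assumes dynamic quantization through `onnxruntime.quantization.quantize_dynamic` (the linked line may refer to a different entry point), and it only varies `per_channel` and `reduce_range`. The helper names `candidate_configs` and `quantize_all`, and the output file names, are made up for illustration.

```python
from itertools import product
from pathlib import Path


def candidate_configs():
    """Enumerate a small grid of quantization arguments to try."""
    grid = {
        "per_channel": [False, True],
        "reduce_range": [False, True],
    }
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*grid.values())]


def quantize_all(model_path: str, output_dir: str):
    """Write one quantized model per configuration; return (config, path) pairs."""
    # Imported lazily so the grid can be inspected without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for i, config in enumerate(candidate_configs()):
        out_path = out_dir / f"model_quantized_{i}.onnx"  # hypothetical naming
        quantize_dynamic(
            model_path, str(out_path), weight_type=QuantType.QInt8, **config
        )
        outputs.append((config, out_path))
    return outputs


# List the configurations that would be tried.
for config in candidate_configs():
    print(config)
```

Each quantized model would then need to be evaluated on your own validation data to pick the best accuracy / latency trade-off.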

If you are dealing with a canonical model, feel free to file an issue as well!

fxmarty OP t1_ixi140r wrote on November 23, 2022 at 4:25 PM

Reply to comment by killver in [P] BetterTransformer: PyTorch-native free-lunch speedups for Transformer-based models by fxmarty

Are you doing dynamic or static quantization? Static quantization can be tricky; dynamic quantization is usually more straightforward. Also, if you deal with encoder-decoder models, it could be that quantization error accumulates in the decoder. For the slowdowns you are seeing... there could be many reasons. The first thing you should check is whether running through ONNX Runtime / OpenVINO is at least on par with (if not better than) PyTorch eager. If not, there may be an issue at a higher level (e.g. here). If yes, it could be that your CPU does not support AVX VNNI instructions, for example. Also, depending on batch size and sequence length, the speedups from quantization may vary greatly.
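For the "on par with PyTorch eager" check, a rough sketch of a latency comparison could look like the following. It uses only the standard library; `run_pytorch_eager` and `run_onnxruntime` are placeholder functions standing in for your real inference calls, and the warmup / run counts are arbitrary assumptions.

```python
import statistics
import time
from typing import Callable, Dict


def median_latency(fn: Callable[[], object], warmup: int = 3, runs: int = 20) -> float:
    """Median wall-clock time of fn() in seconds, after a few warmup calls."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def compare(baseline: Callable[[], object], candidate: Callable[[], object]) -> Dict[str, float]:
    """Time both callables and report the speedup of candidate over baseline."""
    base = median_latency(baseline)
    cand = median_latency(candidate)
    return {"baseline_s": base, "candidate_s": cand, "speedup": base / cand}


# Placeholders: replace the bodies with e.g. a PyTorch forward pass and an
# ONNX Runtime / OpenVINO call on the same inputs.
def run_pytorch_eager():
    return sum(i * i for i in range(20000))


def run_onnxruntime():
    return sum(i * i for i in range(10000))


result = compare(run_pytorch_eager, run_onnxruntime)
print(result)
```

Run it with the batch size and sequence length you actually serve, since the speedup can change a lot between settings.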

Yes, the Optimum library's documentation is unfortunately not yet in the best shape. I would be really thankful if you filed an issue detailing where the docs can be improved: https://github.com/huggingface/optimum/issues . Also, if you have feature requests, such as having a more flexible API, we are eager for community contributions or suggestions!

fxmarty OP t1_ixhpdui wrote on November 23, 2022 at 3:07 PM

Reply to comment by killver in [P] BetterTransformer: PyTorch-native free-lunch speedups for Transformer-based models by fxmarty

It's a vast question really. If you are able to convert your model to ONNX with meaningful outputs, that's a good start: it means you don't have dynamic control flow and your model is traceable.
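One rough way to check for "meaningful outputs" is to compare the exported model's outputs against the original model's on the same input. A minimal sketch, assuming both outputs have already been flattened into plain lists of floats; `outputs_close`, `max_abs_diff`, the tolerances and the sample values are all illustrative, not taken from any library.

```python
import math
from typing import Sequence


def outputs_close(
    reference: Sequence[float],
    candidate: Sequence[float],
    atol: float = 1e-4,
    rtol: float = 1e-3,
) -> bool:
    """True if every candidate value is within atol + rtol * |reference value|."""
    if len(reference) != len(candidate):
        return False
    for ref, cand in zip(reference, candidate):
        if math.isnan(ref) or math.isnan(cand):
            return False
        if abs(ref - cand) > atol + rtol * abs(ref):
            return False
    return True


def max_abs_diff(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Largest element-wise absolute difference (0.0 for empty inputs)."""
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


# Made-up logits from the original model and from the exported ONNX model.
pytorch_logits = [0.1234, -2.5, 3.75]
onnx_logits = [0.12341, -2.50002, 3.7501]

print(outputs_close(pytorch_logits, onnx_logits))
print(max_abs_diff(pytorch_logits, onnx_logits))
```

If the outputs diverge well beyond floating-point noise, the export probably traced through something input-dependent.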

I would recommend giving OpenVINO or ONNX Runtime a try. Both can consume the ONNX intermediate representation.

If you are specifically dealing with transformer-based models inheriting from the implementations in the Transformers library, I would recommend taking a look at https://huggingface.co/blog/openvino and the documentation (and Optimum for ONNX Runtime, which could make your life easier).

Overall, compression techniques like structured pruning and quantization can be leveraged on CPUs - but once you start getting into edge cases, there may be diminishing returns compared to the time spent trying to optimize. Neural Magic has a closed-source inference engine that seems to have good recipes to exploit sparsity on CPUs.

I have not read it, but this paper from Intel looks interesting: https://arxiv.org/abs/2211.07715

fxmarty OP t1_ixgnwin wrote on November 23, 2022 at 8:20 AM

Reply to comment by Lewba in [P] BetterTransformer: PyTorch-native free-lunch speedups for Transformer-based models by fxmarty

Unfortunately, the ONNX export with BetterTransformer will not work. It's a bit unfortunate that model optimization / compression efforts are spread out between different (sometimes incompatible) tools, but then again different use cases require different tooling.

fxmarty OP t1_ixe3kms wrote on November 22, 2022 at 7:20 PM

Reply to comment by JahrudZ in [P] BetterTransformer: PyTorch-native free-lunch speedups for Transformer-based models by fxmarty

To add to this: if you were thinking about the more traditional 8-bit quantization with full 8-bit integer arithmetic, it is currently not usable alongside BetterTransformer. However, I don't see any reason why similar custom layers could not be implemented with fused kernels + quantization + optimization w.r.t. padding.

FlashAttention + quantization has, to the best of my knowledge, not yet been explored, but I think it would be a great engineering direction. I would not expect to see this any time soon natively in PyTorch's BetterTransformer though. /u/pommedeterresautee & folks at ELS-RD did awesome work releasing kernl, where custom implementations (through OpenAI Triton) could maybe easily live.

fxmarty OP t1_ixd5yaq wrote on November 22, 2022 at 3:38 PM

Reply to comment by visarga in [P] BetterTransformer: PyTorch-native free-lunch speedups for Transformer-based models by fxmarty

I believe it does not in PyTorch 1.13. However, if you try the PyTorch nightlies, there is support for FlashAttention and MemoryEfficientAttention. Example notebook: https://colab.research.google.com/drive/1eCDJ4pql8102J_BtGSyjCRJwLp3TTN_h . Digging into the PyTorch source code, we do indeed see them.

However, this is limited to inference for now, but given that the PyTorch team is working to include this natively, I would expect to see support for training in the future!