JahrudZ t1_ixdx5d9 wrote
Reply to comment by younesbelkada in [P] BetterTransformer: PyTorch-native free-lunch speedups for Transformer-based models by fxmarty
Any idea why they would be mutually exclusive? Thanks
younesbelkada t1_ixdyvls wrote
Because BetterTransformer merges the whole TransformerEncoderLayer into a single fused operation, which is called with the appropriate weights/biases at runtime.
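In practice the conversion is roughly a one-liner through optimum; a minimal sketch, with the checkpoint name only as an example:

```python
# Minimal sketch (not from the thread): converting a Hugging Face encoder model
# with optimum's BetterTransformer integration; the checkpoint is only an example.
from transformers import AutoModel
from optimum.bettertransformer import BetterTransformer

model = AutoModel.from_pretrained("bert-base-uncased")
# Supported encoder layers are swapped for the fused PyTorch fast-path operation,
# which is called with the original weights/biases at runtime.
model = BetterTransformer.transform(model, keep_original_model=False)
```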
For int8, each linear layer is replaced by the 8-bit linear layer from bitsandbytes (Linear8bitLt), which is a bit particular: at runtime it decomposes the matrix multiplication into two stages (a regular int8 matmul plus an fp16 matmul for the outlier features), using dedicated CUDA kernels. Since this logic is not embedded in PyTorch's fused operation, the two options are mutually exclusive. Please read more about int8 models here: https://huggingface.co/blog/hf-bitsandbytes-integration
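For reference, the int8 path is typically enabled like this through transformers (a minimal sketch; the checkpoint name is only an example and a CUDA GPU is required):

```python
# Minimal sketch (not from the thread): enabling the bitsandbytes int8 path
# through transformers; the checkpoint name is only an example.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",
    device_map="auto",
    load_in_8bit=True,  # nn.Linear modules are replaced by bitsandbytes' 8-bit linear layers
)
```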
fxmarty OP t1_ixe3kms wrote
To complete the picture: if you were thinking of the more traditional 8-bit quantization with full 8-bit integer arithmetic, it is currently not usable along with BetterTransformer either. However, I don't see a reason why similar custom layers could not be implemented with fused kernels + quantization + padding-aware optimization.
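To illustrate what I mean by the traditional path, here is a minimal sketch using PyTorch's built-in dynamic quantization (purely illustrative, separate from BetterTransformer; the checkpoint name is only an example):

```python
# Illustration only: "traditional" 8-bit quantization via PyTorch dynamic
# quantization, where nn.Linear weights are stored in int8 and the matmuls
# run with integer kernels.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```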
FlashAttention + quantization has, to the best of my knowledge, not yet been explored, but I think it would be a great engineering direction. I would not expect to see this any time soon natively in PyTorch's BetterTransformer, though. /u/pommedeterresautee & folks at ELS-RD did awesome work releasing kernl, where custom implementations (through OpenAI Triton) could perhaps more easily live.
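For the curious, a rough sketch of plugging kernl into a model, assuming the optimize_model entry point from its README (I have not verified the exact API here; a CUDA GPU and fp16 autocast are assumed):

```python
# Sketch only: applying kernl's Triton-based kernels to a Hugging Face model.
# The optimize_model entry point is an assumption based on the kernl README.
import torch
from transformers import AutoModel, AutoTokenizer
from kernl.model_optimization import optimize_model  # assumed public API

model = AutoModel.from_pretrained("bert-base-uncased").eval().cuda()
optimize_model(model)  # swaps supported modules for OpenAI Triton implementations

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello world", return_tensors="pt").to("cuda")
with torch.inference_mode(), torch.cuda.amp.autocast():
    outputs = model(**inputs)
```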