JahrudZ t1_ixdx5d9 wrote
Reply to comment by younesbelkada in [P] BetterTransformer: PyTorch-native free-lunch speedups for Transformer-based models by fxmarty
Any idea why they would be mutually exclusive? Thanks
younesbelkada t1_ixdyvls wrote
Because BetterTransformer merges the whole TransformerEncoderLayer into a single fused operation, which is called with the appropriate weights/biases at runtime.
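In practice the conversion is roughly a one-liner through optimum; a minimal sketch, with the checkpoint name only as an example:

```python
# Minimal sketch (not from the thread): converting a Hugging Face encoder model
# with optimum's BetterTransformer integration; the checkpoint is only an example.
from transformers import AutoModel
from optimum.bettertransformer import BetterTransformer

model = AutoModel.from_pretrained("bert-base-uncased")
# Supported encoder layers are swapped for the fused PyTorch fast-path operation,
# which is called with the original weights/biases at runtime.
model = BetterTransformer.transform(model, keep_original_model=False)
```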
For int8, each linear layer is replaced by the 8-bit linear layer from bitsandbytes (Linear8bitLt), which is a bit particular: at runtime it decomposes the matrix multiplication into two stages (a regular int8 matmul plus an fp16 matmul for the outlier features), using dedicated CUDA kernels. Since this logic is not embedded in PyTorch's fused operation, the two options are mutually exclusive. Please read more about int8 models here: https://huggingface.co/blog/hf-bitsandbytes-integration
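For reference, the int8 path is typically enabled like this through transformers (a minimal sketch; the checkpoint name is only an example and a CUDA GPU is required):

```python
# Minimal sketch (not from the thread): enabling the bitsandbytes int8 path
# through transformers; the checkpoint name is only an example.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",
    device_map="auto",
    load_in_8bit=True,  # nn.Linear modules are replaced by bitsandbytes' 8-bit linear layers
)
```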
fxmarty OP t1_ixe3kms wrote
To complete the picture: if you were thinking of the more traditional 8-bit quantization with full 8-bit integer arithmetic, it is currently not usable along with BetterTransformer either. However, I don't see a reason why similar custom layers could not be implemented with fused kernels + quantization + padding-aware optimization.
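To illustrate what I mean by the traditional path, here is a minimal sketch using PyTorch's built-in dynamic quantization (purely illustrative, separate from BetterTransformer; the checkpoint name is only an example):

```python
# Illustration only: "traditional" 8-bit quantization via PyTorch dynamic
# quantization, where nn.Linear weights are stored in int8 and the matmuls
# run with integer kernels.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```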
FlashAttention + quantization has, to the best of my knowledge, not yet been explored, but I think it would be a great engineering direction. I would not expect to see this any time soon natively in PyTorch's BetterTransformer, though. /u/pommedeterresautee & folks at ELS-RD did awesome work releasing kernl, where custom implementations (through OpenAI Triton) could perhaps more easily live.
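For the curious, a rough sketch of plugging kernl into a model, assuming the optimize_model entry point from its README (I have not verified the exact API here; a CUDA GPU and fp16 autocast are assumed):

```python
# Sketch only: applying kernl's Triton-based kernels to a Hugging Face model.
# The optimize_model entry point is an assumption based on the kernl README.
import torch
from transformers import AutoModel, AutoTokenizer
from kernl.model_optimization import optimize_model  # assumed public API

model = AutoModel.from_pretrained("bert-base-uncased").eval().cuda()
optimize_model(model)  # swaps supported modules for OpenAI Triton implementations

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello world", return_tensors="pt").to("cuda")
with torch.inference_mode(), torch.cuda.amp.autocast():
    outputs = model(**inputs)
```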