fxmarty OP t1_ixe3kms wrote on November 22, 2022 at 7:20 PM

Reply to comment by JahrudZ in [P] BetterTransformer: PyTorch-native free-lunch speedups for Transformer-based models by fxmarty

To complete, if you were thinking about the more traditional 8-bits quantization with full 8-bits integer arithmetic, it is currently not usable along BetterTransformer. However, I don't see reasons why similar custom layers could not be implemented with fused kernels + quantization + optimization w.r.t. padding.

FlashAttention + quantization has to the best of knowledge not yet been explored, but I think it would be a great engineering direction. I would not expect to see this any time soon natively in PyTorch's BetterTransformer though. /u/pommedeterresautee & folks at ELS-RD made an awesome work releasing kernl where custom implementations (through OpenAI Triton) could maybe easily live.