fxmarty OP t1_ixe3kms wrote
Reply to comment by JahrudZ in [P] BetterTransformer: PyTorch-native free-lunch speedups for Transformer-based models by fxmarty
To add to that: if you were thinking about more traditional 8-bit quantization with full 8-bit integer arithmetic, it is currently not usable alongside BetterTransformer. However, I don't see a reason why similar custom layers could not be implemented with fused kernels + quantization + optimization w.r.t. padding.
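For reference, here is a minimal sketch of the two paths taken in isolation, assuming the `optimum` `BetterTransformer.transform` API and PyTorch's dynamic int8 quantization (the model name is just an example). Combining the two is exactly what is not supported today:

```python
import torch
from transformers import AutoModel
from optimum.bettertransformer import BetterTransformer

model = AutoModel.from_pretrained("bert-base-uncased")

# Path 1: BetterTransformer fused kernels + nested tensors to skip padding
# (floating-point arithmetic only).
bt_model = BetterTransformer.transform(model, keep_original_model=True)

# Path 2: traditional dynamic quantization with int8 integer arithmetic on the
# linear layers -- currently exclusive with the BetterTransformer path.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```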
FlashAttention + quantization has, to the best of my knowledge, not yet been explored, but I think it would be a great engineering direction. I would not expect to see this any time soon natively in PyTorch's BetterTransformer though. /u/pommedeterresautee & folks at ELS-RD did awesome work releasing kernl, where custom implementations (through OpenAI Triton) could maybe easily live.
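To give a flavor of what such a custom Triton kernel could look like (purely illustrative, not kernl's actual code): a toy kernel that dequantizes int8 weights on the fly inside the kernel, which is the kind of fusion a quantized attention kernel would need to do at a much larger scale.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def dequant_mul_kernel(x_ptr, w_int8_ptr, scale, out_ptr, n_elements,
                       BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    w_q = tl.load(w_int8_ptr + offsets, mask=mask)
    # Dequantize the int8 weights inside the kernel instead of materializing
    # a full floating-point copy in memory first.
    w = w_q.to(tl.float32) * scale
    tl.store(out_ptr + offsets, x * w, mask=mask)


def dequant_mul(x: torch.Tensor, w_int8: torch.Tensor, scale: float) -> torch.Tensor:
    # Elementwise x * dequantize(w_int8), fused into a single GPU kernel.
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    dequant_mul_kernel[grid](x, w_int8, scale, out, n, BLOCK_SIZE=1024)
    return out
```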