younesbelkada t1_ixdyvls wrote

Because BetterTransformer fuses all of the TransformerEncoderLayer operations into a single fused operation, which is called with the appropriate weights / biases at runtime.
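
For context, here is a minimal sketch of what the conversion looks like with the `optimum` library (assuming `optimum` and `transformers` are installed; the exact API may vary by version):

```python
from transformers import AutoModel
from optimum.bettertransformer import BetterTransformer

# Load a standard Hugging Face encoder model
model = AutoModel.from_pretrained("bert-base-uncased")

# Swap each encoder layer for PyTorch's fused TransformerEncoderLayer fastpath
model = BetterTransformer.transform(model)
```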

For int8, each linear layer is replaced by the linear layer from bitsandbytes (`Linear8bitLt`), which is somewhat special: at runtime it decomposes the matrix multiplication into two stages (an int8 matmul for most of the hidden states, and an fp16 matmul for the outlier features), using dedicated CUDA kernels. Since this logic is not embedded in PyTorch's fused operation, these two options are mutually exclusive. Please read more about int8 models here: https://huggingface.co/blog/hf-bitsandbytes-integration
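
And a minimal sketch of the int8 path described above (assuming a CUDA GPU with `bitsandbytes` installed; `bigscience/bloom-560m` is just an illustrative model choice):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-560m"  # any supported causal LM works here

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Each nn.Linear is swapped for bitsandbytes' Linear8bitLt under the hood
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    load_in_8bit=True,
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```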
