Submitted by Singularian2501 t3_z1b2rp in MachineLearning
CommunismDoesntWork t1_ixcqvfr wrote
What's the theory behind PTQ? As in, if quantization can preserve accuracy and create a massive speed up, why wouldn't you train on int8 to begin with? Speeding up training allows you to use even more parameters, or cut costs.
diviramon t1_iydkhtc wrote
Quantization only really works for inference. During training, the gradients are very sensitive to the decimal precision, so FP32 is needed to compute them and for training to converge. I have not seen much training done in INT8.
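To illustrate the point numerically, here's a rough NumPy sketch (my illustration, not from the thread) of symmetric per-tensor INT8 quantization: weight-sized values survive the rounding with small error, but gradient-sized values, which are typically orders of magnitude smaller, fall below one quantization step and collapse to zero.

```python
import numpy as np

def quantize_int8(x, scale):
    """Symmetric per-tensor quantization to signed 8-bit."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)  # weight-like values
scale = np.abs(w).max() / 127.0

# Weights round-trip with at most half a quantization step of error:
w_err = np.abs(w - dequantize(quantize_int8(w, scale), scale)).max()

# Gradient-like values (much smaller than the weights) all round to zero,
# which is why naive INT8 gradients stall training:
g = w * 1e-5
g_zeros = np.all(quantize_int8(g, scale) == 0)
```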
CommunismDoesntWork t1_iydruw8 wrote
Has anyone checked to see if training fundamentally needs all that precision? Intuitively, I can understand why it works better that way, but if a model can be converted to int8 after the fact without taking a huge hit in accuracy, then I don't see why an optimizer couldn't find that int8 network in the first place.
diviramon t1_iydw5aq wrote
Yeah - a quick search showed some attempts on RN50 (ResNet-50) and MobileNet, but nothing on transformers (not surprising, since INT8 quantization for BERT is very hard). However, it seems like all the INT8 focus is shifting towards MF8 (edit: FP8), which should be more suitable for training as well.
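For context on what those INT8-training attempts typically do: quantization-aware training keeps a full-precision master copy of the weights, fake-quantizes it in the forward pass, and passes gradients straight through the rounding (the straight-through estimator). A toy NumPy sketch of that idea (my illustration, not taken from those papers):

```python
import numpy as np

def fake_quant(w, num_bits=8):
    # Simulate INT8 in the forward pass; the rounding is treated as
    # identity in the backward pass (straight-through estimator).
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(w).max() / qmax + 1e-12
    return np.round(np.clip(w / scale, -qmax, qmax)) * scale

# Toy linear regression trained with fake-quantized weights.
rng = np.random.default_rng(0)
x = rng.standard_normal((100, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = x @ true_w

w = np.zeros(4)   # full-precision master copy
lr = 0.1
for _ in range(200):
    wq = fake_quant(w)                      # forward uses quantized weights
    grad = 2 * x.T @ (x @ wq - y) / len(x)  # gradient at the quantized point
    w -= lr * grad                          # update the full-precision copy

final_err = np.abs(fake_quant(w) - true_w).max()
```

The trick is that the small gradient updates accumulate in the FP32 master copy and only get rounded at the forward pass, sidestepping the gradient-underflow problem above.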
CommunismDoesntWork t1_iyecw45 wrote
> MF8
I've never heard of this, and Google isn't being helpful. Any links?
diviramon t1_iyejg7z wrote
It is the new Nvidia FP8 data type: https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/
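Per that Hopper post, FP8 comes in two layouts: E4M3 (4 exponent bits, 3 mantissa bits, bias 7) and E5M2 (5 exponent bits, 2 mantissa bits, bias 15). A minimal decoder sketch of the bit layout (my illustration; it ignores the NaN/Inf special encodings, which differ between the two formats):

```python
def decode_fp8(byte, exp_bits, mant_bits, bias):
    """Decode one FP8 byte given its exponent/mantissa split.

    Special values are skipped for brevity: E4M3 reserves only
    S.1111.111 for NaN (no Inf), while E5M2 uses IEEE-style Inf/NaN
    when the exponent field is all ones.
    """
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> mant_bits) & ((1 << exp_bits) - 1)
    mant = byte & ((1 << mant_bits) - 1)
    if exp == 0:  # subnormal: no implicit leading 1
        return sign * (mant / (1 << mant_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + mant / (1 << mant_bits)) * 2.0 ** (exp - bias)

e4m3 = lambda b: decode_fp8(b, exp_bits=4, mant_bits=3, bias=7)
e5m2 = lambda b: decode_fp8(b, exp_bits=5, mant_bits=2, bias=15)

# E4M3 trades range for precision; E5M2 keeps FP16's exponent range.
max_e4m3 = e4m3(0x7E)  # 1.75 * 2^8  = 448.0
max_e5m2 = e5m2(0x7B)  # 1.75 * 2^15 = 57344.0
```

The wider-exponent E5M2 is the more natural fit for gradients, which is part of why FP8 looks more promising for training than INT8.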