faschu OP t1_j2xt5i6 wrote

Thanks for the reply!

I personally find TensorRT hard to debug and I prefer to use it only in production when I'm absolutely sure that the model produces the desired results.

2

Sylv__ t1_j2yrga7 wrote

Well, you can always debug and try quantization configs with fake quantization on the GPU, and once a config is good enough for you, move to TensorRT, although AFAIK quantization support in TRT is quite limited. Of course, this only lets you benchmark configs for prediction quality, not speedup.
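
For illustration, a minimal sketch of fake quantization in plain PyTorch, assuming a symmetric per-tensor int8 scheme and a made-up tensor shape; the point is that the round-trip stays in float, so everything still runs on GPU:

```python
import torch

# Stand-in for a layer's weights (hypothetical shape).
w = torch.randn(128, 128, device="cuda" if torch.cuda.is_available() else "cpu")

# Symmetric int8 scale from the observed range.
scale = (w.abs().max() / 127).item()
zero_point = 0

# Fake-quantize: simulate the int8 round-trip while staying in float,
# so the rest of the model runs unchanged on GPU.
w_fq = torch.fake_quantize_per_tensor_affine(w, scale, zero_point, -128, 127)

# The error a real int8 kernel would introduce.
print((w - w_fq).abs().max())
```

That is enough to compare configs on prediction quality before committing anything to TensorRT.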

Maybe there will be support for quantized kernels in torchinductor? I recall reading about this in a GitHub issue at some point.

Otherwise you could try bitsandbytes and pass the right parameter to do all computations in 8-bit.
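
If bitsandbytes is the route, a minimal sketch (the layer sizes are placeholders; `has_fp16_weights=False` is the parameter that keeps the weights in int8 for inference):

```python
import torch
import bitsandbytes as bnb

# A regular linear layer whose weights we want to run in 8-bit.
linear_fp16 = torch.nn.Linear(768, 768)

# bitsandbytes' 8-bit replacement; threshold=6.0 enables the
# LLM.int8() mixed-precision handling of outlier features.
linear_int8 = bnb.nn.Linear8bitLt(
    768, 768, bias=True, has_fp16_weights=False, threshold=6.0
)
linear_int8.load_state_dict(linear_fp16.state_dict())
linear_int8 = linear_int8.cuda()  # weights are quantized to int8 on .cuda()

x = torch.randn(1, 768, dtype=torch.float16, device="cuda")
out = linear_int8(x)
```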

The authors of SmoothQuant also implemented torch-int, a wrapper around CUTLASS for int8 GEMM. You can find it on GitHub!
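
Not the torch-int API itself, but a plain-PyTorch sketch of what such an int8 GEMM wrapper does under the hood (symmetric per-tensor quantization, int32 accumulation, then rescaling; CPU-only here, since generic integer matmul isn't available on CUDA):

```python
import torch

def quantize_sym(t: torch.Tensor):
    # Symmetric per-tensor int8 quantization: t ≈ scale * q.
    scale = (t.abs().max() / 127).item()
    q = torch.clamp((t / scale).round(), -128, 127).to(torch.int8)
    return q, scale

a = torch.randn(64, 256)
b = torch.randn(256, 128)

qa, sa = quantize_sym(a)
qb, sb = quantize_sym(b)

# An int8 GEMM kernel multiplies int8 inputs and accumulates in int32;
# the float result is recovered by multiplying the two scales.
acc = qa.to(torch.int32) @ qb.to(torch.int32)
c = acc.float() * (sa * sb)

print((a @ b - c).abs().max())  # quantization error vs. the fp32 GEMM
```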

2

faschu OP t1_j3q0sr7 wrote

Thanks a lot for the detailed reply! I will try these suggestions.

1