
Deep-Station-1746 t1_j146uw3 wrote

The laziest option is fp16 quantization. It's as easy as calling model.half() on most torch-based models, and it halves the physical size of the model. You could also try knowledge distillation (read up on how DistilBERT was made, for example). There's also more architecture-specific stuff: if you have a transformer, you could use xformers' memory-efficient attention, for example. The list goes on and on. A rough sketch of the fp16 route is below.
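Here's a minimal sketch of that fp16 conversion, assuming a plain torch model (the Sequential here is just a stand-in for whatever model you actually have):

```python
import torch
import torch.nn as nn

# Stand-in model; replace with your own torch.nn.Module.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
)

fp32_bytes = sum(p.numel() * p.element_size() for p in model.parameters())

# Cast all parameters and buffers to torch.float16.
model = model.half()

fp16_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"params: {fp32_bytes / 1e6:.1f} MB (fp32) -> {fp16_bytes / 1e6:.1f} MB (fp16)")

# Inference: inputs must also be fp16, and you generally want to run on GPU,
# since many fp16 ops are slow or unsupported on CPU.
if torch.cuda.is_available():
    model = model.cuda()
    x = torch.randn(8, 1024, device="cuda", dtype=torch.float16)
    with torch.no_grad():
        y = model(x)
    print(y.shape)
```

Note this is just a cast to half precision, not int8-style quantization, so expect some numerical drift; for most inference workloads it's negligible, but worth checking against your fp32 outputs.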

6