
Deep-Station-1746 t1_j146uw3 wrote

The laziest option is fp16 quantization. It's as easy as calling model.half() on most torch-based models, and it halves the physical size of the model. You could also try knowledge distillation (read up on how DistilBERT was made, for example). There's also more architecture-specific stuff: if you have a transformer, you could use xformers' memory-efficient attention, for example. The list goes on and on. A rough sketch of the fp16 route is below.
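Here's a minimal sketch of that fp16 conversion, assuming a plain torch model (the Sequential here is just a stand-in for whatever model you actually have):

```python
import torch
import torch.nn as nn

# Stand-in model; replace with your own torch.nn.Module.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
)

fp32_bytes = sum(p.numel() * p.element_size() for p in model.parameters())

# Cast all parameters and buffers to torch.float16.
model = model.half()

fp16_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"params: {fp32_bytes / 1e6:.1f} MB (fp32) -> {fp16_bytes / 1e6:.1f} MB (fp16)")

# Inference: inputs must also be fp16, and you generally want to run on GPU,
# since many fp16 ops are slow or unsupported on CPU.
if torch.cuda.is_available():
    model = model.cuda()
    x = torch.randn(8, 1024, device="cuda", dtype=torch.float16)
    with torch.no_grad():
        y = model(x)
    print(y.shape)
```

Note this is just a cast to half precision, not int8-style quantization, so expect some numerical drift; for most inference workloads it's negligible, but worth checking against your fp32 outputs.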

6