
Deep-Station-1746 t1_j146dhq wrote

> reduce excess bulk in a NN without sacrificing performance

Simply put, that is not possible. There's always a trade-off. So the question is: what are you willing to sacrifice? How much performance are you willing to forgo?

−4

Deep-Station-1746 t1_j146uw3 wrote

The laziest option is fp16 quantization: as easy as model.half() on most torch-based models, and it halves the physical size of the model. You could also try knowledge distillation (read up on how DistilBERT was made, for example). And you can do things that are more architecture-specific: if you have a transformer, you could use xformers' memory-efficient attention. The list goes on and on.
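A minimal sketch of that .half() cast, assuming an arbitrary torch model (the small MLP below is just a placeholder, not anything from the thread):

```python
# Hedged sketch of the fp16 option mentioned above: a plain .half() cast.
import torch
import torch.nn as nn

# Placeholder model; any nn.Module is handled the same way.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

def param_bytes(m: nn.Module) -> int:
    # Total size of all parameters in bytes.
    return sum(p.numel() * p.element_size() for p in m.parameters())

before = param_bytes(model)
model = model.half()      # cast parameters and buffers to float16
after = param_bytes(model)
print(before, after)      # ~2x reduction: 2 bytes per weight instead of 4

# For inference, inputs must match the model dtype; fp16 kernels are most
# complete on GPU, so move both there if one is available.
if torch.cuda.is_available():
    model = model.cuda().eval()
    x = torch.randn(4, 512, device="cuda", dtype=torch.float16)
    with torch.no_grad():
        print(model(x).dtype)  # torch.float16
```

The cast alone only shrinks storage; whether accuracy holds up in fp16 is model-dependent, so it's worth re-evaluating after the conversion.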

6

Red-Portal t1_j15c4yo wrote

Not necessarily. If neural networks had dense activations, what you said would be true. But in practice they are heavily overparameterized and their activations are far from dense, so I think the answer cannot be a definite no.
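For concreteness, unstructured magnitude pruning is one way that plays out: if many weights contribute little, a large fraction can be zeroed out with little accuracy loss. An illustrative sketch using torch.nn.utils.prune (the layer and the 50% amount are arbitrary placeholders, not from the thread):

```python
# Illustrative sketch: magnitude pruning, which exploits the fact that
# many weights in an overparameterized network contribute very little.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(512, 256)  # placeholder layer; amount=0.5 is arbitrary

# Zero out the 50% of weights with the smallest absolute value.
prune.l1_unstructured(layer, name="weight", amount=0.5)
print((layer.weight == 0).float().mean())  # ~0.5 of the weights are now zero

# Make the sparsity permanent (removes the pruning reparameterization).
prune.remove(layer, "weight")
```

How much accuracy actually survives depends on the model and the pruning amount, which is exactly where the trade-off question from the parent comment comes back in.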

5