
shanereid1 t1_jdlt38a wrote

Have you read about the lottery ticket hypothesis? It was a paper from a few years ago showing that within a dense, fully connected neural network there exists a smaller subnetwork that, when trained on its own from its original initialization, can perform just as well, even when that subnetwork is as little as a few percent of the size of the original network. AFAIK they only demonstrated this for MLPs and CNNs. It's almost certain that the power of these LLMs can be distilled in some fashion without significantly degrading performance.
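
A minimal sketch of the iterative magnitude-pruning loop used to find such "winning tickets", assuming PyTorch; `train_fn` is a hypothetical training routine and the 20% per-round pruning fraction is just for illustration:

    import copy
    import torch
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    def find_winning_ticket(model: nn.Module, train_fn, rounds: int = 5,
                            prune_frac: float = 0.2) -> nn.Module:
        """Iteratively train, prune the smallest-magnitude weights, and rewind
        the surviving weights to their original initialization."""
        init_state = copy.deepcopy(model.state_dict())          # original init
        prunable = [m for m in model.modules()
                    if isinstance(m, (nn.Linear, nn.Conv2d))]

        for _ in range(rounds):
            train_fn(model)                                      # train to convergence
            for layer in prunable:                               # drop the lowest-magnitude weights
                prune.l1_unstructured(layer, name="weight", amount=prune_frac)
            with torch.no_grad():                                # rewind weights, keep the masks
                for name, param in model.named_parameters():
                    key = name.replace("weight_orig", "weight")
                    if key in init_state:
                        param.copy_(init_state[key])
        return model

Each round removes 20% of the weights that survived the previous rounds, so five rounds already takes the network down to roughly a third of its original size.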

32

andreichiffa t1_jdvojfg wrote

It's a common result from the flat-minima literature: during training, a model needs to be overparameterized to avoid getting stuck in poor local minima and to smooth out the loss landscape.

However, the overparameterization needed at the training stage can be trimmed away at the inference stage.
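
As a rough illustration of that trimming step, here's a sketch using PyTorch's built-in pruning utilities on an already-trained model; the 50% global sparsity level is an arbitrary choice for the example:

    import torch.nn as nn
    import torch.nn.utils.prune as prune

    def prune_for_inference(trained_model: nn.Module, amount: float = 0.5) -> nn.Module:
        """Zero out the smallest-magnitude weights globally, then make the
        pruning permanent so the model can be deployed as-is."""
        layers = [(m, "weight") for m in trained_model.modules()
                  if isinstance(m, (nn.Linear, nn.Conv2d))]
        prune.global_unstructured(layers, pruning_method=prune.L1Unstructured,
                                  amount=amount)
        for module, name in layers:
            prune.remove(module, name)   # bake the mask into the weights
        return trained_model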

1