Submitted by ackbladder_ t3_zrpsfm in MachineLearning

Hi,

For my final-year project on my BSc CompSci and AI course, I’m implementing the World Models paper to play games: essentially a variational autoencoder plus another network that predicts future latent states of the game environment.
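Roughly, the two components look like this (a minimal sketch only; layer sizes are placeholders, and the dynamics model here is a plain MLP rather than the MDN-RNN used in the paper):

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Minimal Gaussian VAE over flattened frames; all sizes are placeholders."""
    def __init__(self, obs_dim=64 * 64, z_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, z_dim)
        self.logvar = nn.Linear(256, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, obs_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterisation trick
        return self.dec(z), mu, logvar

# Dynamics model: predict the next latent from the current latent and action.
z_dim, action_dim = 32, 3
dynamics = nn.Sequential(nn.Linear(z_dim + action_dim, 128), nn.ReLU(),
                         nn.Linear(128, z_dim))
```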

The emphasis of my project is to reduce the number of parameters, and consequently the training time (making a case for reduced energy consumption). I’ll compare my models against existing ones on both parameter count and game performance.

I’ve had trouble finding existing literature on how this can be done. Obviously there isn’t a way to find an ‘optimal’ number of parameters required to solve a task, but I wanted to find techniques to reduce excess bulk in a NN without sacrificing performance.

Does anyone have any ideas or know of any resources?

TIA

7

Comments


ackbladder_ OP t1_j14gx51 wrote

Thanks for your reply. I assume the trade-off isn’t linear, so I’m hoping to find a ‘Goldilocks’ point where performance is barely affected, or affected only enough that the model still passes a given task, just not as well. I’ll look up knowledge distillation.

3

svantana t1_j14jwo4 wrote

Yeah, "distillation" is a key term here. Also, paperswithcode has joint data on performance and parameter counts, which gives a nice overview of the current pareto front. rwightman's repos is another nice resource.

4

CyberPun-K t1_j16mqoa wrote

Convolutional Neural Networks are an excellent example of how the right inductive biases can:

  1. Reduce the number of parameters (see the quick comparison below).
  2. Improve performance.
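
For a sense of scale, compare a 3x3 conv layer against a fully connected layer producing the same-sized output (the shapes are arbitrary):

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

# 32x32 single-channel input mapped to 16 feature maps at the same resolution.
conv = nn.Conv2d(1, 16, kernel_size=3, padding=1)  # weight sharing + locality
fc = nn.Linear(32 * 32, 16 * 32 * 32)              # no spatial inductive bias

print(n_params(conv))  # 16*1*3*3 + 16 = 160
print(n_params(fc))    # 1024*16384 + 16384 = 16,793,600
```
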
2

Deep-Station-1746 t1_j146dhq wrote

> reduce excess bulk in a NN without sacrificing performance

Simply put, that is not possible. There's literally always a trade-off. So the question is: what are you willing to sacrifice? How much performance are you willing to forgo?

−4

Deep-Station-1746 t1_j146uw3 wrote

The laziest option is fp16 quantization. It's as easy as model.half() on most torch-based models, and it halves the physical size of the model. You could also try knowledge distillation (read up on how distilbert was made, for example). You can also do architecture-specific things: if you have a transformer, for instance, you could use xformers' efficient attention. The list goes on and on.
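For concreteness, a rough sketch of the fp16 cast and the size saving (the model here is just a stand-in):

```python
import torch
import torch.nn as nn

# Toy stand-in for a torch-based model (e.g. the world-model VAE encoder).
model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 16))

def param_bytes(m):
    return sum(p.numel() * p.element_size() for p in m.parameters())

print("fp32:", param_bytes(model), "bytes")
model.half()  # cast parameters and buffers to float16 in place
print("fp16:", param_bytes(model), "bytes")  # exactly half the fp32 figure

# fp16 inference is normally run on GPU; inputs must match the weight dtype.
if torch.cuda.is_available():
    model.cuda()
    x = torch.randn(1, 64, device="cuda", dtype=torch.float16)
    with torch.no_grad():
        print(model(x).shape)
```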

6

Red-Portal t1_j15c4yo wrote

Not necessarily. If neural networks had dense activations, what you said would be true. But in practice they don't, so I don't think the answer is a definite no.
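Magnitude pruning is one concrete version of this: in overparameterised networks you can often zero out a large fraction of the smallest weights with little loss in accuracy. A minimal sketch using torch's built-in utility (the 50% amount is arbitrary):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 128)
prune.l1_unstructured(layer, name="weight", amount=0.5)  # zero the smallest 50% of weights
print(float((layer.weight == 0).float().mean()))         # ~0.5 sparsity
prune.remove(layer, "weight")                             # make the pruning permanent
```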

5