Submitted by ackbladder_ t3_zrpsfm in MachineLearning

Hi,

For my final-year project on my BSc CompSci and AI course, I’m implementing the World Models paper to play games: essentially a variational autoencoder plus another network that predicts future latent states of the game environment.
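Roughly, the two components look like this (a minimal sketch only; layer sizes are placeholders, and the dynamics model here is a plain MLP rather than the MDN-RNN used in the paper):

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Minimal Gaussian VAE over flattened frames; all sizes are placeholders."""
    def __init__(self, obs_dim=64 * 64, z_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, z_dim)
        self.logvar = nn.Linear(256, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, obs_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterisation trick
        return self.dec(z), mu, logvar

# Dynamics model: predict the next latent from the current latent and action.
z_dim, action_dim = 32, 3
dynamics = nn.Sequential(nn.Linear(z_dim + action_dim, 128), nn.ReLU(),
                         nn.Linear(128, z_dim))
```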

The emphasis of my project is to reduce the number of parameters, and consequently the training time (making a case for reduced energy consumption). I’ll compare my models against existing ones on both parameter count and game performance.

I’ve had trouble finding existing literature on how this can be done. Obviously there isn’t a way to find an ‘optimal’ number of parameters required to solve a task, but I wanted to find techniques to reduce excess bulk in a NN without sacrificing performance.

Does anyone have any ideas or know of any resources?

TIA

7

Comments


ackbladder_ OP t1_j14gx51 wrote

Thanks for your reply. I assume the trade-off isn’t linear, so I’m hoping to find a ‘Goldilocks’ point where performance is barely affected, or affected only enough that the model still passes a given task, just not as well. I’ll look up knowledge distillation.

3

svantana t1_j14jwo4 wrote

Yeah, "distillation" is a key term here. Also, paperswithcode has joint data on performance and parameter counts, which gives a nice overview of the current pareto front. rwightman's repos is another nice resource.

4

CyberPun-K t1_j16mqoa wrote

Convolutional Neural Networks are an excellent example of how the right inductive biases can:

  1. Reduce the number of parameters (see the quick comparison below).
  2. Improve performance.
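
For a sense of scale, compare a 3x3 conv layer against a fully connected layer producing the same-sized output (the shapes are arbitrary):

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

# 32x32 single-channel input mapped to 16 feature maps at the same resolution.
conv = nn.Conv2d(1, 16, kernel_size=3, padding=1)  # weight sharing + locality
fc = nn.Linear(32 * 32, 16 * 32 * 32)              # no spatial inductive bias

print(n_params(conv))  # 16*1*3*3 + 16 = 160
print(n_params(fc))    # 1024*16384 + 16384 = 16,793,600
```
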
2

Deep-Station-1746 t1_j146dhq wrote

> reduce excess bulk in a NN without sacrificing performance

Simply put, that is not possible. There's literally always a trade-off. So the question is: what are you willing to sacrifice? How much performance are you willing to forgo?

−4

Deep-Station-1746 t1_j146uw3 wrote

The laziest option is fp16 quantization. It's as easy as model.half() on most torch-based models, and it halves the physical size of the model. You could also try knowledge distillation (read up on how distilbert was made, for example). You can also do architecture-specific things: if you have a transformer, for instance, you could use xformers' efficient attention. The list goes on and on.
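For concreteness, a rough sketch of the fp16 cast and the size saving (the model here is just a stand-in):

```python
import torch
import torch.nn as nn

# Toy stand-in for a torch-based model (e.g. the world-model VAE encoder).
model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 16))

def param_bytes(m):
    return sum(p.numel() * p.element_size() for p in m.parameters())

print("fp32:", param_bytes(model), "bytes")
model.half()  # cast parameters and buffers to float16 in place
print("fp16:", param_bytes(model), "bytes")  # exactly half the fp32 figure

# fp16 inference is normally run on GPU; inputs must match the weight dtype.
if torch.cuda.is_available():
    model.cuda()
    x = torch.randn(1, 64, device="cuda", dtype=torch.float16)
    with torch.no_grad():
        print(model(x).shape)
```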

6

Red-Portal t1_j15c4yo wrote

Not necessarily. If neural networks had dense activations, what you said would be true. But in practice they don't, so I don't think the answer is a definite no.
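Magnitude pruning is one concrete version of this: in overparameterised networks you can often zero out a large fraction of the smallest weights with little loss in accuracy. A minimal sketch using torch's built-in utility (the 50% amount is arbitrary):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 128)
prune.l1_unstructured(layer, name="weight", amount=0.5)  # zero the smallest 50% of weights
print(float((layer.weight == 0).float().mean()))         # ~0.5 sparsity
prune.remove(layer, "weight")                             # make the pruning permanent
```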

5