Submitted by begooboi t3_119zmpd in deeplearning
We know that a 175-billion-parameter GPT model generates better text than a 1-billion-parameter GPT model. With CNNs, we know that deeper models learn more complex feature maps, which makes them better image learners. Is there a similar theory that explains the performance of big transformers?
Appropriate_Ant_4629 t1_j9p4z0e wrote
A bigger array holds more information than a smaller one.
^(You'd need to refine your question. It's obvious that a bigger model could outperform a smaller one -- simply by noticing that it could be made identical to the smaller one just by setting the rest of its weights to zero. For every single one of those extra weights, if there's any value better than zero, the larger model would be better.)
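A minimal sketch of the argument in that comment (not from the thread itself): a wider network can reproduce a narrower one exactly by placing the small model's weights inside the large one and zeroing everything else, so extra capacity can never make the best achievable fit worse. The two-layer MLP and the layer sizes below are hypothetical choices for illustration, assuming NumPy is available.

    import numpy as np

    rng = np.random.default_rng(0)

    def mlp(x, W1, b1, W2, b2):
        """Two-layer MLP: x -> ReLU(x @ W1 + b1) @ W2 + b2."""
        h = np.maximum(x @ W1 + b1, 0.0)
        return h @ W2 + b2

    d_in, d_small, d_large, d_out = 4, 8, 32, 3

    # "Small" model with random weights.
    W1s = rng.normal(size=(d_in, d_small))
    b1s = rng.normal(size=d_small)
    W2s = rng.normal(size=(d_small, d_out))
    b2s = rng.normal(size=d_out)

    # "Large" model: copy the small weights into the first 8 hidden units,
    # zero out everything involving the extra 24 units.
    W1l = np.zeros((d_in, d_large));  W1l[:, :d_small] = W1s
    b1l = np.zeros(d_large);          b1l[:d_small] = b1s
    W2l = np.zeros((d_large, d_out)); W2l[:d_small, :] = W2s
    b2l = b2s.copy()

    x = rng.normal(size=(5, d_in))
    # The two models produce identical outputs, so the large model's best
    # achievable loss is at most the small model's.
    print(np.allclose(mlp(x, W1s, b1s, W2s, b2s), mlp(x, W1l, b1l, W2l, b2l)))  # True

This only shows the large model can *match* the small one; why training actually finds better-than-zero values for the extra weights at scale is the open part of the question.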