ChuckSeven t1_j8t5r5m wrote

Yeah, it depends. Even just the batch size makes a difference. But for really big models, I'd assume that the number of weights far outweighs the number of activations.
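
To make the batch-size point concrete, here is a rough back-of-the-envelope sketch. It assumes a plain fully connected network with biases and counts one stored output per unit per sample, which is roughly what backprop has to keep around. The helper name `mlp_counts` and the layer widths are made up for illustration, and it ignores optimizer state, convolutions, and attention, where weights are reused across positions and activations tend to weigh more.

```python
def mlp_counts(layer_widths, batch_size):
    """Return (num_weights, num_activations) for a plain fully connected net.

    Hypothetical helper for illustration: weights include biases, and
    activations count one stored output per unit per sample.
    """
    pairs = list(zip(layer_widths[:-1], layer_widths[1:]))
    num_weights = sum(n_in * n_out + n_out for n_in, n_out in pairs)
    num_activations = batch_size * sum(layer_widths[1:])
    return num_weights, num_activations


# Made-up layer widths, purely to show how the ratio moves with batch size.
widths = [1024, 4096, 4096, 1024]
for batch_size in (1, 64, 4096):
    w, a = mlp_counts(widths, batch_size)
    print(f"batch={batch_size:5d}  weights={w:,}  activations={a:,}  ratio={a / w:.4f}")
```

With these made-up widths the weight count stays fixed while the activation count grows linearly with batch size, so which one dominates really does depend on the configuration.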

3

MustachedSpud t1_j8t65fh wrote

Yeah, very configuration-dependent, but larger batch sizes usually learn faster, so there's a tendency to lean into that.

1