
debrises t1_j3itq4j wrote

Larger batch sizes give a better estimate of the gradient, meaning optimizer steps tend to point in the “right” direction, which usually leads to faster convergence.
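As a concrete illustration (a minimal PyTorch sketch with a made-up toy dataset), the batch size is just the `batch_size` argument of the `DataLoader`; each optimizer step then averages the per-sample gradients over that many examples, so larger batches give a less noisy estimate at the cost of fewer updates per epoch:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset (hypothetical): 1,000 samples with 20 features each.
X = torch.randn(1000, 20)
y = torch.randn(1000, 1)
dataset = TensorDataset(X, y)

# The mini-batch gradient is an average over `batch_size` per-sample gradients,
# so the larger loader produces lower-variance (but fewer) optimizer steps.
small_loader = DataLoader(dataset, batch_size=16, shuffle=True)
large_loader = DataLoader(dataset, batch_size=256, shuffle=True)
```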

Run a trial training pass to see roughly when your model converges, then train for slightly more epochs than that so the model has a chance to find a better minimum. And use a model checkpoint callback so you always keep the best weights.
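No framework is named in the thread; in Keras this would be the `ModelCheckpoint` callback, and a rough hand-rolled PyTorch equivalent (toy model and data, illustrative only) could look like:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy regression data so the loop below actually runs (illustrative only).
X, y = torch.randn(512, 20), torch.randn(512, 1)
train_loader = DataLoader(TensorDataset(X[:400], y[:400]), batch_size=32, shuffle=True)
val_loader = DataLoader(TensorDataset(X[400:], y[400:]), batch_size=32)

model = nn.Linear(20, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

best_val_loss = float("inf")
num_epochs = 12  # a bit more than where the trial run flattened out

for epoch in range(num_epochs):
    model.train()
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss_fn(model(xb), yb).backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = sum(loss_fn(model(xb), yb).item() for xb, yb in val_loader) / len(val_loader)

    # "Checkpoint callback" by hand: only keep the best weights seen so far.
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), "best_model.pt")
```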

As for the loss, just use an optimizer from the Adam family, like AdamW. It handles most of the problems that can come up pretty well.
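In PyTorch that's a one-line swap; the hyperparameters shown below are simply the library defaults, not values recommended in this thread:

```python
import torch
from torch import nn

model = nn.Linear(20, 1)  # stand-in model

# AdamW = Adam with decoupled weight decay; these are PyTorch's default values.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    weight_decay=0.01,
)
```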

The learning rate heavily depends on the range of values your loss takes. Think about it this way: the weight update is roughly lr * gradient, and the gradient scales with your loss. So if your loss is around 10 and you use an lr of 0.01, the updates end up on the order of 10 * 0.01 = 0.1. Usually we want our weights to be small and centered around zero, and to update them by even smaller amounts each step. The point is that your model doesn't know what scale your loss has, so you have to tune the learning rate to find the value that connects your loss signal to your weights at the right scale.
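A toy single-weight example (made-up numbers, plain SGD rather than Adam) showing how the loss scale feeds through the gradient into the lr-scaled update:

```python
import torch

w = torch.tensor([0.05], requires_grad=True)   # weights are usually small, near zero
x, target = torch.tensor([2.0]), torch.tensor([3.25])

loss = (w * x - target).pow(2).mean()          # squared error, about 9.9 here
loss.backward()                                # gradient: 2 * (w*x - target) * x ≈ -12.6

lr = 0.01
print(loss.item())            # ≈ 9.92
print(w.grad.item())          # ≈ -12.6  (gradient magnitude tracks the loss scale)
print((lr * w.grad).item())   # ≈ -0.126  SGD update, already larger than the weight itself
```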
