Submitted by parabellum630 t3_z088fo in MachineLearning
trashacount12345 t1_ix5nkdm wrote
Did you debug on a single sample or batch?
Have you double-checked that you don’t have something like two sigmoids applied back to back, which would give you tiny gradients? I make that mistake pretty much every time I set up a new model.
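To make the double-sigmoid point concrete, here is a minimal sketch in plain Python (no framework assumed; the function names are made up for illustration). A common way this happens in PyTorch is a model that ends in a sigmoid and is then trained with a loss such as `torch.nn.BCEWithLogitsLoss`, which applies a sigmoid internally. The sketch compares the gradient with respect to the logit for one sigmoid versus two stacked sigmoids.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def single_sigmoid_grad(x):
    # d/dx sigmoid(x) = s * (1 - s), which is at most 0.25
    s = sigmoid(x)
    return s * (1.0 - s)

def double_sigmoid_grad(x):
    # Chain rule for sigmoid(sigmoid(x)): product of two sigmoid derivatives,
    # so it is at most 0.25 * 0.25 = 0.0625
    s1 = sigmoid(x)
    s2 = sigmoid(s1)
    return s2 * (1.0 - s2) * s1 * (1.0 - s1)

# Print both gradients for a few example logits
for logit in (-4.0, 0.0, 4.0):
    print(logit, single_sigmoid_grad(logit), double_sigmoid_grad(logit))
```

Besides the smaller gradient, the second sigmoid only ever sees inputs between 0 and 1, so its output stays between 0.5 and roughly 0.731. The model can never predict a probability below 0.5, which is usually the easier symptom to spot.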
parabellum630 OP t1_ix5yyv8 wrote
Oh my God. I used to do this too! I am happy I am not the only one!! But my monkey brain learned not to do this eventually. I have managed to get it to GRU performance by applying more warmup steps, learning rate scheduling, decreasing model size, using Pre-LN, doubling the batch size, and reducing the sequence length.
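For anyone wanting to try the warmup and learning rate scheduling part, here is a rough sketch of one common combination: linear warmup followed by cosine decay. The comment above does not say which schedule was actually used, so the shape of the schedule, the function name `lr_at_step`, and all the default values are assumptions chosen for illustration.

```python
import math

def lr_at_step(step, base_lr=1e-3, warmup_steps=1000, total_steps=10000):
    """Learning rate for a 0-indexed training step (hypothetical defaults)."""
    if step < warmup_steps:
        # Linear warmup: ramp from base_lr / warmup_steps up to base_lr
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Print the learning rate at a few points in training
for step in (0, 500, 999, 1000, 5500, 10000):
    print(step, lr_at_step(step))
```

"More warmup steps" then just means raising `warmup_steps`, so the model spends longer at small learning rates before reaching the peak.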