parabellum630 OP t1_ix5yyv8 wrote on November 21, 2022 at 12:19 AM

Reply to comment by trashacount12345 in [R] Tips on training Transformers by parabellum630

Oh my God. I used to do this too! I am happy I am not the only one!! But my monkey brain eventually learned not to do this. I have managed to get the Transformer up to GRU performance by adding more warmup steps, using learning rate scheduling, decreasing the model size, using Pre-LN, doubling the batch size, and reducing the sequence length.
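
For anyone wondering what the warmup plus learning rate scheduling part can look like, here is a minimal sketch in plain Python. It assumes linear warmup followed by cosine decay, which is only one common choice; the comment does not say which schedule was used. The function name `lr_at_step` and every default value are hypothetical, picked for illustration.

```python
import math


def lr_at_step(step, base_lr=1e-3, warmup_steps=4000,
               total_steps=100_000, min_lr=0.0):
    """Learning rate for a given optimizer step (hypothetical helper).

    Assumed schedule: linear warmup to base_lr over warmup_steps,
    then cosine decay from base_lr down to min_lr at total_steps.
    """
    if step < warmup_steps:
        # Linear warmup: ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps

    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)

    # Cosine decay from base_lr (progress = 0) to min_lr (progress = 1).
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 1000, 3999, 4000, 50_000, 100_000):
        print(s, lr_at_step(s))
```

"More warmup steps" then just means raising `warmup_steps`, so the model spends longer at small learning rates before the peak. In a real training loop you would set the optimizer's learning rate to this value at every step.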