
fasttosmile t1_ix4er1a wrote on November 20, 2022 at 5:53 PM

None of the things you mentioned is anywhere near as important as your dataset.

Also it's important to use AdamW with high weight decay.
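For anyone unsure what sets AdamW apart from plain Adam with L2 regularization, here is a rough pure-Python sketch of a single update step. The function name `adamw_step` and the default values (including `weight_decay=0.1`) are assumptions for illustration only; the thread does not say what counts as "high". In practice you would use your framework's built-in AdamW rather than this.

```python
import math

def adamw_step(params, grads, m, v, t, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.1):
    """One AdamW update on plain lists of floats (hypothetical helper).

    t is the 1-based step count. Returns new (params, m, v) lists.
    weight_decay=0.1 is an assumed placeholder, not a recommendation.
    """
    b1, b2 = betas
    new_params, new_m, new_v = [], [], []
    for p, g, mi, vi in zip(params, grads, m, v):
        # Decoupled weight decay: shrink the weight directly instead of
        # adding weight_decay * p to the gradient (which is what L2 does).
        p = p * (1.0 - lr * weight_decay)
        # Standard Adam moment estimates with bias correction.
        mi = b1 * mi + (1.0 - b1) * g
        vi = b2 * vi + (1.0 - b2) * g * g
        m_hat = mi / (1.0 - b1 ** t)
        v_hat = vi / (1.0 - b2 ** t)
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_params.append(p)
        new_m.append(mi)
        new_v.append(vi)
    return new_params, new_m, new_v
```

Because the decay is applied to the weight itself, it is not rescaled by the adaptive denominator, so every weight shrinks by the same factor `1 - lr * weight_decay` per step regardless of its gradient history.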

parabellum630 OP t1_ix4flxr wrote on November 20, 2022 at 5:59 PM

Interesting, I will try this out as well.

fasttosmile t1_ix4gti4 wrote on November 20, 2022 at 6:07 PM

This is a great reference to follow: https://github.com/karpathy/minGPT

drivanova t1_ix9vpi7 wrote on November 21, 2022 at 9:17 PM

That, plus a decent LR scheduler, e.g. a linear warmup followed by exponential or cosine annealing.
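A minimal sketch of the warmup-plus-cosine variant, in plain Python so it does not depend on any framework. The function name `lr_at` and all default values (`base_lr`, `warmup_steps`, `total_steps`, `min_lr`) are assumptions for illustration, not values anyone in the thread recommended.

```python
import math

def lr_at(step, base_lr=3e-4, warmup_steps=1000, total_steps=10000, min_lr=0.0):
    """Learning rate for a 0-based step (hypothetical helper, assumed defaults)."""
    if step < warmup_steps:
        # Linear ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to 1 past total_steps.
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    # Cosine annealing from base_lr down to min_lr.
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

You would call this once per optimizer step and write the result into the optimizer's learning rate. Swapping the cosine line for something like `base_lr * gamma ** (step - warmup_steps)` would give the exponential variant instead, with `gamma` being another value you would have to tune.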