Submitted by parabellum630 t3_z088fo in MachineLearning
fasttosmile t1_ix4er1a wrote
None of the things you mentioned matter nearly as much as your dataset.
Also it's important to use AdamW with high weight decay.
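To make the AdamW point concrete, here is a rough pure-Python sketch of a single update with decoupled weight decay. The function name `adamw_step` and every default value (including `weight_decay=0.1`) are illustrative assumptions, not settings recommended in this thread. In PyTorch you would normally just use `torch.optim.AdamW(params, lr=..., weight_decay=...)`.

```python
import math

def adamw_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update on lists of floats (hypothetical helper).

    t is the 1-based step count. Returns new (params, m, v) lists.
    All default hyperparameters are placeholders for illustration.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # Decoupled weight decay: shrink the weight directly,
        # instead of adding weight_decay * p to the gradient.
        p = p * (1.0 - lr * weight_decay)
        # Exponential moving averages of the gradient and its square.
        m_i = beta1 * m_i + (1.0 - beta1) * g
        v_i = beta2 * v_i + (1.0 - beta2) * g * g
        # Bias correction for the zero-initialised averages.
        m_hat = m_i / (1.0 - beta1 ** t)
        v_hat = v_i / (1.0 - beta2 ** t)
        # Adam step on the (already decayed) weight.
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_params.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v
```

Note that the decay acts on the weight even when the gradient is zero, which is the difference from plain L2 regularisation folded into Adam's gradient.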
parabellum630 OP t1_ix4flxr wrote
Interesting, I will try this out as well.
fasttosmile t1_ix4gti4 wrote
This is a great reference to follow: https://github.com/karpathy/minGPT
drivanova t1_ix9vpi7 wrote
That, plus a decent LR scheduler, e.g. a linear warmup followed by exponential or cosine annealing.
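A schedule like that can be sketched in a few lines. This is a minimal version of linear warmup followed by cosine annealing; the function name `lr_at_step` and all default values are assumptions for illustration, not numbers from this thread.

```python
import math

def lr_at_step(step, base_lr=3e-4, warmup_steps=1000,
               total_steps=100_000, min_lr=0.0):
    """Learning rate for a 0-based training step (hypothetical helper)."""
    if step < warmup_steps:
        # Linear ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Cosine annealing from base_lr down to min_lr.
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

In a training loop you would call it each step and write the result into the optimizer's learning rate. For an exponential decay instead, replace the cosine term with something like `base_lr * gamma ** (step - warmup_steps)` for a chosen `gamma` below 1.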