
fasttosmile t1_ix4er1a wrote

None of the things you mentioned is anywhere near as important as your dataset.

Also it's important to use AdamW with high weight decay.
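
For readers unsure what sets AdamW apart from Adam with L2 regularization, here is a minimal, framework-free sketch of one update step. The function name `adamw_step`, its argument layout, and the default `weight_decay=0.1` are illustrative assumptions, not values given in the comment. In practice you would use your framework's built-in optimizer (for example `torch.optim.AdamW` in PyTorch).

```python
import math


def adamw_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update on plain Python lists (hypothetical helper).

    t is the 1-based step count. Returns new (params, m, v) lists.
    weight_decay=0.1 is an assumed example of a "high" value.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # Decoupled weight decay: shrink the weight directly instead of
        # adding weight_decay * p to the gradient as L2 regularization would.
        p = p * (1.0 - lr * weight_decay)
        # Exponential moving averages of the gradient and its square.
        m_i = beta1 * m_i + (1.0 - beta1) * g
        v_i = beta2 * v_i + (1.0 - beta2) * g * g
        # Bias correction for the zero-initialized averages.
        m_hat = m_i / (1.0 - beta1 ** t)
        v_hat = v_i / (1.0 - beta2 ** t)
        # Adam step using the bias-corrected moments.
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_params.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v
```

Because the decay is applied to the weight itself, it is not rescaled by the adaptive denominator, which is why a larger decay value behaves differently here than it would as an L2 penalty inside Adam.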

4

drivanova t1_ix9vpi7 wrote

That, plus a decent LR scheduler, e.g. linear warmup followed by exponential or cosine annealing.
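
A schedule of this shape can be sketched in a few lines of plain Python. This shows the cosine variant only. The function name `lr_at_step` and all of its parameters are illustrative assumptions, and the warmup length and peak rate would need tuning for a given model.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine annealing down to min_lr.

    Hypothetical helper: step is 0-based; all arguments are assumptions.
    """
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    # Half-cosine from peak_lr (progress = 0) to min_lr (progress = 1).
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Calling `lr_at_step(step, total_steps=100, warmup_steps=10, peak_lr=3e-4)` once per optimizer step and writing the result into the optimizer's learning rate gives a ramp over the first 10 steps followed by a smooth decay toward zero.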

1