erannare t1_ix44g1p wrote on November 20, 2022 at 4:43 PM

Dataset size is a BIG factor here. Transformers are very data hungry. They present a much larger hypothesis space and thus take a lot more data to train.

Cheap_Meeting t1_ix56uu6 wrote on November 20, 2022 at 8:58 PM

>but the transformer is stuck and loss doesn't decrease after abt 20k steps.

Presumably they meant training loss, which would indicate that this is an optimization problem.
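One rough way to tell the two cases apart is to compare the recent trend of training loss against validation loss. The sketch below is only an illustration: `diagnose`, `window`, and `tol` are made-up names and thresholds, not something from the original post, and real loss curves are noisy enough that you would want to smooth them first.

```python
def diagnose(train_losses, val_losses, window=3, tol=1e-3):
    """Label a run from its recent loss history (hypothetical heuristic)."""
    if len(train_losses) <= window or len(val_losses) <= window:
        raise ValueError("need more than `window` recorded losses")

    # Change over the last `window` evaluations; negative means improving.
    train_delta = train_losses[-1] - train_losses[-1 - window]
    val_delta = val_losses[-1] - val_losses[-1 - window]

    if train_delta > -tol:
        # Training loss itself has stopped falling: the model is not even
        # fitting the training set, which points at optimization.
        return "optimization"
    if val_delta > tol:
        # Training loss keeps falling while validation loss rises.
        return "overfitting"
    return "still learning"


# Training loss flat over the last few evaluations, as in the quoted post.
print(diagnose([2.0, 1.5, 1.2, 1.2, 1.2, 1.2], [2.1, 1.6, 1.3, 1.3, 1.3, 1.3]))
```

If the training loss is flat, things like learning rate, warmup, and initialization are the usual suspects before dataset size.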

waa007 t1_ix7s5k7 wrote on November 21, 2022 at 12:08 PM

Maybe there is too little data and the model overfits, with the model parameters stuck in a local optimum. Is that possible?