Submitted by parabellum630 t3_z088fo in MachineLearning
I am using transformers for music and dance sequential data. I am using a 12-layer, 800 hidden-dim, vanilla full-attention architecture from the original "Attention Is All You Need" paper. My data is audio features (MFCC, energy, envelope). A GRU architecture works really well and converges in about 15k steps, but the transformer is stuck and the loss doesn't decrease after about 20k steps.
These are the things I learned:
- Bigger architectures learn better and train faster
- Layer norms are very important
- Apply higher learning rates to the top layers and smaller rates to the lower layers
- The batch size should be as high as possible
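The layer-wise learning-rate tip above can be implemented in PyTorch with optimizer parameter groups. This is a minimal sketch, assuming a stock `nn.TransformerEncoder` roughly matching the dimensions in the post; the decay factor of 0.9 is an illustrative choice, not a value from the thread.

```python
import torch
import torch.nn as nn

# Illustrative stand-in for the poster's model: 12 layers, 800 hidden dim
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=800, nhead=8, batch_first=True),
    num_layers=12,
)

base_lr = 1e-4   # assumed top-layer learning rate
decay = 0.9      # each lower layer gets 90% of the rate of the layer above

param_groups = []
n_layers = len(model.layers)
for i, layer in enumerate(model.layers):
    # top layer (i == n_layers - 1) keeps base_lr; lower layers are scaled down
    scale = decay ** (n_layers - 1 - i)
    param_groups.append({"params": layer.parameters(), "lr": base_lr * scale})

optimizer = torch.optim.AdamW(param_groups)
```

Each parameter group carries its own `lr`, so a single optimizer instance handles the whole schedule without custom code in the training loop.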
However, I have no clue how to troubleshoot my network to see which of these is the problem. Any general tips that have worked for you guys while debugging transformers?
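One common first sanity check for a stuck network (not from the post itself, offered as a debugging suggestion) is to try to overfit a single small batch: if the model can't drive the loss near zero on a handful of examples, the problem is in the model or optimization, not the dataset. A minimal sketch with an illustrative stand-in model:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny illustrative regression model; swap in your transformer here
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 20))

# One fixed batch of 8 examples, reused every step
x = torch.randn(8, 20)
y = torch.randn(8, 20)

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

first = None
for step in range(500):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    if first is None:
        first = loss.item()

final = loss.item()
# Expect final loss to be far below the initial loss on this single batch
```

If this check fails for the transformer but passes for the GRU on the same batch, that points at the transformer's optimization setup (learning rate, warmup, initialization) rather than data quantity.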
erannare t1_ix44g1p wrote
Dataset size is a BIG factor here. Transformers are very data hungry. They present a much larger hypothesis space and thus take a lot more data to train.