Submitted by parabellum630 t3_z088fo in MachineLearning

I am using transformers for music and dance sequential data. I am using a 12-layer, 800 hidden-dim, vanilla full-attention architecture from the original "Attention Is All You Need" paper. My data is audio features (MFCC, energy, envelope). A GRU architecture works really well and converges in about 15k steps, but the transformer is stuck and the loss doesn't decrease after about 20k steps.

These are the things I learned:

  1. Bigger architectures learn better and train faster
  2. Layer norms are very important
  3. Apply high learning rates to top layers and smaller rates to lower layers
  4. The batch size should be as high as possible

However, I have no clue how to troubleshoot my network to see which of these is the problem. Any general tips that have worked for you guys while debugging Transformers?

78

Comments

erannare t1_ix44g1p wrote

Dataset size is a BIG factor here. Transformers are very data hungry. They present a much larger hypothesis space and thus take a lot more data to train.

44

Cheap_Meeting t1_ix56uu6 wrote

>but the transformer is stuck and the loss doesn't decrease after about 20k steps.

Presumably they meant training loss, which would indicate that this is an optimization problem.

14

waa007 t1_ix7s5k7 wrote

Maybe there is too little data and the model is overfitting, or the model parameters got stuck in a local optimum. Is that possible?

0

suflaj t1_ix4bbi0 wrote

3. and 4. in your case are probably intertwined, and likely the reason why you are stuck. You should probably keep the learning rate constant across all layers; at most, freeze some layers if you're dealing with a big distribution shift when finetuning.

You should use warmup, a low learning rate (what that entails depends, but since music data is similar to text, that means 1e-6 to 1e-5 maximum learning rate), and increase batch size if you get stuck as training progresses.
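For reference, a minimal PyTorch-style sketch of that kind of schedule (the peak LR, warmup length, and model settings below are illustrative placeholders, not the OP's actual config):

```python
import torch

# Illustrative only: a 12-layer, 800-dim encoder with linear warmup to a low
# peak LR, then inverse-sqrt decay.
model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=800, nhead=8, batch_first=True),
    num_layers=12,
)

peak_lr = 1e-5        # "low maximum learning rate" for text/audio-like data
warmup_steps = 4000   # hypothetical; tune for your dataset

optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr)

def lr_lambda(step: int) -> float:
    step = max(step, 1)
    # ramp linearly to 1.0 over warmup_steps, then decay as sqrt(warmup/step)
    return min(step / warmup_steps, (warmup_steps / step) ** 0.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# in the training loop: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```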

Without warmup, your network will not converge.

With a high learning rate, it will likely diverge on a nasty sample, or even have its gradients explode. In practice even when using gradient clipping your network might run in a circle, depending on your samples.

Lowering the learning rate when you're stuck tends to hurt generalization, but increasing the batch size (even if you slightly increase the learning rate while you're at it) seems to fix the problem; you just have to find the right numbers. I work on text, so whenever I doubled the batch size, I increased the learning rate by a factor of the square or cube root of 2 to keep the "learning pressure" the same. YMMV.
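As a concrete illustration of that rule of thumb (purely a sketch; the numbers are made up):

```python
def scale_lr(lr: float, batch_scale: float, root: float = 2.0) -> float:
    """Scale the LR by batch_scale**(1/root) to keep the 'learning pressure' roughly constant."""
    return lr * batch_scale ** (1.0 / root)

new_lr_sqrt = scale_lr(1e-5, batch_scale=2.0)            # doubled batch, sqrt(2) rule -> ~1.41e-5
new_lr_cbrt = scale_lr(1e-5, batch_scale=2.0, root=3.0)  # doubled batch, cbrt(2) rule -> ~1.26e-5
```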

EDIT: And as other people said, make sure you have a large enough dataset. Transformers have almost no inductive biases, meaning they have to learn them from the data. Unless your augmentations are really good, I wouldn't recommend even attempting to train a transformer without at least 100k-1mil unique samples. For the size you're mentioning, the model would ideally want 1-10mil samples for finetuning and 1-10bil for pretraining.

28

parabellum630 OP t1_ix4eozs wrote

Thank you so much for these insights!! I will try these out.

5

ChangingHats t1_ix49df8 wrote

  • What justification do you have for using 12 layers as opposed to 1?
  • Why 800 hidden-dim?
  • Why encoder-decoder instead of encoder-only or decoder-only?
  • How are your tensors formatted, and how does that interact with the attention layer?
  • Are your tensors even formatted correctly? Double-check EVERYTHING.
  • Are you masking properly (time series)? (A mask sketch follows this list.)
  • Are you using an appropriate loss function?
  • Are you using pre-norm, post-norm, ReZero?
  • How are your weights being initialized?
  • Why does the batch size need to be as high as possible? I've read that low batch sizes can be preferable, but ultimately this is data-dependent anyway. Do you have a reliable way of tuning the batch size? Keep in mind that varying batch sizes will affect your metrics unless your "test" datasets are always the same batch size regardless of the "train" and "validation" batch sizes.
  • AFAIK, there's really only one learning rate, and it's set in the optimizer/fit() call; whatever proportion of that error gets backpropagated is really dependent on your model's internal structure.
  • Remember that the original paper's model was made with respect to NLP, not your specific domain of concern. Screw around with the model structure as you see fit.
  • What are you using for your feed forward part of your encoder/decoder layers? I use EinsumDense, others use Convolution, etc.
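To make the masking and norm-placement bullets concrete, here is a minimal PyTorch-style sketch (shapes and hyperparameters are illustrative placeholders, not the poster's settings):

```python
import torch
import torch.nn as nn

seq_len, d_model = 256, 800

layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=8, batch_first=True,
    norm_first=True,   # pre-norm; set to False for the original post-norm layout
)
encoder = nn.TransformerEncoder(layer, num_layers=12)

# Causal mask for time series: position t may only attend to positions <= t.
causal_mask = torch.triu(
    torch.full((seq_len, seq_len), float("-inf")), diagonal=1
)

x = torch.randn(4, seq_len, d_model)      # (batch, time, features)
out = encoder(x, mask=causal_mask)        # -> (4, 256, 800)
```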

Ultimately what you need to do is analyze every single step of the process and keep track of how your data is being manipulated.

7

parabellum630 OP t1_ix4fc49 wrote

Thank you!! I was experimenting with an off-the-shelf implementation with little customization. I am using the transformer in an encoder fashion with 800 hidden dimensions due to the constraints of other models surrounding it. I will try out varying all these hyperparameters. Looks like it's going to be a long week.

2

fasttosmile t1_ix4er1a wrote

None of the things you mentioned are close to as important as what your dataset is.

Also it's important to use AdamW with high weight decay.

4

drivanova t1_ix9vpi7 wrote

That + a decent LR scheduler, e.g. linear ramp-up + exponential/cosine annealing.
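A hedged sketch of what that combination might look like in PyTorch; excluding biases and LayerNorm parameters from weight decay is a common convention, not something either commenter prescribes:

```python
import torch

def build_optimizer(model, lr=1e-5, weight_decay=0.1):
    decay, no_decay = [], []
    for param in model.parameters():
        if not param.requires_grad:
            continue
        # crude split: 1-D params are biases / norm scales and get no weight decay
        (no_decay if param.ndim < 2 else decay).append(param)
    return torch.optim.AdamW(
        [{"params": decay, "weight_decay": weight_decay},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr,
    )

# Pair with a linear ramp-up followed by cosine annealing, e.g.:
# warmup = torch.optim.lr_scheduler.LinearLR(opt, start_factor=0.01, total_iters=4000)
# cosine = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=total_steps - 4000)
# sched  = torch.optim.lr_scheduler.SequentialLR(opt, [warmup, cosine], milestones=[4000])
```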

1

neuroguy123 t1_ix5gflf wrote

I agree with /u/erannare that data size is likely the most relevant issue. I have done a lot of training on time-series data with Transformers and they can be quite difficult to train from scratch on medium-sized datasets. This is even outlined in the main paper. Most people simply do not have enough data in new problem spaces to properly take advantage of Transformer models, despite their allure. My suggestions:

  • Use a hybrid model as in the original paper. Apply some kind of ResNet, RNN, or whatever is appropriate first as a 'header' to the transformer that generates the tokens for you. This creates a filter bank that may reduce the problem space of the Transformer.
  • A learning-rate scheduler is important.
  • Pre-norm probably will help.
  • Positional encoding is essential and has to be done properly. Run unit-test code for this.
  • Maybe find similar data that you can pre-train with. There may be some decent knowledge transfer from adjacent time-series problems. There is A LOT of audio data out there.
  • The original Transformer architecture becomes very difficult to train beyond about 500 tokens, and you may be exceeding that. You will have to either break your data series down into fewer tokens or use some other architecture that gets around that limit. I find that, in addition to the quadratic memory issue, you need even more data to train larger Transformers (not surprisingly).
  • As someone else pointed out, double and triple check your masking code and create unit tests for that. It's very easy to get wrong.
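One possible shape for such a masking unit test (an assumed-PyTorch sketch; it presumes your model applies its causal mask internally, and the dims are placeholders):

```python
import torch

def test_no_future_leakage(model, seq_len=64, d_model=800, t=10, atol=1e-5):
    """Perturbing frames after time t must not change the outputs up to time t."""
    model.eval()  # disable dropout so the comparison is deterministic
    x = torch.randn(1, seq_len, d_model)
    x_perturbed = x.clone()
    x_perturbed[:, t + 1:, :] += torch.randn(1, seq_len - t - 1, d_model)
    with torch.no_grad():
        out_a = model(x)[:, : t + 1]
        out_b = model(x_perturbed)[:, : t + 1]
    assert torch.allclose(out_a, out_b, atol=atol), "future frames leaked into past outputs"
```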

All of that being said, test against more specialized architectures if you have long time-series data. It will take a lot of data and a lot of compute to fully take advantage of a Transformer on its own as an end-to-end architecture. RNNs and WaveNets are still relevant architectures, whether your network is autoregressive, a classifier, or both.

4

leoholt t1_ix5ygsg wrote

Would you mind elaborating what a unit-test for the positional encoding is? I'm quite new to this and would love to give it a shot.

3

Exarctus t1_ix7avzz wrote

You basically want to extensively test that the sequential elements in your input are being mapped to unique vectors.
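For example, a hedged sketch of such a test, where `pos_encoding(seq_len, d_model)` is a hypothetical helper returning a `(seq_len, d_model)` tensor of positional vectors:

```python
import torch

def test_positions_are_unique(pos_encoding, seq_len=512, d_model=800, tol=1e-4):
    pe = pos_encoding(seq_len, d_model)   # (seq_len, d_model)
    dists = torch.cdist(pe, pe)           # pairwise distances between position vectors
    dists.fill_diagonal_(float("inf"))    # ignore self-distances
    assert dists.min() > tol, "two positions received (nearly) identical encodings"
```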

2

yannbouteiller t1_ix4aesl wrote

We are also currently struggling to train a Transformer for 1D sequential data, in the hope that it may eventually outperform our state-of-the-art model based on a mix of CNN, GRU, and time dilation. First, be careful about what you use as positional encoding: in low-dimensional embeddings it can easily destroy your data. Second, according to the papers, dataset size will likely be a huge factor: Transformers lack the inductive biases of, e.g., GRUs, and you need an enormous amount of data to compensate for that.

3

hadaev t1_ix5eduw wrote

Just replace the GRU with a transformer and keep the CNN as the positional encoding.
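A rough sketch of that idea (assumed PyTorch; the depthwise convolution and sizes are illustrative choices, not the commenter's exact recipe):

```python
import torch
import torch.nn as nn

class ConvPositionalFrontEnd(nn.Module):
    def __init__(self, d_model=800, kernel_size=31, num_layers=12, nhead=8):
        super().__init__()
        # depthwise 1D convolution over time injects local order information
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x):                        # x: (batch, time, features)
        pos = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.encoder(x + pos)             # conv output plays the role of a positional signal

x = torch.randn(2, 128, 800)
out = ConvPositionalFrontEnd()(x)                # -> (2, 128, 800)
```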

5

sigmoid_amidst_relus t1_ix8gx9z wrote

Although you've gotten some good answers, here are some things I've learned in the past 1.5 years working with transformers on audio and speech data.

  1. Learning rate schedule is more important with audio data that is more "in the wild", i.e. large variations in SNR.
  2. Is your music data loudness normalized? Might help. Although following step 3 should take care of it.
  3. While training without standardizing the data to zero mean and unit std can work, standardizing has proven critical for consistent training runs on spectral data in my setup. Without it, there was not much difference in the best runs, but my model would give very different results for different seeds. I'd recheck that your data is mean/std normalized correctly, and if you aren't doing it, you should. You can do it either per instance or at the dataset level (computing mean/std statistics over the entire dataset), and standardize every frequency bin independently or not, based on your use case (see the sketch after this list).
  4. Keep an eye on your gradient norms during training to check if your learning rate schedule is appropriate or not.
  5. Use linear warmup. Also, try using Adam or AdamW if you're not. SGD will need significantly more hyperparam tuning for transformers.
  6. Just in case you're doing this, do not use different depthwise learning rates if training from scratch.
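A hedged sketch of the dataset-level, per-frequency-bin standardization from point 3 (names and shapes are illustrative, not from the thread):

```python
import numpy as np

def compute_bin_stats(features):
    """features: list of (time, n_bins) arrays covering the whole training set."""
    stacked = np.concatenate(features, axis=0)   # (total_frames, n_bins)
    mean = stacked.mean(axis=0)                  # per-bin mean
    std = stacked.std(axis=0) + 1e-8             # per-bin std, avoid division by zero
    return mean, std

def standardize(x, mean, std):
    """Apply the dataset-level statistics to a single (time, n_bins) example."""
    return (x - mean) / std
```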

3

Cheap_Meeting t1_ix56atv wrote

I disagree with some of the other advice here. I would suggest starting with something that you know works. That means you could either use a training setup from another modality such as vision or text and apply it to your data, or you could try to reproduce a result from the literature first.

2

trashacount12345 t1_ix5nkdm wrote

Did you debug on a single sample or batch?

Have you double-checked that you aren't doing something like applying two sigmoids and therefore getting tiny gradients? I make that mistake pretty much every time I set up a new model.
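For the record, a minimal sketch of that pitfall in PyTorch: `BCEWithLogitsLoss` already applies a sigmoid internally, so an extra sigmoid in the model squashes the logits twice and shrinks the gradients.

```python
import torch
import torch.nn as nn

logits = torch.randn(8, 1, requires_grad=True)
targets = torch.randint(0, 2, (8, 1)).float()
loss_fn = nn.BCEWithLogitsLoss()

buggy_loss = loss_fn(torch.sigmoid(logits), targets)  # sigmoid applied twice: tiny gradients
correct_loss = loss_fn(logits, targets)               # pass raw logits instead
```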

1

parabellum630 OP t1_ix5yyv8 wrote

Oh my God. I used to do this too! I am happy I am not the only one!! But my monkey brain learned not to do this eventually. I have managed to get it to GRU performance by applying more warmup steps, learning rate scheduling, decreasing model size, using Pre-LN, doubling the batch size, and reducing the sequence length.

2

JTat79 t1_ix4z496 wrote

Ha funny title, caveman brain pleased

−3