
sigmoid_amidst_relus t1_ix8gx9z wrote

Although you've gotten some good answers, here are some things I've learned in the past 1.5 years working with transformers on audio and speech data.

  1. The learning rate schedule matters more with audio data that is "in the wild", i.e. has large variations in SNR.
  2. Is your music data loudness normalized? It might help, although following step 3 should largely take care of it (a quick RMS sketch follows the list).
  3. While training without standardizing the data to zero mean and unit std can work, standardizing has proven critical for consistent training runs on spectral data in my setup. Without it, the best runs weren't much different, but the model gave very different results across seeds. I'd recheck that your data is mean/std-normalized correctly, and if you aren't doing it, you should. You can do it at either the per-instance or the dataset level (computing mean/std statistics over the entire dataset), and standardize each frequency bin independently or not, depending on your use case (see the per-bin sketch after this list).
  4. Keep an eye on your gradient norms during training to check whether your learning rate schedule is appropriate (a small helper for this is sketched after the list).
  5. Use linear warmup (sketched after the list). Also, try Adam or AdamW if you're not already using one of them; SGD will need significantly more hyperparameter tuning for transformers.
  6. Just in case you're doing this: do not use different depth-wise (per-layer) learning rates when training from scratch.
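
For point 2, a minimal sketch of RMS-based loudness normalization, assuming mono float waveforms; the -20 dBFS target is an arbitrary placeholder, and proper loudness normalization would use something like EBU R128 (e.g. via pyloudnorm):

```python
import numpy as np

def rms_normalize(wav: np.ndarray, target_dbfs: float = -20.0) -> np.ndarray:
    # Scale so the RMS level sits at target_dbfs; -20 dBFS is a placeholder target.
    rms = np.sqrt(np.mean(wav ** 2))
    target_rms = 10.0 ** (target_dbfs / 20.0)
    return wav * (target_rms / (rms + 1e-8))
```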
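For point 3, a minimal sketch of dataset-level, per-frequency-bin standardization; the `(n_frames, n_bins)` spectrogram layout and the `specs` iterable are assumptions about your pipeline:

```python
import numpy as np

def compute_bin_stats(specs):
    # Dataset-level stats: one mean/std per frequency bin, over all frames.
    frames = np.concatenate(list(specs), axis=0)   # (total_frames, n_bins)
    return frames.mean(axis=0), frames.std(axis=0)

def standardize(spec, mean, std, eps=1e-8):
    # Broadcasts over the time axis; each bin is standardized independently.
    return (spec - mean) / (std + eps)

# Per-instance variant: standardize(spec, spec.mean(), spec.std())
```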
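For point 4, a small PyTorch helper to log the global gradient norm after each `loss.backward()`:

```python
import torch

def global_grad_norm(model: torch.nn.Module) -> float:
    # L2 norm over all parameter gradients; call after loss.backward().
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().norm(2).item() ** 2
    return total ** 0.5
```

If you're already clipping, `torch.nn.utils.clip_grad_norm_` returns the same total norm, so you can log its return value instead.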
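For point 5, a minimal sketch of AdamW with linear warmup followed by linear decay via `LambdaLR`; the peak lr, warmup length, and total steps are assumed values you'd tune for your run:

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for your transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

warmup_steps, total_steps = 1_000, 100_000  # assumed values

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # linear warmup from 0 to peak lr
    # linear decay from peak lr down to 0
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Call scheduler.step() once per optimizer step.
```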