Submitted by parabellum630 t3_z088fo in MachineLearning
neuroguy123 t1_ix5gflf wrote
I agree with /u/erannare that data size is likely the most relevant issue. I have done a lot of training on time-series data with Transformers and they can be quite difficult to train from scratch on medium-sized datasets. This is even outlined in the main paper. Most people simply do not have enough data in new problem spaces to properly take advantage of Transformer models, despite their allure. My suggestions:
- Use a hybrid model as in the original paper. Apply some kind of ResNet, RNN, or whatever is appropriate first as a front-end ('header') that generates the tokens for the Transformer. This acts as a learned filter bank and can shrink the problem space the Transformer has to handle (see the first sketch after this list).
- A learning-rate scheduler with warmup is important (second sketch below).
- Pre-norm will probably help (it's shown in the first sketch).
- Positional encoding is essential and has to be done properly. Write unit tests for it.
- Maybe find similar data that you can pre-train with. There may be some decent knowledge transfer from adjacent time-series problems. There is A LOT of audio data out there.
- The original Transformer architecture becomes very difficult to train beyond about 500 tokens, and you may be exceeding that. You will have to either break your series into fewer tokens or use one of the architectures that get around that limit. I find that, on top of the quadratic memory cost, larger Transformers also need even more data to train (not surprisingly).
- As someone else pointed out, double- and triple-check your masking code and write unit tests for it; it's very easy to get wrong (last sketch below).
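For the hybrid front-end (and pre-norm), here is a minimal sketch in PyTorch, assuming input of shape (batch, channels, time); the channel counts, kernel sizes, and strides are hypothetical and should be tuned to your data:

```python
import torch
import torch.nn as nn

class ConvStemTransformer(nn.Module):
    def __init__(self, in_channels=1, d_model=128, nhead=4, num_layers=4):
        super().__init__()
        # Strided convolutions act as a learned filter bank and shorten
        # the sequence 4x before it reaches the attention layers.
        self.stem = nn.Sequential(
            nn.Conv1d(in_channels, d_model, kernel_size=7, stride=2, padding=3),
            nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=7, stride=2, padding=3),
            nn.GELU(),
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True,
            norm_first=True,  # pre-norm, as suggested above
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

    def forward(self, x):                # x: (batch, channels, time)
        tokens = self.stem(x)            # (batch, d_model, time / 4)
        tokens = tokens.transpose(1, 2)  # (batch, time / 4, d_model)
        # add your positional encoding here, before the encoder
        return self.encoder(tokens)
```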
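For the scheduler, a sketch of the inverse-square-root warmup schedule from the original paper, via LambdaLR; warmup_steps is a placeholder:

```python
import torch

def transformer_lr_lambda(d_model=128, warmup_steps=4000):
    # lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    def lr_lambda(step):
        step = max(step, 1)  # avoid division by zero at step 0
        return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
    return lr_lambda

model = torch.nn.Linear(10, 10)  # stand-in for your model
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, transformer_lr_lambda())
# call scheduler.step() once per training step, after optimizer.step()
```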
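And for masking, a minimal sketch of a leakage test, assuming an autoregressive model that takes (batch, time, features): perturbing a future timestep must not change any earlier output.

```python
import torch

def test_causal_masking(model, seq_len=32, d_in=1):
    model.eval()
    x = torch.randn(1, seq_len, d_in)
    x_perturbed = x.clone()
    x_perturbed[:, -1, :] += 10.0  # change only the last timestep
    with torch.no_grad():
        y = model(x)
        y_perturbed = model(x_perturbed)
    # every output before the perturbed position must be unchanged
    assert torch.allclose(y[:, :-1], y_perturbed[:, :-1], atol=1e-6), \
        "leakage: earlier outputs changed when a future input changed"
```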
All of that being said, benchmark against more specialized architectures if you have long time-series data. It takes a lot of data and a lot of compute to fully exploit a Transformer on its own as an end-to-end architecture. RNNs and WaveNets are still relevant, whether your network is autoregressive, a classifier, or both.
leoholt t1_ix5ygsg wrote
Would you mind elaborating on what a unit test for the positional encoding looks like? I'm quite new to this and would love to give it a shot.
Exarctus t1_ix7avzz wrote
You basically want to extensively test that the sequential elements in your input are being mapped to unique vectors.
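For example (a sketch, assuming a standard sinusoidal encoding; swap in your own pos_encoding implementation):

```python
import math
import torch

def pos_encoding(seq_len, d_model):
    # standard sinusoidal encoding from the original Transformer paper
    position = torch.arange(seq_len).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

def test_positions_are_unique(seq_len=512, d_model=128):
    pe = pos_encoding(seq_len, d_model)
    # every pair of positions should receive a distinct vector
    dists = torch.cdist(pe, pe)         # pairwise distances, (seq_len, seq_len)
    dists.fill_diagonal_(float("inf"))  # ignore self-distances
    assert dists.min() > 1e-4, "two positions got (nearly) identical encodings"
```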