Submitted by parabellum630 t3_z088fo in MachineLearning
ChangingHats t1_ix49df8 wrote
- What justification do you have for using 12 layers as opposed to 1?
- Why 800 hidden-dim?
- Why encoder-decoder instead of encoder-only or decoder-only?
- How are your tensors formatted, and how does that interact with the attention layer?
- Are your tensors even formatted correctly? Double-check EVERYTHING.
- Are you masking properly for time series, i.e. with a causal mask so each step can only attend to itself and earlier steps? (See the masking sketch below.)
- Are you using an appropriate loss function?
- Are you using pre-norm, post-norm, or ReZero? (See the norm-placement sketch below.)
- How are your weights being initialized? (See the initializer sketch below.)
- Why does the batch size need to be as high as possible? I've read that small batch sizes can be preferable, but ultimately this is data-dependent anyway. Do you have a reliable way of tuning it? Keep in mind that varying batch sizes will affect your metrics unless your "test" datasets always use the same batch size regardless of the "train" and "validation" batch sizes.
- AFAIK there's really only one learning rate, and it's set on the optimizer in the fit() call; how much of that error signal actually propagates back to each layer depends on your model's internal structure.
- Remember that the original paper's model was made with respect to NLP, not your specific domain of concern. Screw around with the model structure as you see fit.
- What are you using for the feed-forward part of your encoder/decoder layers? I use EinsumDense, others use Convolution, etc. (See the feed-forward sketch below.)
Ultimately what you need to do is analyze every single step of the process and keep track of how your data is being manipulated.
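To make the masking point concrete, here's a minimal causal-mask sketch. I'm assuming Keras/TF (based on the EinsumDense and fit() mentions), and every shape and head count below is a made-up placeholder, not something from your model:

```python
import tensorflow as tf

def causal_mask(seq_len):
    # Lower-triangular (seq_len, seq_len) matrix of ones: row i keeps
    # columns <= i, so position i can only attend to itself and the past.
    return tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)

batch, seq_len, d_model = 32, 64, 800   # placeholder shapes
x = tf.random.normal((batch, seq_len, d_model))

mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)
mask = causal_mask(seq_len)                       # broadcasts over batch and heads
out = mha(query=x, value=x, attention_mask=mask)  # (32, 64, 800)
```

Recent Keras versions can also build this mask for you via use_causal_mask=True in the MultiHeadAttention call.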
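For the norm-placement question, a sketch of the two residual orderings, with a plain FFN standing in for any sublayer (sizes are placeholders):

```python
import tensorflow as tf
from tensorflow.keras import layers

def post_norm(x, sublayer, norm):
    # Post-norm (the original Transformer ordering): sublayer, residual add, then LayerNorm.
    return norm(x + sublayer(x))

def pre_norm(x, sublayer, norm):
    # Pre-norm: LayerNorm first, sublayer on the normalized input, residual add last.
    # Tends to train more stably in deep stacks and to need less warmup.
    return x + sublayer(norm(x))

d_model = 800  # placeholder width
ffn = tf.keras.Sequential([layers.Dense(4 * d_model, activation="relu"),
                           layers.Dense(d_model)])
norm = layers.LayerNormalization(epsilon=1e-6)

x = tf.random.normal((32, 64, d_model))
y_post = post_norm(x, ffn, norm)  # (32, 64, 800)
y_pre = pre_norm(x, ffn, norm)    # (32, 64, 800)
```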
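For initialization, Keras defaults to Glorot (Xavier) uniform for the kernels of Dense-style layers, but it's worth passing the initializer explicitly so you know exactly what you're getting and can swap it out (again, placeholder sizes):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Explicit initializers instead of relying on the defaults.
dense = layers.Dense(
    800,
    kernel_initializer=tf.keras.initializers.GlorotUniform(seed=0),
    bias_initializer="zeros",
)

x = tf.random.normal((32, 64, 800))
y = dense(x)  # (32, 64, 800); Dense acts on the last axis only
```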
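And for the feed-forward block, the same position-wise FFN written two ways, with plain Dense and with EinsumDense (tf.keras.layers.EinsumDense in recent TF; older releases keep it under layers.experimental). Widths are placeholders:

```python
import tensorflow as tf
from tensorflow.keras import layers

d_model, d_ff = 800, 3200  # placeholder widths

# Plain Dense: applied independently at every time step (acts on the last axis).
ffn_dense = tf.keras.Sequential([
    layers.Dense(d_ff, activation="relu"),
    layers.Dense(d_model),
])

# EinsumDense equivalent: "abc,cd->abd" contracts only the feature axis,
# leaving batch (a) and time (b) alone; bias_axes adds a bias on the new output axis.
ffn_einsum = tf.keras.Sequential([
    layers.EinsumDense("abc,cd->abd", output_shape=(None, d_ff),
                       activation="relu", bias_axes="d"),
    layers.EinsumDense("abd,de->abe", output_shape=(None, d_model),
                       bias_axes="e"),
])

x = tf.random.normal((32, 64, d_model))
print(ffn_dense(x).shape, ffn_einsum(x).shape)  # both (32, 64, 800)
```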
parabellum630 OP t1_ix4fc49 wrote
Thank you!! I was experimenting with an off-the-shelf implementation with little customization. I am using the transformer in an encoder-only fashion with 800 hidden dimensions because of the constraints of the other models surrounding it. I will try varying all of these hyperparameters. Looks like it's going to be a long week.