ChangingHats t1_ix49df8 wrote
Reply to [R] Tips on training Transformers by parabellum630
- What justification do you have for using 12 layers as opposed to 1?
- Why 800 hidden-dim?
- Why encoder-decoder instead of encoder-only or decoder-only?
- How are your tensors formatted, and how does that interact with the attention layer?
- Are your tensors even formatted correctly? Double-check EVERYTHING.
- Are you masking properly (causal masking for time series)? See the sketch right after this list.
- Are you using an appropriate loss function?
- Are you using pre-norm, post-norm, or ReZero? (The sketch below uses pre-norm.)
- How are your weights being initialized?
- Why does the batch size need to be as high as possible? I've read that low batch sizes can be preferable, but ultimately this is data-dependent anyway. Do you have a reliable way of tuning the batch size? Keep in mind that varying batch sizes will affect your metrics unless your "test" dataset always uses the same batch size, regardless of the "train" and "validation" batch sizes.
- AFAIK, there's really only one learning rate, and it's set on the optimizer you pass to compile()/fit() (see the short snippet below); what proportion of the error actually gets backpropagated to each layer depends on your model's internal structure.
- Remember that the original paper's model was made with respect to NLP, not your specific domain of concern. Screw around with the model structure as you see fit.
- What are you using for the feed-forward part of your encoder/decoder layers? I use EinsumDense; others use convolutions, etc.
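To tie a few of these together (masking, pre-norm, the feed-forward part), here's a minimal sketch of one encoder block. The sizes are made up for illustration, and the feed-forward part is plain Dense here; EinsumDense or Conv1D can be swapped in the same way:

```python
import tensorflow as tf

# Made-up sizes for illustration only, not a recommendation.
D_MODEL, NUM_HEADS, D_FF = 64, 4, 128


class PreNormCausalBlock(tf.keras.layers.Layer):
    """One pre-norm encoder block with causal self-attention for time series."""

    def __init__(self, d_model=D_MODEL, num_heads=NUM_HEADS, d_ff=D_FF, **kwargs):
        super().__init__(**kwargs)
        self.norm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.norm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.attn = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=d_model // num_heads)
        # Feed-forward part: plain Dense here; EinsumDense or Conv1D are
        # drop-in alternatives.
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(d_ff, activation="relu"),
            tf.keras.layers.Dense(d_model),
        ])

    def call(self, x):
        # Causal mask: position t may only attend to positions <= t.
        t = tf.shape(x)[1]
        mask = tf.cast(tf.linalg.band_part(tf.ones((t, t)), -1, 0), tf.bool)

        # Pre-norm: normalize *before* each sublayer, then add the residual.
        h = self.norm1(x)
        x = x + self.attn(query=h, value=h, key=h, attention_mask=mask)
        x = x + self.ffn(self.norm2(x))
        return x


# Quick shape check on a (batch, timesteps, d_model) tensor.
block = PreNormCausalBlock()
print(block(tf.random.normal((2, 10, D_MODEL))).shape)  # (2, 10, 64)
```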
Ultimately what you need to do is analyze every single step of the process and keep track of how your data is being manipulated.
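On the learning-rate point, a throwaway model just to show where that single learning rate actually lives (Adam and 1e-4 are arbitrary choices, not recommendations):

```python
import tensorflow as tf

# The single global learning rate is a property of the optimizer handed to
# compile()/fit(), not of any individual layer.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), loss="mse")
```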
ChangingHats t1_j4r2hxx wrote
Reply to [D] Simple Questions Thread by AutoModerator
I am trying to use TensorFlow's MultiHeadAttention to do regression on time series data, forecasting a `(batch, horizon, features)` tensor.
During training, I have `inputs ~> (1, 10, 1)` and `targets ~> (1, 10, 1)`. `targets` is a horizon-shifted version of `inputs`.
During inference, `targets` is just a zeros tensor of the same shape.
What's the best way to run attention such that the output utilizes all timesteps in `inputs` as well as each subsequent timestep of the resulting attention output, instead of ONLY the timesteps of the inputs?
Another problem I see is that attention is run between Q and K, and during inference, Q = K, so that will affect the output differently, no?
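To make the shapes concrete, here's a stripped-down sketch of the kind of setup I mean (the projection sizes are made up, and the exact query/key/value wiring is the part I'm unsure about):

```python
import tensorflow as tf

BATCH, HORIZON, FEATURES, D_MODEL = 1, 10, 1, 32  # D_MODEL is made up

mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=D_MODEL // 4)
proj_q = tf.keras.layers.Dense(D_MODEL)
proj_kv = tf.keras.layers.Dense(D_MODEL)
head = tf.keras.layers.Dense(FEATURES)

inputs = tf.random.normal((BATCH, HORIZON, FEATURES))   # observed series
targets = tf.random.normal((BATCH, HORIZON, FEATURES))  # horizon-shifted version of inputs
# At inference time targets would instead be:
# targets = tf.zeros_like(inputs)

# Causal mask so the output at step t can only see timesteps <= t.
mask = tf.cast(tf.linalg.band_part(tf.ones((HORIZON, HORIZON)), -1, 0), tf.bool)

# Queries from the (shifted / zeroed) targets, keys and values from the inputs.
out = head(mha(query=proj_q(targets), value=proj_kv(inputs), key=proj_kv(inputs),
               attention_mask=mask))
print(out.shape)  # (1, 10, 1)
```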