Submitted by abc220022 t3_100y331 in MachineLearning
notforrob t1_j2lnrhg wrote
Assuming your goal is autoregressive sequence prediction, I would just modify the start-of-sequence token. For example: use some reasonable model which takes the non-sequential context and creates a vector, then add that vector to the learned start-of-sequence token vector. Future time steps will be able to attend to the start-of-sequence token as needed to retrieve the context.
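A minimal PyTorch sketch of that idea, assuming a decoder-only setup; all class, layer, and dimension names here (ContextConditionedDecoder, context_proj, context_dim, etc.) are hypothetical stand-ins, not from the original comment:

```python
import torch
import torch.nn as nn

class ContextConditionedDecoder(nn.Module):
    def __init__(self, vocab_size, d_model, context_dim, num_layers=4, nhead=8, max_len=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        # Learned start-of-sequence token embedding.
        self.sos_emb = nn.Parameter(torch.randn(d_model))
        # "Some reasonable model" mapping the non-sequential context to a vector;
        # a single linear layer stands in for it here.
        self.context_proj = nn.Linear(context_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers)  # used with a causal mask
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, context):
        # tokens: (batch, seq_len) token ids; context: (batch, context_dim)
        b, t = tokens.shape
        # Add the projected context to the learned SOS embedding and prepend it.
        sos = (self.sos_emb + self.context_proj(context)).unsqueeze(1)  # (b, 1, d_model)
        x = torch.cat([sos, self.token_emb(tokens)], dim=1)
        positions = torch.arange(t + 1, device=tokens.device)
        x = x + self.pos_emb(positions)
        # Causal mask: each step attends only to earlier positions, including the SOS token.
        mask = nn.Transformer.generate_square_subsequent_mask(t + 1).to(tokens.device)
        h = self.decoder(x, mask=mask)
        return self.lm_head(h)
```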
If you're only using a transformer encoder, and not doing the autoregressive thing, I would just add an additional token to the input. I would most likely use a learned position encoding to add to that context vector rather than the normal sequential position embedding. Any time step will be able to attend to this special token and take advantage of the context clue you're providing.
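A sketch of the encoder-only variant under the same assumptions (hypothetical names throughout): the extra context token gets its own learned embedding in place of a sequential position embedding, and is simply concatenated onto the input so every position can attend to it.

```python
import torch
import torch.nn as nn

class ContextTokenEncoder(nn.Module):
    def __init__(self, vocab_size, d_model, context_dim, num_layers=4, nhead=8, max_len=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        # Learned embedding added to the context token instead of a sequential position.
        self.context_pos = nn.Parameter(torch.randn(d_model))
        self.context_proj = nn.Linear(context_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, tokens, context):
        # tokens: (batch, seq_len) token ids; context: (batch, context_dim)
        b, t = tokens.shape
        ctx = (self.context_proj(context) + self.context_pos).unsqueeze(1)  # (b, 1, d_model)
        positions = torch.arange(t, device=tokens.device)
        x = self.token_emb(tokens) + self.pos_emb(positions)
        # Prepend the context token; with full (non-causal) attention every
        # position can read the context as needed.
        x = torch.cat([ctx, x], dim=1)
        return self.encoder(x)
```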