
Avelina9X t1_jbt4o8y wrote

So the attention mechanism has O(N^2) space and time complexity with respect to sequence length. However, if you are memory constrained, you can bring the memory cost per generated token down to O(N) by computing attention for only one token at a time and caching the previous keys and values (a KV cache). This is only really possible at inference time, and it requires that the architecture was implemented with caching in mind.
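
A rough sketch of what that looks like for a single attention head (NumPy; variable names like `kv_cache` and `d_model` are my own, not from any particular library):

```python
import numpy as np

d_model = 64
rng = np.random.default_rng(0)

# Projection matrices for one attention head (randomly initialised for the sketch).
W_q = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
W_k = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
W_v = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_one_token(x_t, kv_cache):
    """Attention for a single new token x_t of shape (d_model,).

    kv_cache holds all previous keys/values, each of shape (t, d_model),
    so memory grows O(N) with sequence length instead of materialising
    the full N x N attention matrix.
    """
    q = x_t @ W_q
    k = x_t @ W_k
    v = x_t @ W_v

    # Append this token's key/value to the cache.
    kv_cache["k"] = k[None] if kv_cache["k"] is None else np.vstack([kv_cache["k"], k[None]])
    kv_cache["v"] = v[None] if kv_cache["v"] is None else np.vstack([kv_cache["v"], v[None]])

    # Score the new query against all cached keys: O(t) work and memory per token.
    scores = kv_cache["k"] @ q / np.sqrt(d_model)   # (t,)
    weights = softmax(scores)
    return weights @ kv_cache["v"]                  # (d_model,)

# Decode a few tokens one at a time, reusing the cache each step.
cache = {"k": None, "v": None}
for step in range(5):
    x_t = rng.standard_normal(d_model)  # stand-in for the current token's embedding
    out = attend_one_token(x_t, cache)
print(out.shape, cache["k"].shape)      # (64,) (5, 64)
```

The point is just that each decoding step only computes one query and reads the cached keys/values, rather than recomputing attention over the whole sequence.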
