Comments


kevindamm t1_jbq3w44 wrote

The analysis isn't as straightforward as that, for a few reasons. Transformer architectures are typically a series of alternating Multi-Head Attention (MHA) and Multi-Layer Perceptron (MLP) blocks, where the MHA merges the outputs of several attention heads. Each layer is dominated by a matrix multiply, so if everything were computed on a CPU, a reasonable upper bound would be O(n^3), where n is the width of the widest layer. But the bottleneck isn't how many multiplies a CPU would have to do, because we typically run these models on a GPU or TPU, which can parallelize most of the additions and multiplications in the matrix ops. The real bottleneck is often the memory copies to and from the GPU or TPU, and that varies greatly with model size, GPU memory limits, batch size, and so on.

You're better off profiling performance for a particular model and hardware combination.

19
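
To act on the profiling suggestion above, here is a minimal sketch using PyTorch's built-in profiler. The toy nn.TransformerEncoder, the input shape, and the sort key are placeholders; swap in your own model, data, and batch size to see whether matmuls or memory movement dominate on your hardware.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Toy encoder stack; replace with your actual model and input shapes.
model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=6,
).eval()
x = torch.randn(8, 128, 512)  # (batch, sequence length, d_model)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()
    activities.append(ProfilerActivity.CUDA)

with torch.no_grad(), profile(activities=activities, record_shapes=True) as prof:
    model(x)

# Per-op breakdown: compare the matmul kernels against memcpy/transfer ops.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```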

Hostilis_ t1_jbqh1fm wrote

In terms of layer width, all operations within a single transformer layer are O(n^2), with n the width of the largest matrix in the layer. The architecture is sequential, so depth contributes a multiplicative factor of d. Finally, transformers are quadratic in the context length c. So in total: O(n^2 d c^2).

There is generally not much difference between different transformer architectures in terms of the computational complexity.

3
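
As a quick illustration of how the O(n^2 d c^2) bound above behaves, here is a tiny sketch; the concrete n, d, and c values are arbitrary and only the ratios matter.

```python
def transformer_cost_bound(n: int, d: int, c: int) -> int:
    """Upper bound on work per forward pass, following O(n^2 * d * c^2):
    n = width of the largest matrix per layer, d = depth, c = context length."""
    return n**2 * d * c**2

# Doubling the context length quadruples the bound; doubling depth only doubles it.
base = transformer_cost_bound(n=4096, d=48, c=2048)
print(transformer_cost_bound(n=4096, d=48, c=4096) / base)  # 4.0
print(transformer_cost_bound(n=4096, d=96, c=2048) / base)  # 2.0
```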

appenz t1_jbqsu7k wrote

Both of the answers above are correct, and if you care about the structure of the transformer (i.e., depth, number of layers, etc.), it gets complicated.

If you only care about scaling with the number of weights, most transformers scale as O(weights), and for a generative transformer like GPT the forward pass costs approximately 2 * weights FLOPs per generated token.

4
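
As a back-of-the-envelope illustration of the 2 * weights rule of thumb (the parameter count below is just an example figure):

```python
def forward_flops_per_token(num_weights: float) -> float:
    """Rough forward-pass cost: ~2 FLOPs (one multiply + one add) per weight per token."""
    return 2 * num_weights

# e.g. a 175B-parameter GPT-style model: ~3.5e11 FLOPs, i.e. ~350 GFLOPs per token.
print(forward_flops_per_token(175e9))
```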

PassingTumbleweed t1_jbri1kj wrote

I won't repeat what other comments have said, but there are interesting architectures like H-Transformer that have lower asymptotic complexity and scale to longer sequences than the original Transformer. It's also worth noting that, in practice, the MLP cost may dominate the self-attention cost or vice versa, depending on the sequence length and model size.

2
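
To get a feel for when the MLP or the self-attention cost dominates, here is a rough per-layer FLOP count. The constant factors assume a standard decoder-style layer with a 4x MLP expansion and ignore heads, biases, and normalization, so treat it as a sketch rather than an exact count.

```python
def per_layer_flops(c: int, h: int) -> tuple:
    """Very rough per-layer FLOP counts for hidden size h and sequence length c.
    Attention: QKV/output projections (~8ch^2) plus the c-by-c score and
    weighted-sum matmuls (~4c^2h). MLP: two h <-> 4h projections (~16ch^2)."""
    attention = 8 * c * h**2 + 4 * c**2 * h
    mlp = 16 * c * h**2
    return attention, mlp

# With h = 4096, the quadratic attention term only overtakes the MLP around c ~ 8k.
for c in (512, 2048, 8192, 32768):
    attn, mlp = per_layer_flops(c, 4096)
    print(c, f"attention/mlp = {attn / mlp:.2f}")
```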

Avelina9X t1_jbt4o8y wrote

So the attention mechanism has O(N^2) space and time complexity in the sequence length. However, if you are memory constrained, it is possible to get the memory requirement per generated token down to O(N) by computing attention for one token at a time and caching the previous keys and values. This is only really possible at inference time, and it requires that the architecture be implemented with caching in mind.

1
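
A minimal single-head, unbatched sketch of the caching idea above; the weights and sizes are arbitrary, and real implementations add output projections, masking, batching, and multiple heads. Each decoding step does O(t) work against the cache instead of recomputing the full t x t attention matrix.

```python
import torch

def cached_attention_step(x_t, w_q, w_k, w_v, k_cache, v_cache):
    """One decoding step with a KV cache. x_t is the single new token's hidden
    state (1, h); k_cache / v_cache hold keys and values for all previous tokens."""
    q = x_t @ w_q                                         # (1, h)
    k_cache = torch.cat([k_cache, x_t @ w_k], dim=0)      # (t, h): grows one row per step
    v_cache = torch.cat([v_cache, x_t @ w_v], dim=0)      # (t, h)
    scores = (q @ k_cache.T) / k_cache.shape[-1] ** 0.5   # (1, t) -- linear, not quadratic
    out = torch.softmax(scores, dim=-1) @ v_cache         # (1, h)
    return out, k_cache, v_cache

h = 64
w_q, w_k, w_v = (torch.randn(h, h) for _ in range(3))
k_cache = v_cache = torch.empty(0, h)
for _ in range(10):  # decode 10 tokens, one at a time
    x_t = torch.randn(1, h)  # stand-in for the current token's hidden state
    out, k_cache, v_cache = cached_attention_step(x_t, w_q, w_k, w_v, k_cache, v_cache)
```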