Submitted by Smooth-Earth-9897 t3_11nzinb in MachineLearning
kevindamm t1_jbq3w44 wrote
The analysis isn't as straightforward as that, for a few reasons. Transformer architectures are typically a series of alternating multi-head attention (MHA) and multi-layer perceptron (MLP) blocks, where each MHA block concatenates and projects the outputs of several attention heads. Each layer is dominated by a matrix multiply, and if it were all being computed on a CPU then a reasonable upper bound would be O(n^3), where n is the width of the widest layer. But the bottleneck isn't how many multiplies a CPU would have to do, because these models are typically run on a GPU or TPU, which can parallelize most of the additions and multiplications in the matrix ops. The real bottleneck is often the memory transfers to and from the GPU or TPU, and that varies greatly with model size, GPU memory limits, batch size, etc.
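To make that concrete, here's a rough back-of-envelope sketch (not anything from the thread) comparing the arithmetic cost of one dense layer's matmul against the bytes that have to move. The layer width, batch size, and hardware numbers are hypothetical placeholders, but they illustrate why data movement rather than the raw multiply count often dominates:

```python
# Rough roofline-style estimate: compute time vs. memory-transfer time for one
# dense (n x n) layer. All numbers below are hypothetical placeholders.

def matmul_flops(batch: int, n: int) -> float:
    # A (batch x n) @ (n x n) matmul costs roughly 2 * batch * n^2 FLOPs.
    return 2.0 * batch * n * n

def weight_bytes(n: int, bytes_per_elem: int = 2) -> float:
    # Bytes to stream the n x n weight matrix (fp16 here).
    return float(n * n * bytes_per_elem)

peak_flops = 100e12      # hypothetical accelerator: 100 TFLOP/s
peak_bandwidth = 1e12    # hypothetical accelerator: 1 TB/s memory bandwidth

batch, n = 8, 4096
compute_time = matmul_flops(batch, n) / peak_flops
memory_time = weight_bytes(n) / peak_bandwidth

print(f"compute-bound time: {compute_time * 1e6:.1f} us")
print(f"memory-bound time:  {memory_time * 1e6:.1f} us")
# At this small batch size the memory time is ~10x the compute time,
# i.e. the matmul is bandwidth-limited, not FLOP-limited.
```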
You're better off profiling performance for a particular model and hardware combination.
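For example, a minimal profiling sketch with PyTorch's built-in profiler (assuming a CUDA device is available; the model and input shapes are placeholders, swap in your own):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model and input; replace with your actual model and shapes.
model = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8).cuda().eval()
x = torch.randn(32, 64, 512, device="cuda")  # (seq_len, batch, d_model)

with torch.no_grad(), profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,
) as prof:
    model(x)

# Sort by GPU time to see whether matmul kernels or memcpy/data-movement
# ops dominate for this particular model/hardware combination.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```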