appenz t1_jbqsu7k wrote

Both of the answers above are correct, and if you care about the structure of the transformer (i.e. depth, layers, etc.), it gets complicated.

If you only care about scaling with the number of weights, the compute per token for most transformers scales with O(weights), and a generative transformer like GPT needs approximately 2*weights floating-point operations per generated token.
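As a rough sketch of the 2*weights rule of thumb, here is the arithmetic in Python. It assumes the figure means floating-point operations per token for a forward pass (one multiply and one add per weight). The function names and the 175-billion parameter count are just illustrative, and the estimate ignores the attention cost that grows with context length, which is where the structure starts to matter.

```python
def forward_flops_per_token(num_weights: int) -> int:
    """Rule-of-thumb forward-pass cost for one token.

    Assumption: each weight contributes one multiply and one add,
    so the cost is about 2 * num_weights FLOPs.
    """
    return 2 * num_weights


def generation_flops(num_weights: int, num_tokens: int) -> int:
    """Rough total cost of generating num_tokens tokens, one forward pass each."""
    return forward_flops_per_token(num_weights) * num_tokens


if __name__ == "__main__":
    weights = 175_000_000_000  # illustrative parameter count
    per_token = forward_flops_per_token(weights)
    total = generation_flops(weights, 100)
    print(f"~{per_token:.2e} FLOPs per token")
    print(f"~{total:.2e} FLOPs for 100 tokens")
```

Doubling the number of weights doubles the estimate, which is all O(weights) is saying here.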
