BellyDancerUrgot t1_j7eq93o wrote on February 6, 2023 at 6:31 AM

Reply to comment by beautyofdeduction in Why does my Transformer blow GPU memory? by beautyofdeduction

Each Float64 is 4 bytes. You said you have 22M parameters.

Also, besides your params and activations, you still have gradients, and the sequences are mapped for each attention head, so multiply that by 8 as well.

For context, I think DeepLabv3, which IIRC is a model with 58M parameters, was trained on 8 V100s.

Edit: I clearly had a brain stroke while writing the first part, so ignore it.

beautyofdeduction OP t1_j7eqr8c wrote on February 6, 2023 at 6:37 AM

8 Bytes * 22M = 0.176 GB?
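
For anyone checking the arithmetic, here is a minimal Python sketch of that calculation. The helper name `param_memory_gb` is made up for illustration, and it counts only the raw parameter storage (no gradients, optimizer state, or activations), using 1 GB = 10^9 bytes.

```python
def param_memory_gb(num_params: int, bytes_per_param: int) -> float:
    """Raw storage for the parameters alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


num_params = 22_000_000  # the 22M parameters mentioned in the thread

# Standard element sizes: float64 = 8 bytes, float32 = 4 bytes, float16 = 2 bytes.
for dtype_name, size in [("float64", 8), ("float32", 4), ("float16", 2)]:
    print(f"{dtype_name}: {param_memory_gb(num_params, size):.3f} GB")
```

At 8 bytes per parameter this gives 0.176 GB, and half that at 4 bytes per parameter, so the weights by themselves are a small part of the budget.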

BellyDancerUrgot t1_j7f0u7u wrote on February 6, 2023 at 8:52 AM

Okay yeah, idk wtf I was typing. Yes, 0.176 GB for just the parameters. You still have to account for dense representations of long sequences (and that 8 times over, once per attention head), activations, and gradients, all multiplied by the number of layers. There was a formula to approximate the value that I read somewhere online. Activations, I think, take up way more memory than the model itself.

The memory requirement is roughly in line with most mid-size transformer models, I think.
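
To make the "activations dominate" point concrete, here is a rough Python sketch. It is not the formula mentioned above, just a back-of-the-envelope count of the attention score matrices, assuming a standard implementation that materializes one full seq_len x seq_len matrix per head, per sequence, per layer and keeps it for the backward pass. Only the 8 heads come from this thread; the batch size, sequence length, layer count, and float32 storage are placeholder assumptions, and the function name `attention_scores_gb` is hypothetical.

```python
def attention_scores_gb(batch: int, heads: int, seq_len: int,
                        layers: int, bytes_per_value: int = 4) -> float:
    """Memory for the attention score matrices alone, in GB (1 GB = 1e9 bytes).

    Counts one (seq_len x seq_len) matrix per head, per sequence, per layer.
    Ignores softmax outputs, other activations, gradients, and optimizer
    state, so the real footprint would be larger than this.
    """
    return batch * heads * seq_len * seq_len * layers * bytes_per_value / 1e9


# Placeholder values: only heads=8 is taken from the thread.
estimate = attention_scores_gb(batch=8, heads=8, seq_len=4096, layers=6)
print(f"attention scores: {estimate:.2f} GB")
```

With these made-up settings the score matrices alone come to roughly 26 GB, against 0.176 GB for the parameters. The quadratic dependence on sequence length is what makes long sequences expensive, so halving seq_len cuts this term by four.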

beautyofdeduction OP t1_j7hkq74 wrote on February 6, 2023 at 9:13 PM

That context of how much memory other models use up is helpful. Thanks for taking the time to respond.