beautyofdeduction OP t1_j7jqohn wrote
Reply to comment by neuralbeans in Why does my Transformer blow GPU memory? by beautyofdeduction
I wish I could send you my GitHub. But the original Attention Is All You Need paper trained on sequences of length 25,000 on multiple K80s (as stated by the authors), each of which has only 12GB of VRAM. Yes, they used multiple GPUs, but AFAIK each GPU still needs to be able to handle its own batch. Or maybe not? Again, I wish I could show you my code.
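For context on why splitting the batch across GPUs wouldn't help here, this is a rough back-of-envelope sketch (the head count and fp32 precision are assumptions, not the paper's actual config) of the attention-score memory for a single 25,000-token sequence on one GPU:

```python
# Rough sketch with assumed values (8 heads, fp32), not the paper's config.
# Data parallelism splits the batch, not the sequence, so each GPU still
# materializes a full seq_len x seq_len attention matrix per head.

seq_len = 25_000       # sequence length discussed above
n_heads = 8            # hypothetical head count
bytes_per_float = 4    # fp32

attn_bytes = n_heads * seq_len ** 2 * bytes_per_float
print(f"Attention scores for ONE sequence: {attn_bytes / 1e9:.1f} GB")
# -> ~20 GB, already more than a 12GB K80, before weights or optimizer state.
```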