Why does my Transformer blow GPU memory?
Submitted by beautyofdeduction on February 6, 2023 at 2:12 AM in deeplearning · 12 comments
ia3leonid wrote on February 6, 2023 at 8:46 PM:

Gradients are also stored and take as much memory as the weights plus activations, or more with some optimisers (Adam, for example, also tracks statistics for each weight).
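To make that concrete, here is a rough back-of-the-envelope sketch (the function name and the 125M-parameter figure are illustrative, and it assumes plain fp32 Adam): for every weight you pay once for the parameter, once for its gradient, and twice for Adam's running statistics, before a single activation is counted.

```python
def training_memory_bytes(num_params: int, dtype_bytes: int = 4) -> dict:
    """Rough per-tensor-class memory for training with Adam in fp32.

    Activations are deliberately excluded: they scale with batch size,
    sequence length, and depth, not just with parameter count.
    """
    weights = num_params * dtype_bytes   # the parameters themselves
    grads = num_params * dtype_bytes     # one gradient per weight
    # Adam keeps two statistics per weight: the first moment (exp_avg)
    # and the second moment (exp_avg_sq).
    adam_state = 2 * num_params * dtype_bytes
    return {
        "weights": weights,
        "grads": grads,
        "adam_state": adam_state,
        "total": weights + grads + adam_state,
    }

# Hypothetical 125M-parameter Transformer in fp32:
print(training_memory_bytes(125_000_000))
# weights ~0.5 GB, grads ~0.5 GB, Adam state ~1.0 GB -> ~2 GB
# before storing any activations.
```

For Transformers specifically, the activations (which grow with batch size × sequence length) are often what actually tips the GPU over, on top of this fixed per-weight cost.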