Submitted by nateharada t3_10do40p in MachineLearning
Fit_Schedule5951 t1_j4obl4w wrote
Nice. I think an extension where this could be beneficial is when your process hangs: it's holding full GPU memory but not actually training. This happened to me recently while training models with fairseq. (I'm not sure how you could catch these conditions.)
nateharada OP t1_j4otocf wrote
This tool doesn't look at memory right now, just actual computation. Usually loading your model eats up close to the maximum memory until training is done, even if compute usage is very low.
If your training is hanging but still burning GPU cycles, that'd be harder to detect, I think.
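For anyone curious about the memory-vs-compute distinction being discussed, here's a rough sketch (not how this tool is implemented, just an illustration) of reading the two separately with pynvml. It assumes the nvidia-ml-py package is installed and looks at device 0 only.

```python
# Illustrative only: query GPU memory usage and compute utilization separately.
# Memory tends to stay near its peak once the model is loaded, while the
# utilization figure drops if the job is idle or hung.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumes GPU 0

mem = pynvml.nvmlDeviceGetMemoryInfo(handle)          # allocated memory
util = pynvml.nvmlDeviceGetUtilizationRates(handle)   # .gpu = compute activity %

print(f"memory used: {mem.used / 1e9:.1f} GB / {mem.total / 1e9:.1f} GB")
print(f"compute utilization: {util.gpu}%")

pynvml.nvmlShutdown()
```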
bay_der t1_j4papbd wrote
One way I have figured out is to put a watch on the log file.
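Something like this, for example: a small loop that checks how long it's been since the training log was last written to and flags the job as probably hung past some threshold. The log path and the 15-minute cutoff below are just placeholders.

```python
# Sketch of "put a watch on the log file": if the log hasn't been written to
# for a while, the training process has probably hung.
import os
import time

LOG_PATH = "train.log"   # hypothetical path to your training log
STALE_AFTER = 15 * 60    # seconds without a write before flagging a hang

while True:
    age = time.time() - os.path.getmtime(LOG_PATH)
    if age > STALE_AFTER:
        print(f"No log output for {age / 60:.0f} min -- training may be hung")
        # hook in an email/Slack notification here if you want a real alert
    time.sleep(60)
```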