Submitted by nateharada t3_10do40p in MachineLearning
Fit_Schedule5951 t1_j4obl4w wrote
Nice. I think an extension where this could be beneficial is when your process hangs: it's holding full GPU memory but not actually training. This happened to me recently while training models with fairseq. (I'm not sure how you could catch these conditions.)
nateharada OP t1_j4otocf wrote
This tool doesn't look at memory right now, just actual computation. Usually loading your model eats up close to the maximum memory until training is done, even if compute usage is very low.
If your training is hanging but still burning GPU cycles, that'd be harder to detect, I think.
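For anyone curious about the memory-vs-compute distinction being discussed, here's a rough sketch (not how this tool is implemented, just an illustration) of reading the two separately with pynvml. It assumes the nvidia-ml-py package is installed and looks at device 0 only.

```python
# Illustrative only: query GPU memory usage and compute utilization separately.
# Memory tends to stay near its peak once the model is loaded, while the
# utilization figure drops if the job is idle or hung.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumes GPU 0

mem = pynvml.nvmlDeviceGetMemoryInfo(handle)          # allocated memory
util = pynvml.nvmlDeviceGetUtilizationRates(handle)   # .gpu = compute activity %

print(f"memory used: {mem.used / 1e9:.1f} GB / {mem.total / 1e9:.1f} GB")
print(f"compute utilization: {util.gpu}%")

pynvml.nvmlShutdown()
```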
bay_der t1_j4papbd wrote
One way I have figured out is to put a watch on the log file.
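Something like this, for example: a small loop that checks how long it's been since the training log was last written to and flags the job as probably hung past some threshold. The log path and the 15-minute cutoff below are just placeholders.

```python
# Sketch of "put a watch on the log file": if the log hasn't been written to
# for a while, the training process has probably hung.
import os
import time

LOG_PATH = "train.log"   # hypothetical path to your training log
STALE_AFTER = 15 * 60    # seconds without a write before flagging a hang

while True:
    age = time.time() - os.path.getmtime(LOG_PATH)
    if age > STALE_AFTER:
        print(f"No log output for {age / 60:.0f} min -- training may be hung")
        # hook in an email/Slack notification here if you want a real alert
    time.sleep(60)
```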