nateharada OP t1_j4otocf wrote

This tool doesn't look at memory right now, only at actual compute utilization. Loading your model into GPU memory usually eats up close to the max memory until training is done, even if compute usage is very low.

If your training is hanging but still burning GPU cycles, that'd be harder to detect, I think.
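
To make the memory-versus-compute distinction concrete, here is a rough sketch that reads both numbers from `nvidia-smi` and bases the "idle" decision on utilization alone. It is not how the tool itself works; the function names and the 5% threshold are made up for illustration, and it assumes `nvidia-smi` is on your PATH.

```python
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (float(field) for field in line.split(","))
        stats.append(
            {"util_pct": util, "mem_used_mib": used, "mem_total_mib": total}
        )
    return stats


def read_gpu_stats():
    """Query the local GPUs (requires an NVIDIA driver and nvidia-smi)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_stats(result.stdout)


def looks_idle(gpu, util_threshold_pct=5.0):
    """Judge idleness by compute utilization only; memory is ignored on purpose,
    since a loaded model keeps memory high even when nothing is running."""
    return gpu["util_pct"] < util_threshold_pct
```

A GPU with nearly full memory and 0% utilization counts as idle here, which is the case described above. A hung job that keeps utilization high would still slip through.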

4

bay_der t1_j4papbd wrote

One way I've found is to put a watch on the log file: if it stops getting updated, the training has probably hung.
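
A minimal sketch of that idea, assuming the training script appends to its log regularly. The function names, the 10-minute timeout, and the 30-second poll interval are placeholders to adjust for your own job.

```python
import os
import time


def seconds_since_update(path, now=None):
    """Return how many seconds ago the file at `path` was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def is_stalled(path, timeout_s, now=None):
    """True if the log has not been written to for more than `timeout_s` seconds."""
    return seconds_since_update(path, now) > timeout_s


def watch(path, timeout_s=600, poll_s=30):
    """Poll until the log goes quiet, then return so the caller can alert or kill the job."""
    while not is_stalled(path, timeout_s):
        time.sleep(poll_s)
    print(f"{path} has not changed in over {timeout_s}s; training may be hung")
```

Calling `watch("train.log")` (a hypothetical path) blocks until the file has been silent for the timeout. This catches a hang even when the GPU still shows activity, as long as a healthy run logs more often than the timeout.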

2