Submitted by nateharada t3_10do40p in MachineLearning

Hey /r/machinelearning,

Long time reader, first time posting non-anonymously. I've been training models on various cloud services, but as an individual user it's stressful to have to worry about shutting down instances when training fails or stops. Crashes, bad code, etc. can cause GPU utilization to drop without the program successfully "finishing", and that idle time can cost a lot of money if you don't catch it quickly.

Thus, I built this tiny lil tool to help. It watches the GPU utilization of your instance and performs an action if it drops too low for too long. For example, it can shut down the instance if GPU usage drops under 30% for 5 minutes.
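
Under the hood the idea is just a small polling loop: check utilization every second, track how long it's been below the threshold, and fire an action when the timer runs out. Here's a simplified sketch of that pattern using pynvml (not the actual gpu_sentinel internals, just an illustration using the thresholds from the example above):

import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

idle_since = None
while True:
    # GPU utilization as a percentage (0-100).
    util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
    if util < 30:
        # Start (or continue) the idle timer.
        idle_since = idle_since or time.time()
        if time.time() - idle_since > 5 * 60:
            # Here you'd shut down the instance, kill the process, send an alert, etc.
            break
    else:
        idle_since = None
    time.sleep(1)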

It's easy to use and install: just pip install gpu_sentinel

If this is useful please leave comments here or on the Github page: https://github.com/moonshinelabs-ai/gpu_sentinel

I'm hoping it helps save some other folks money!

86

Comments

Zealousideal_Low1287 t1_j4n2ahm wrote

Looks nice. I probably wouldn’t use it for shutting down or anything, but a notification on failure might be useful!

5

nateharada OP t1_j4ne979 wrote

Nice! Right now you can use the end_process trigger, which just exits the process with return code 0 when it fires, but it should be fairly straightforward to expose a bit more of the API. That would let you do something like this in your script:

import time

from gpu_sentinel import Sentinel, get_gpu_usage

def my_callback_fn():
    # Your own handler, called once the kill condition fires.
    print("GPU has been idle for too long!")

sentinel = Sentinel(
    arm_duration=10,       # seconds of sustained usage before the sentinel arms
    arm_threshold=0.7,     # utilization level that counts as "busy"
    kill_duration=60,      # seconds of low usage before kill_fn is called
    kill_threshold=0.7,    # utilization level below which the GPU counts as idle
    kill_fn=my_callback_fn,
)

while True:
    gpu_usage = get_gpu_usage(device_ids=[0, 1, 2, 3])
    sentinel.tick(gpu_usage)
    time.sleep(1)

Is that something that would be useful? You can define the callback function yourself, so you could trigger an alert, send a notification, etc.
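
For instance, a callback that fires off a notification rather than just printing could look roughly like this (the webhook URL is a placeholder):

import requests

WEBHOOK_URL = "https://hooks.example.com/gpu-alerts"  # placeholder

def my_callback_fn():
    # Notify instead of shutting anything down.
    requests.post(WEBHOOK_URL, json={"text": "GPU utilization has been low for too long."})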

5

Zealousideal_Low1287 t1_j4neybb wrote

Yeah, that’s something which would be useful indeed. Don’t worry yourself though, I can put in a PR.

5

nateharada OP t1_j4ngy65 wrote

It's actually almost entirely ready now; I just need to alter a few things. I'll go ahead and push it soon! Just need to do some final tests.

EDIT: The above code should work! See the README on the GitHub page for a complete example.

6

MrAcurite t1_j4t9ch1 wrote

At work, we've got this thing that will notify you if a cloud instance has been running for 24 hours. However, it does this by messaging your work email; you can't configure it to go to a personal device or anything. Meaning, if you set a job to run at the end of the week, you can come back on Monday to over a thousand dollars of cloud charges and like fifty angry emails about it.

1

extracompute t1_j4tnogh wrote

Ha. computeX has automated notifs built in to avoid problems like these.

What's the biggest bill you've ever come back to on Monday AM?

1

Fit_Schedule5951 t1_j4obl4w wrote

Nice. I think an extension where this could be beneficial is when your process hangs: it's holding full GPU memory but not training. This happened to me recently while training models with fairseq. (I'm not sure how you can catch these conditions.)

2

nateharada OP t1_j4otocf wrote

This tool actually doesn't look at memory right now, just actual computation. Usually loading your model eats up basically the max memory until training is done, even if compute usage is very low.

If your training is hanging but still burning GPU cycles, that'd be harder to detect, I think.

4

bay_der t1_j4papbd wrote

One way I've figured out is to put a watch on the log file: if it stops updating, the run has probably hung or died.
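
Something along these lines (the path and timeout are just placeholders):

import os
import time

LOG_FILE = "train.log"   # placeholder path to your training log
STALE_AFTER = 10 * 60    # seconds without a write before we assume a hang

while True:
    age = time.time() - os.path.getmtime(LOG_FILE)
    if age > STALE_AFTER:
        # Log hasn't been touched in a while: alert, kill the job, or shut down.
        print("Training log looks stale, something is probably wrong.")
        break
    time.sleep(30)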

2

lorenzo1384 t1_j4r47aw wrote

Can I try this on Colab?

1

nateharada OP t1_j4tojyh wrote

Yeah, it should work if you use the API (and if you have a GPU in your Colab). I don't think it'll work with TPUs just yet.

2

lorenzo1384 t1_j4trj9v wrote

Thanks, and yes, I do have a premium GPU. I'm paying for all the proofs of concept I do, so this will be helpful.

1

ndemir t1_j4s5774 wrote

Good idea.

1

Kinwwizl t1_j4slam2 wrote

That's one of the reasons GCP is nice for ML training workloads: you can kill the VM after training finishes by calling poweroff at the end of the training bash script.
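
Roughly the same idea works from inside Python too, e.g. something like this (assuming the script is allowed to run poweroff with sudo):

import subprocess

def train():
    # ... your actual training code ...
    pass

if __name__ == "__main__":
    try:
        train()
    finally:
        # Power the VM off whether training finished or crashed.
        subprocess.run(["sudo", "poweroff"])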

1