Zealousideal_Low1287 t1_j4n2ahm wrote

Looks nice. I probably wouldn’t use it for shutting down or anything, but a notification on failure might be useful!

5

nateharada OP t1_j4ne979 wrote

Nice! Right now you can use the end_process trigger to just return 0 when the trigger is hit from the process, but it should be fairly straightforward to externalize the API a little bit more. This would let you do something like this in your script:

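import time

from gpu_sentinel import Sentinel, get_gpu_usage

def my_callback_fn():
    # Placeholder callback (no-argument signature is an assumption): put your alert logic here.
    print("GPU usage trigger hit")

sentinel = Sentinel(
    arm_duration=10,
    arm_threshold=0.7,
    kill_duration=60,
    kill_threshold=0.7,
    kill_fn=my_callback_fn,
)
while True:
    # Poll the GPUs once per second and feed the reading to the sentinel.
    gpu_usage = get_gpu_usage(device_ids=[0, 1, 2, 3])
    sentinel.tick(gpu_usage)
    time.sleep(1)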

Is that something that would be useful? You can define the callback function yourself so maybe you trigger an alert, etc.
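
For anyone wondering what the arm and kill parameters might mean, here is a rough standalone sketch of an arm-then-kill watcher. It is not gpu_sentinel's actual implementation. The class name SketchSentinel is made up, and the semantics are assumptions: usage is a list of per-device utilizations between 0 and 1, tick is called about once per second so durations count ticks, the watcher arms after the mean usage stays at or above arm_threshold for arm_duration ticks, and it calls kill_fn once after the mean stays below kill_threshold for kill_duration ticks.

```python
from typing import Callable, Sequence


class SketchSentinel:
    """Illustrative arm-then-kill watcher; NOT the real gpu_sentinel code."""

    def __init__(
        self,
        arm_duration: int,
        arm_threshold: float,
        kill_duration: int,
        kill_threshold: float,
        kill_fn: Callable[[], None],
    ) -> None:
        self.arm_duration = arm_duration
        self.arm_threshold = arm_threshold
        self.kill_duration = kill_duration
        self.kill_threshold = kill_threshold
        self.kill_fn = kill_fn
        self.armed = False   # True once the GPUs have been busy long enough
        self.fired = False   # True once kill_fn has been called
        self._streak = 0     # consecutive ticks satisfying the current condition

    def tick(self, gpu_usage: Sequence[float]) -> None:
        """Feed one reading (assumed: one call per second)."""
        if self.fired or not gpu_usage:
            return
        mean = sum(gpu_usage) / len(gpu_usage)
        if not self.armed:
            # Arming phase: count consecutive busy ticks.
            self._streak = self._streak + 1 if mean >= self.arm_threshold else 0
            if self._streak >= self.arm_duration:
                self.armed = True
                self._streak = 0
        else:
            # Armed phase: count consecutive idle ticks, then fire once.
            self._streak = self._streak + 1 if mean < self.kill_threshold else 0
            if self._streak >= self.kill_duration:
                self.fired = True
                self.kill_fn()
```

The arming step is there so the callback does not fire while a job is still loading data and the GPUs have not ramped up yet. The callback only fires after the GPUs were busy and then went quiet.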

5

Zealousideal_Low1287 t1_j4neybb wrote

Yeah, that’s something which would be useful indeed. Don’t worry yourself though, I can put in a PR.

5

nateharada OP t1_j4ngy65 wrote

It's actually almost entirely ready now, I just need to alter a few things. I'll go ahead and push it soon! Need to do some final tests.

EDIT: The above code should work! See the README on GitHub for a complete example.

6

MrAcurite t1_j4t9ch1 wrote

At work, we've got this thing that will notify you if a cloud instance has been running for 24 hours. However, it does this by messaging your work email, and you can't configure it to go to a personal device or anything. Meaning, if you set a job to run at the end of the week, you can come back on Monday to over a thousand dollars of cloud charges and like fifty angry emails about it.

1

extracompute t1_j4tnogh wrote

Ha. computeX has automated notifs built in to avoid problems like these.

What's the biggest bill you've ever come back to on Monday AM?

1