Viewing a single comment thread. View all comments

deephugs t1_j3n3qwj wrote

I think Ray is great! But Ray will not click your GPUs into a motherboard, install linux on all the machines, setup nvidia-docker, power cycle if there are issues, periodically clear up space on hdds, etc. Its the non-software part of cluster management that ends up being the most annoying and time consuming.

4

rlvsdlvsml t1_j3nd87h wrote

I have always felt like the network/security and integration with internal it systems was worse than the physical maintenance. Like people should expect that they have to invest time into integrating into a on-prem data center environment or physical maintenance stuff. I think small teams are benefited by a small gpu cluster with a fixed budget over large cloud gpu training costs. Mid-large companies do better with cloud than on-prem bc they can have better separation of environments but they cost more.

3