Weary-Marionberry-15 t1_j3li7ao wrote on January 9, 2023 at 11:43 AM

#1,315,118

I don’t think this looks bad at all. I would probably push for A100 80GB GPUs instead, and the latest-gen 64-core Threadripper.

[deleted] t1_j3lijty wrote on January 9, 2023 at 11:47 AM

#1,315,133

[deleted]

joossss OP t1_j3lj1su wrote on January 9, 2023 at 11:53 AM

#1,315,152

Replying to Weary-Marionberry-15 (#1,315,118)

Thanks! The newest Threadrippers are still based on Zen 3, so they don't support AVX512. I'd definitely like to go with A100s, but we don't have the budget for that.
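
If anyone wants to check AVX-512 support on an actual Linux machine rather than going by the spec sheet, here is a minimal sketch. It assumes the kernel lists CPU features on the `flags` line of `/proc/cpuinfo` and that AVX-512 features show up as flags starting with `avx512` (for example `avx512f`). The helper name `avx512_flags` is made up for illustration.

```python
from pathlib import Path


def avx512_flags(cpuinfo_text: str) -> set[str]:
    """Return the AVX-512 feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # Assumption: all cores report the same flags, so the first
            # "flags" line is treated as representative.
            return {flag for flag in value.split() if flag.startswith("avx512")}
    return set()


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    found = avx512_flags(cpuinfo.read_text()) if cpuinfo.exists() else set()
    print(sorted(found) if found else "no AVX-512 flags reported")
```

An empty result only means no such flags were reported on that machine; it says nothing about other CPU models.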

TrueBirch t1_j3mc7ua wrote on January 9, 2023 at 3:51 PM

#1,316,616

What made you decide to run an on-prem server instead of going to the cloud? I'm a data science manager and I'm currently looking at our options. I like self-hosting for most things, but I'm up in the air about training deep learning models.

deephugs t1_j3mt7re wrote on January 9, 2023 at 5:38 PM

#1,317,366

Replying to TrueBirch (#1,316,616)

Cloud is almost always better imo. At small scale you can prototype quicker and spend less time messing with hardware by using cloud services. Once you actually need to scale your product, a cloud solution makes that really easy. The "but it's cheaper" argument gets less and less valid every year, and it often doesn't account for the time and effort spent setting up a local cluster.

rlvsdlvsml t1_j3n2it2 wrote on January 9, 2023 at 6:35 PM

#1,317,834

Replying to deephugs (#1,317,366)

If you use Ray, you can set up a GPU cluster in less than 30 minutes.

deephugs t1_j3n3qwj wrote on January 9, 2023 at 6:42 PM

#1,317,890

Replying to rlvsdlvsml (#1,317,834)

I think Ray is great! But Ray will not click your GPUs into a motherboard, install Linux on all the machines, set up nvidia-docker, power cycle the machines if there are issues, periodically free up space on the HDDs, etc. It's the non-software part of cluster management that ends up being the most annoying and time-consuming.

rlvsdlvsml t1_j3nd87h wrote on January 9, 2023 at 7:39 PM

#1,318,351

Replying to deephugs (#1,317,890)

I have always felt that network/security and integration with internal IT systems were worse than the physical maintenance. People should expect to invest time in integrating with an on-prem data center environment and in physical maintenance. I think small teams benefit more from a small GPU cluster with a fixed budget than from large cloud GPU training costs. Mid-size to large companies do better with cloud than on-prem because they get better separation of environments, although cloud costs more.

Cosmic_peach94 t1_j3nrjte wrote on January 9, 2023 at 9:06 PM

#1,319,145

A recommendation I learned from a past job: use Slurm or a similar scheduler to take turns using the GPUs, so you don’t end up dropping each other’s models.
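
A rough sketch of what that could look like in practice: a small Python helper that writes a Slurm batch script requesting GPUs, so jobs queue up instead of colliding. It assumes the cluster has GPUs configured as a generic resource named `gpu`; the helper name `make_sbatch_script` and the `train.py` command are hypothetical.

```python
def make_sbatch_script(job_name: str, command: str, gpus: int = 1,
                       time_limit: str = "04:00:00") -> str:
    """Build the text of a Slurm batch script that reserves `gpus` GPUs."""
    if gpus < 1:
        raise ValueError("request at least one GPU")
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --gres=gpu:{gpus}",    # GPUs reserved for this job
        f"#SBATCH --time={time_limit}",  # wall-clock limit, HH:MM:SS
        "#SBATCH --output=%x-%j.out",    # %x = job name, %j = job ID
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical training command; save the output as job.sh and
    # submit it with `sbatch job.sh`.
    print(make_sbatch_script("resnet-run", "python train.py", gpus=2))
```

Each user submits a script like this and Slurm holds the job until the requested GPUs are free.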

[deleted] t1_j3nwcyh wrote on January 9, 2023 at 9:35 PM

#1,319,392

[deleted]

learn-deeply t1_j3nx99l wrote on January 9, 2023 at 9:40 PM

#1,319,430

Are you looking to do distributed training across machines? Otherwise the NIC seems like complete overkill.

joossss OP t1_j3q9m8r wrote on January 10, 2023 at 9:18 AM

#1,323,492

Replying to TrueBirch (#1,316,616)

The main reason for not going to the cloud is that we are a research institution, so our funding is project-based, meaning we have to use the funding in the allotted time. The second reason is that we already have the GPUs, so the server pays for itself faster.
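
The payback argument can be put as back-of-the-envelope arithmetic. This is only a sketch: the function name `payback_hours` and every number in the example are made up, and it ignores admin time, depreciation, and idle hours.

```python
def payback_hours(hardware_cost: float, cloud_rate_per_hour: float,
                  onprem_cost_per_hour: float) -> float:
    """Hours of use after which buying hardware beats renting in the cloud.

    hardware_cost: up-front purchase price (excluding parts already owned).
    cloud_rate_per_hour: hourly price of a comparable cloud instance.
    onprem_cost_per_hour: hourly running cost on-prem (power, cooling, ...).
    """
    saving_per_hour = cloud_rate_per_hour - onprem_cost_per_hour
    if saving_per_hour <= 0:
        raise ValueError("on-prem never pays off at these rates")
    return hardware_cost / saving_per_hour


if __name__ == "__main__":
    # Hypothetical figures: 20,000 up front, 3.00/h cloud, 0.50/h on-prem.
    print(payback_hours(20_000, 3.0, 0.5))  # 8000.0
```

Already owning the GPUs lowers `hardware_cost`, which shortens the payback time proportionally.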

joossss OP t1_j3q9qe0 wrote on January 10, 2023 at 9:19 AM

#1,323,496

Replying to Cosmic_peach94 (#1,319,145)

Thanks for the info! I was wondering how to do that.

joossss OP t1_j3q9uip wrote on January 10, 2023 at 9:21 AM

#1,323,500

Replying to learn-deeply (#1,319,430)

Only this server is planned. I just went with the recommendation from NVIDIA's website, which stated 100 Gbps per A100, but now that I think of it, that recommendation makes more sense for distributed training. What NIC speed would be enough in that case?

learn-deeply t1_j3qdgsl wrote on January 10, 2023 at 10:12 AM

#1,323,630

Replying to joossss (#1,323,500)

10 Gbps is more than sufficient; data loading from the internet is not the bottleneck. Most likely you'll have the data already stored on the machine itself. Btw, why did you remove the post?
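
On the NIC speed question, a quick way to sanity-check the numbers: idealized transfer time for a dataset at a given link speed. This sketch assumes decimal units (1 GB = 10^9 bytes, 1 Gbps = 10^9 bits per second) and ignores protocol overhead and disk speed; the function name and the 1 TB dataset size are made up for illustration.

```python
def transfer_seconds(size_gb: float, link_gbps: float) -> float:
    """Idealized time to move `size_gb` gigabytes over a `link_gbps` link."""
    if link_gbps <= 0:
        raise ValueError("link speed must be positive")
    return size_gb * 8 / link_gbps  # 8 bits per byte


if __name__ == "__main__":
    # Hypothetical 1 TB (1000 GB) dataset copied to the server once.
    for gbps in (10, 100):
        minutes = transfer_seconds(1000, gbps) / 60
        print(f"{gbps} Gbps: {minutes:.1f} min")
```

For a one-off copy onto local storage, the difference between the two link speeds is a matter of minutes under these assumptions.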

joossss OP t1_j3qnc04 wrote on January 10, 2023 at 12:14 PM

#1,324,042

Replying to learn-deeply (#1,323,630)

Yeah true and thanks :)

I did not remove it. It was removed by the moderators for some reason.

[D] Deep Learning Training Server

Comments