rantana t1_j32bo5x wrote on January 5, 2023 at 4:04 PM

#1,287,899

128GB HBM would fit some serious models on a single device. But I have yet to see any real progress from AMD (something that I can buy) that would make me consider changing workflow away from nvidia hardware.
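
To put a rough number on "serious models", here is a back-of-the-envelope sketch of how many parameters fit in 128 GB if only the weights are counted. It assumes 1 GB = 10**9 bytes and ignores activations, KV cache, and optimizer state, so real limits will be lower. The function name is made up for illustration.

```python
# Bytes needed to store one parameter in common numeric formats.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def max_params_billions(memory_gb: float, dtype: str) -> float:
    """Upper bound on parameters (in billions) whose weights alone fit in memory_gb.

    Assumption: 1 GB = 10**9 bytes; activations and optimizer state are ignored.
    """
    total_bytes = memory_gb * 10**9
    return total_bytes / BYTES_PER_PARAM[dtype] / 10**9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{max_params_billions(128, dtype):.0f}B parameters")
```

Training needs several times more memory per parameter than inference, so these figures are only an upper bound for serving a model.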

AlmightySnoo t1_j32iljg wrote on January 5, 2023 at 4:48 PM

#1,288,257

Replying to rantana (#1,287,899)

PyTorch 2.0 moving away from depending directly on CUDA and using Triton instead is good news for AMD. The Triton GitHub repo says that AMD GPU support is under development. AMD needs to invest some resources to help there.

geeky_username t1_j32l1cl wrote on January 5, 2023 at 5:02 PM

#1,288,396

Replying to AlmightySnoo (#1,288,257)

>AMD needs to invest some resources to help there.

That's where it will fail

Nhabls t1_j32q6zp wrote on January 5, 2023 at 5:33 PM

#1,288,680

The "monopoly" is mostly from the ecosystem, not the hardware itself. Practitioners and researchers have a much better time using consumer or entry-level professional Nvidia hardware. So they use Nvidia.

Mind you that at the supercomputer level there is no real "monopoly", as those people just develop their solutions from the ground up.

Nhabls t1_j32qdjm wrote on January 5, 2023 at 5:34 PM

#1,288,694

Replying to AlmightySnoo (#1,288,257)

AMD solutions have been in "development" for as long as I've been in contact with the space. The approaches rise and fall but never deliver fully. Maybe it'll be different in the future, who knows.

AlmightySnoo t1_j32qxge wrote on January 5, 2023 at 5:38 PM

#1,288,728

Replying to Nhabls (#1,288,694)

Because AMD never goes all in on software. Hopefully that will change with Victor Peng, and $AMD starts throwing billions into software.

AlmightySnoo t1_j32s8ve wrote on January 5, 2023 at 5:45 PM

#1,288,806

Replying to Nhabls (#1,288,680)

Also this. $AMD still makes it explicit that they officially support ROCm only on CDNA GPUs, and even then it's only under Linux. That's an immediate turn-off for lots of beginner GPGPU programmers, who'll immediately flock to CUDA as it works with any not-too-old gaming GPU from Nvidia. It's astonishing how Lisa Su still hasn't realized the gravity of this blunder.

samobon OP t1_j32uicn wrote on January 5, 2023 at 5:59 PM

#1,288,926

Replying to AlmightySnoo (#1,288,806)

I agree with you both that until small academic labs are able to use entry level GPUs for research, there will not be a mass adoption.

wywywywy t1_j32xwg5 wrote on January 5, 2023 at 6:19 PM

#1,289,101

Replying to samobon (#1,288,926)

Saying that though, it's nice to see AMD trying. The monopoly is extremely unhealthy.

ZaZaMood t1_j32zcfc wrote on January 5, 2023 at 6:27 PM

#1,289,169

Replying to geeky_username (#1,288,396)

Tired of reddit cynics. Saying something will fail before it even starts

geeky_username t1_j331xz0 wrote on January 5, 2023 at 6:42 PM

#1,289,290

Replying to ZaZaMood (#1,289,169)

"Those that fail to learn from history are doomed to repeat it."

Especially on the software side, AMD has a habit of releasing something and then not doing much for continued support, expecting the community to foot the labor

rlvsdlvsml t1_j334v18 wrote on January 5, 2023 at 6:59 PM

#1,289,445

Replying to ZaZaMood (#1,289,169)

ROCm users have been failed for the past 3 years, though.

ApprehensiveNature69 t1_j337qjg wrote on January 5, 2023 at 7:16 PM

#1,289,573

Replying to rantana (#1,287,899)

ROCm works pretty well these days on my 6900 XT?

zeyus t1_j338cuu wrote on January 5, 2023 at 7:20 PM

#1,289,596

Replying to geeky_username (#1,289,290)

Isn't their continued support one of the selling points for AM5? That they supported previous gen for ages and they plan to again

geeky_username t1_j33cic6 wrote on January 5, 2023 at 7:45 PM

#1,289,820

Replying to zeyus (#1,289,596)

Software.

Having AI compute hardware is rather pointless without the supporting software.

Nvidia has an entire CUDA ecosystem for developers to use

AlmightySnoo t1_j33dyf3 wrote on January 5, 2023 at 7:53 PM

#1,289,900

Replying to ApprehensiveNature69 (#1,289,573)

But there is no official support for your card.

ReginaldIII t1_j33ff9r wrote on January 5, 2023 at 8:02 PM

#1,289,964

Replying to Nhabls (#1,288,680)

Except there is an ecosystem monopoly at the cluster level too, because some of the most established, scalable, and reliable software (the packages used in fields like bioinformatics, for example) only provides CUDA implementations of key algorithms, and being able to accurately reproduce results computed by it is vital.

This essentially limits that software to running only on large CUDA clusters. You can't reproduce the results without the scale of a cluster.

Consider software for processing Cryo-Electron Microscopy and Ptychography data. Very, very few people are actually "developing" those software packages, but thousands of researchers around the world are using them at scale to process their micrographs. Those microscopists are not programmers, or really even cluster experts, and they just don't have the skillsets to develop on these code bases. They just need it to work reliably and reproducibly.

I've been working in HPC on a range of large-scale clusters for a long time. There has been a massive and dramatic demographic shift in terms of the skillsets that our cluster users have. A decade ago you wouldn't dream of letting someone who wasn't an HPC expert anywhere near your cluster. If a team of non-HPC people needed HPC, you'd hire HPC experts into your team to handle that for you, tune the workloads onto the cluster, and develop the code to make it work best. Now we have an environment where non-HPC people can pay for access and run their workloads directly because they leverage these pre-tinned software packages.

ApprehensiveNature69 t1_j33hkaz wrote on January 5, 2023 at 8:14 PM

#1,290,072

Replying to AlmightySnoo (#1,289,900)

For an individual that’s pretty true of any card - Nvidia will probably ignore your random CUDA error and redirect you to the forums to figure it out, whether it is a K80 or an H100.

hateboresme t1_j33meqx wrote on January 5, 2023 at 8:43 PM

#1,290,285

It will not be long until we see these chips designed by AI specifically designed to design chips for the purpose of designing super efficient chips for designing chips. This is it...the chips designing the chips to design the chips. Singularity here we come.

zeyus t1_j33nspg wrote on January 5, 2023 at 8:51 PM

#1,290,349

Replying to geeky_username (#1,289,820)

Absolutely agree, it's been a while since I've had AMD hardware, but I'd consider it again (especially CPU)...I just haven't been aware of specific issues with software either, I mean Intel, AMD and Nvidia all have had bugfixes and patching with drivers and firmware. Is there something I've missed about AMD and software?

BTW, I haven't had enough disposable income to upgrade so I've been stuck on 4590K for about 6 years and I hate my motherboard software (that's Asus bloatware) and had so much trouble getting the NVMe and RAID to work...but once I did it's been OK, and the 1070 I have is getting a bit too small for working with ML/AI, but what can you do...it still runs most newish games too.

allenout t1_j33v08p wrote on January 5, 2023 at 9:34 PM

#1,290,770

Replying to AlmightySnoo (#1,288,728)

Weirdly enough, Xilinx is a huge investor in software and has absolutely amazing software support and customer service. I hope that translates over to AMD.

learn-deeply t1_j342462 wrote on January 5, 2023 at 10:17 PM

#1,291,157

Replying to ApprehensiveNature69 (#1,289,573)

What models have you tried? Wonder what the gaps between CUDA and ROCm are.

ApprehensiveNature69 t1_j342gcc wrote on January 5, 2023 at 10:19 PM

#1,291,180

Replying to learn-deeply (#1,291,157)

So far a lot of them - I have not had any issues with various Stable Diffusion models, DeOldify, BLOOM 3B, and basically anything else I have tried.
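
For anyone wanting to check which backend their own PyTorch install targets, a minimal sketch follows. It relies on `torch.version.hip` and `torch.version.cuda`, which are set on ROCm and CUDA builds respectively, and on the fact that ROCm builds expose AMD GPUs through the `torch.cuda` namespace. The helper names `classify_build` and `describe_local_install` are made up for illustration, and the script falls back to a message if PyTorch is not installed.

```python
from typing import Optional


def classify_build(hip_version: Optional[str], cuda_version: Optional[str]) -> str:
    """Classify a PyTorch build from its torch.version.hip / torch.version.cuda values."""
    if hip_version is not None:
        return f"ROCm/HIP build ({hip_version})"
    if cuda_version is not None:
        return f"CUDA build ({cuda_version})"
    return "CPU-only build"


def describe_local_install() -> str:
    """Describe the local PyTorch install, or report that it is missing."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    build = classify_build(getattr(torch.version, "hip", None), torch.version.cuda)
    # ROCm builds reuse the torch.cuda namespace, so this check covers AMD GPUs too.
    if torch.cuda.is_available():
        return f"{build}, device 0: {torch.cuda.get_device_name(0)}"
    return f"{build}, no GPU visible"


print(describe_local_install())
```

This only shows that a device is visible; it says nothing about whether the card is on the officially supported list.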

memberjan6 t1_j34r3wo wrote on January 6, 2023 at 12:58 AM

#1,292,287

Cerebras is possibly ending NVIDIA and AMD. Those two are tied to an older design which was good for a while, but has run its course and is now in decline.

geeky_username t1_j358k76 wrote on January 6, 2023 at 2:59 AM

#1,293,036

Replying to zeyus (#1,290,349)

>Is there something I've missed about AMD and software?

They have this https://gpuopen.com/

Which seems great in theory, but some of that hasn't been touched in a long time.

Radeon Rays: May 2021

They'll release something, do a bunch of initial work on it, and then it fades

HippoLover85 t1_j35en7z wrote on January 6, 2023 at 3:45 AM

#1,293,273

Replying to geeky_username (#1,289,290)

Previously AMD didn't have the budget for it. They do now, and have really only had it for the last two-ish years.

Will they now put resources towards it? I hope so. But it also appears AMD is trying to get products into mega DC/supercomputer applications and spread use that way.

scraper01 t1_j35qkv9 wrote on January 6, 2023 at 5:28 AM

#1,293,774

Nope. I develop for AMD chipsets in my free time. Performance is inferior to Nvidia all over the place, and the support sucks ass. Prepare to fix the 'supported' libraries yourself.

zeyus t1_j363mxu wrote on January 6, 2023 at 7:54 AM

#1,294,219

Replying to geeky_username (#1,293,036)

Well that is a genuine shame, Nvidia really needs some competition in this space. I'm sure plenty of researchers and enthusiasts would happily use some different hardware (as long as porting was easy). I've written some CUDA C++ and it's not bad. Manufacturer-specific code always feels a bit gross, but the GPU agent-based modeling framework I was using was strictly CUDA.

zepmck t1_j364mra wrote on January 6, 2023 at 8:07 AM

#1,294,240

AMD should seriously invest in developing a credible software stack rather than hyping new chips.

visarga t1_j36bpnu wrote on January 6, 2023 at 9:42 AM

#1,294,409

Replying to AlmightySnoo (#1,288,806)

Maybe it was intended to keep AMD on a different path than NVIDIA. It looks incredibly stupid not to hop on the AI wave.

visarga t1_j36by5o wrote on January 6, 2023 at 9:46 AM

#1,294,418

Replying to hateboresme (#1,290,285)

No, it's train models to solve problems to make more data to train models. That's how it will go.

b3081a t1_j375hbo wrote on January 6, 2023 at 2:42 PM

#1,295,487

Replying to AlmightySnoo (#1,288,806)

They added official support for Navi21 under Linux some time ago. It's still a very small number of supported devices compared to NVIDIA, but at least it's no longer required to purchase CDNA accelerators to get started.

ZaZaMood t1_j39c7ot wrote on January 6, 2023 at 10:50 PM

#1,298,776

Replying to zeyus (#1,294,219)

Nvidia needs some competition fr fr. I can't even consider buying AMD because the entire data science community has pinned to CUDA

lostmsu t1_j4j95e9 wrote on January 16, 2023 at 2:36 AM

#1,373,118

Replying to ApprehensiveNature69 (#1,289,573)

Can you bench training with https://github.com/karpathy/nanoGPT and a 100M+ parameter GPT model?
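
As a rough guide to what clears the 100M mark, the sketch below counts parameters for a GPT-2 style decoder. It assumes the standard GPT-2 layout (biases everywhere, 4x MLP expansion, output head tied to the token embedding) and the published GPT-2 small sizes. It is not nanoGPT's own counting code, and `gpt_param_count` is a made-up helper.

```python
def gpt_param_count(n_layer: int, n_embd: int, vocab_size: int, block_size: int) -> int:
    """Parameter count for a GPT-2 style model with tied input/output embeddings."""
    # Token embedding plus learned position embedding.
    embeddings = vocab_size * n_embd + block_size * n_embd
    # Each LayerNorm has a weight and a bias vector.
    layer_norm = 2 * n_embd
    # Fused QKV projection plus the attention output projection, both with biases.
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Two linear layers with a 4x hidden expansion, both with biases.
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    per_block = 2 * layer_norm + attention + mlp
    # The final LayerNorm is added once; the tied output head adds nothing.
    return embeddings + n_layer * per_block + layer_norm


# GPT-2 small sizes (assumed here): 12 layers, 768-wide, 50257 tokens, 1024 context.
n = gpt_param_count(n_layer=12, n_embd=768, vocab_size=50257, block_size=1024)
print(f"{n:,} parameters")  # 124,439,808 parameters
```

So a 12-layer, 768-wide configuration is already past 100M parameters.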

[deleted] t1_j5jsep7 wrote on January 23, 2023 at 2:47 PM

#1,451,399

Replying to samobon (#1,288,926)

[deleted]

limb3h t1_j5lyue0 wrote on January 23, 2023 at 11:07 PM

#1,458,743

How many APUs can be connected together via Infinity Fabric (IF)? Hopefully they can do 8-16 to challenge DGX.

limb3h t1_j5lzdx6 wrote on January 23, 2023 at 11:11 PM

#1,458,800

Replying to memberjan6 (#1,292,287)

Cerebras is pretty well suited for large language models like GPT-3. Their latest generation product can be clustered easily to train huge models. I wouldn't say they're ending AMD and NVDA though, but in order for huge language models to be democratized, some disruptive technologies have to happen. No one other than whales today can afford to train GPT-3.

limb3h t1_j5m01yl wrote on January 23, 2023 at 11:16 PM

#1,458,861

Replying to zepmck (#1,294,240)

They're trying pretty hard, but Nvidia has spent thousands of man-years on this stuff and built an ecosystem and community around it. It's not easy. Plus it's hard for AMD to hire the best software folks.

kanink007 t1_j638gwv wrote on January 27, 2023 at 11:58 AM

#1,525,759

Replying to AlmightySnoo (#1,288,257)

Any info about AMD APUs? By now I've given up hoping for AMD to make ROCm available for APUs. I don't know much about Triton: does it support APUs like the 5600G?
