
trajo123 t1_iyciqd9 wrote

For users, Nvidia's near-monopoly on ML/DL compute acceleration is quite expensive. People replying with "don't bother, just use Nvidia&CUDA" only make the problem worse ...music to Nvidia's ears.
I would say: by all means try it out and share your experience, just be aware that it's likely going to be more hassle than using Nvidia&CUDA.
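
In that spirit, here's a rough sketch of a "try it and share your numbers" micro-benchmark, assuming a recent PyTorch build (the "cuda" device string covers both CUDA and ROCm installs; sizes and iteration counts are arbitrary placeholders):

```python
# Time a large matmul on whatever accelerator PyTorch sees, so results can be
# compared across vendors. All sizes and counts are placeholders.
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4096, 4096, device=device)

for _ in range(3):          # warm-up
    x @ x
if device == "cuda":
    torch.cuda.synchronize()

t0 = time.perf_counter()
for _ in range(10):
    x @ x
if device == "cuda":
    torch.cuda.synchronize()
print(f"{device}: {(time.perf_counter() - t0) / 10 * 1e3:.1f} ms per 4096x4096 matmul")
```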

33

r_linux_mod_isahoe t1_iyclnxu wrote

No, it's AMD who fucked up. ROCm as a whole is an afterthought. Hire some devs, make PyTorch work on all modern AMD GPUs, then we'll talk. For now it's somehow a community effort.
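
For reference, "make PyTorch work" here mostly means that ordinary device-agnostic code runs unchanged, since PyTorch's ROCm builds reuse the torch.cuda API. A minimal sketch, assuming a ROCm build and a GPU that's actually on ROCm's support list:

```python
# On ROCm builds of PyTorch the usual torch.cuda calls map to AMD GPUs,
# so this ordinary snippet should run unchanged if the stack works.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("HIP runtime:", torch.version.hip)   # non-None on ROCm builds, None on CUDA builds

x = torch.randn(1024, 1024, device=device)
w = torch.randn(1024, 1024, device=device, requires_grad=True)
loss = (x @ w).sum()
loss.backward()                            # exercises the backend end to end
print("grad norm:", w.grad.norm().item())
```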

26

serge_cell t1_iycp9ri wrote

For that, AMD would first need a proper implementation of OpenCL. People complain all the time: slowdowns, crashes, lack of portability. This has been going on for 10 years already and it isn't getting better.

9

Ronny_Jotten t1_iyd43te wrote

> People replying with "don't bother, just use Nvidia&CUDA" only make the problem worse

No, they don't "only make it worse". It's good advice for the large proportion of people who just need to get work done. AMD/Intel need to hear that and step up by providing real, fully supported alternatives instead of leaving their customers to fool around with half-working CUDA imitations. ML is such an important field right now, and they've dropped the ball.

18

ReginaldIII t1_iycqo8c wrote

> People replying with "don't bother, just use Nvidia&CUDA" only make the problem worse ...music for Nvidia's ears.

My job is to get a trained model out the door so we can run experiments.

My job is not to revolutionize the frameworks and tooling available so that competing hardware can be made a feasible alternative for everyone.

There are only so many hours in the day. I get paid for a very specific job. I have to work within the world that exists around me right now.

15

philthechill t1_iycso44 wrote

If you're in a commercial setting, your job is to get market-beating learning done at minimal cost. OP says these things might offer revolutionary value in cost per unit of learning, so yeah, it is within your job parameters to look at specs, pricing and tool support at the very least. Ignoring technological revolutions is definitely one way companies end.

−7

ReginaldIII t1_iyctmv8 wrote

90% of the time my job is to be a small, consistently performing cog in a much bigger machine, because I am there to help drive downstream science outcomes for other scientists (often in a different discipline).

We need to get X done within Y timeframe.

> "Lets consider upending our infrastructure and putting millions of pounds worth or existing and battle proven code and hardware up in flux so we can fuck around seeing if Intel has actually made a viable GPU-like product on their umpteenth attempt"

... is not exactly an easy sell to my board of governance.

I was in the first wave of people who got access to Xeon Phi Knights Corner coprocessor cards. Fuck my life did we waste time on that bullshit. The driver support was abysmal, even with Intel's own ICC compiler and their own MPI distribution.

11

philthechill t1_iyctys1 wrote

Yeah fair.

2

ReginaldIII t1_iydnekt wrote

Also worth considering how many years it is going to take to offset the sizeable cost of such a migration.

Forget the price of the hardware: how long is it going to take to offset the cost of the programming and administration labour to pull off this sort of move?

What about maintenance? We've got years of experience with Nvidia cards in datacentres; we understand the failure modes pretty well, and we understand the tooling needed to monitor and triage these systems at scale.
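
For a sense of what that tooling looks like on the Nvidia side, here's a sketch built on nvidia-smi's query interface (the field names are standard nvidia-smi ones; the parsing is just illustrative). Any replacement hardware needs an equivalent that's just as boring and reliable:

```python
# Pull per-GPU health metrics via nvidia-smi's CSV query output.
import subprocess

fields = "index,name,temperature.gpu,utilization.gpu,memory.used"
out = subprocess.run(
    ["nvidia-smi", f"--query-gpu={fields}", "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.strip().splitlines():
    idx, name, temp, util, mem = (v.strip() for v in line.split(","))
    print(f"GPU {idx} ({name}): {temp} C, {util}% util, {mem} MiB used")
```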

What guarantees do I have that if I fill my racks with this hardware they won't be dying or catching on fire within a year?

What guarantees do I have that Intel won't unilaterally decide this is a dead cat for them and scrap the project, like they have for almost every GPU-adjacent project they've had?

3

AtomKanister t1_iycydix wrote

>"Lets consider upending our infrastructure and putting millions of pounds worth or existing and battle proven code and hardware up in flux so we can fuck around seeing if Intel has actually made a viable GPU-like product on their umpteenth attempt"

That's exactly how innovation is made, and missing out on it at crucial moments is how previously big players become irrelevant in the blink of an eye. See: Kodak, Blockbuster, Sears, Nokia.

It's valid to be skeptical of new developments (because a lot of them will be dead ends), but overdo it and you're setting yourself up for disaster.

−9

hgoel0974 t1_iyczwvn wrote

Setting up infrastructure that relies on a GPU that can't do what you need yet, and isn't optimized for it either, is certainly innovative, but not in the way you're thinking.

6

ReginaldIII t1_iydl00l wrote

> That's exactly how innovation is made

It's also how companies overextend and go out of business.

1

hgoel0974 t1_iycyx88 wrote

For most users of ML frameworks, results take priority and there isn't much they can do about AMD's shit software and unreliable support. Plus even 4090s aren't really that expensive relative to what ML people make.

That said, Intel might actually be able to compete once their drivers have caught up. Unlike AMD, who seems to have systemic issues (not to mention fatal design flaws in ROCm in general), Intel just needs time because they clearly rushed the devices out before the drivers were fully ready.
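
For what it's worth, the user-facing side of that already exists: Intel GPUs are addressed through an "xpu" device in PyTorch. A sketch, assuming intel-extension-for-pytorch (or a recent PyTorch with native XPU support) is installed; whether a device actually shows up is exactly the driver question:

```python
# Probe for Intel's "xpu" backend and run a tiny layer on it if present.
import torch

try:
    # On older PyTorch versions this import is what registers the xpu backend.
    import intel_extension_for_pytorch  # noqa: F401
except ImportError:
    pass

if hasattr(torch, "xpu") and torch.xpu.is_available():
    model = torch.nn.Linear(512, 512).to("xpu")
    x = torch.randn(64, 512, device="xpu")
    print("xpu output shape:", model(x).shape)
else:
    print("No xpu device visible to this build")
```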

1

slashdave t1_iydpou1 wrote

Power costs dwarf hardware costs, by miles. Come up with a power-efficient GPU, and we'll talk.
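
For anyone who wants to run that comparison themselves, a back-of-the-envelope sketch (every number below is a placeholder, not a measurement):

```python
# Lifetime electricity cost for one card, with cooling folded in via PUE.
def lifetime_energy_cost(watts, price_per_kwh, years, utilisation=1.0, pue=1.0):
    hours = years * 365 * 24 * utilisation
    kwh = (watts / 1000) * hours * pue
    return kwh * price_per_kwh

# Hypothetical example: 400 W card, 80% utilised, $0.15/kWh, PUE 1.5, 5 years.
cost = lifetime_energy_cost(400, 0.15, 5, utilisation=0.8, pue=1.5)
print(f"~${cost:,.0f} in electricity over 5 years")
```

Compare that against the purchase price of the card you're actually considering.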

1