
Thrawn89 t1_j6myw1u wrote

The explanation you are replying to is completely wrong. GPUs haven't been optimized for vector math in something like 20 years. They all operate on what's called a SIMD (single instruction, multiple data) architecture, which is why they can do this work faster.

In other words, they can do the exact same calculations as a CPU, except they run each instruction on something like 32 shader instances (a "wave" or "warp") at the same time. They also have multiple shader cores.

The CUDA core count Nvidia quotes is just 32 × the number of shader cores, in other words how many parallel ALU calculations they can do simultaneously. For example, the 4090 has 16384 CUDA cores, so it can do 512 unique instructions on 32 pieces of data each.

Your CPU can do maybe 8 unique instructions on a single piece of data each.

In other words, GPUs are vastly superior when you need to run the same calculations on many pieces of data. This fits well with graphics, where you need to shade millions of pixels per frame, but it also works just as well for, say, calculating physics on 10,000 particles at the same time or simulating a neural network with many neurons.
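To make that concrete, here is a minimal sketch (a made-up CUDA kernel, not any particular engine's code) of the particle case. Every thread runs the exact same instruction stream, just on its own element, and the hardware executes the threads in 32-wide groups.

```
#include <cuda_runtime.h>

// Hypothetical example: advance 10,000 particles by one time step.
// Every thread executes the same instructions on a different element:
// the "same calculation on many pieces of data" pattern SIMD hardware wants.
__global__ void integrate(float* pos, const float* vel, float dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        pos[i] += vel[i] * dt;   // identical math, different data per thread
}

int main()
{
    const int n = 10000;
    float *pos, *vel;
    cudaMallocManaged(&pos, n * sizeof(float));
    cudaMallocManaged(&vel, n * sizeof(float));
    for (int i = 0; i < n; ++i) { pos[i] = 0.0f; vel[i] = 1.0f; }

    // 256 threads per block; the hardware splits each block into 32-wide warps.
    integrate<<<(n + 255) / 256, 256>>>(pos, vel, 0.016f, n);
    cudaDeviceSynchronize();

    cudaFree(pos);
    cudaFree(vel);
    return 0;
}
```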

CPUs are better at calculations that only need to be done on a single piece of data, since they are clocked higher and have no setup latency.

2

Zironic t1_j6nh61n wrote

>Your CPU can do maybe 8 unique instructions on a single piece of data each.

A modern CPU core can run 3 instructions per cycle on 512 bits of data each, making each core equivalent to about 96 basic shaders. Even so, you can see how even a 20-core CPU can't keep up with even a low-end GPU in raw parallel throughput.
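For reference, here is a rough host-side sketch (plain C++ with AVX-512 intrinsics; it assumes a CPU with AVX-512F, and the function name is made up) of what those 512-bit instructions look like. Each one operates on 16 single-precision floats at once.

```
#include <immintrin.h>

// Hypothetical example: out[i] = a[i] * s + b[i] for n floats.
// Each 512-bit instruction processes 16 single-precision values.
void scale_add(float* out, const float* a, const float* b, float s, int n)
{
    __m512 vs = _mm512_set1_ps(s);                 // broadcast the scalar
    for (int i = 0; i + 16 <= n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);        // load 16 floats
        __m512 vb = _mm512_loadu_ps(b + i);
        __m512 vr = _mm512_fmadd_ps(va, vs, vb);   // 16 multiply-adds in one instruction
        _mm512_storeu_ps(out + i, vr);             // store 16 results
    }
    for (int i = n - (n % 16); i < n; ++i)         // scalar tail for the leftovers
        out[i] = a[i] * s + b[i];
}
```

Compilers will often generate the same thing from a plain scalar loop when you let them auto-vectorize.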

>CPUs are better at calculations that only need to be done on a single piece of data, since they are clocked higher and have no setup latency.

The real benefit isn't the clock rate; if that were the main difference, we wouldn't be using CPUs anymore, because the clocks aren't that far apart.

What CPUs have that GPUs do not is branch prediction and very, very advanced data pipelines and instruction queues, which allow per-core performance a good order of magnitude better than a shader's for anything that involves branches.

1

Thrawn89 t1_j6nkd1y wrote

True, SIMD is absolutely abysmal at branches, since it (usually) needs to execute both the true and the false case for the entire wave. There are optimizations GPUs do so it's not always terrible, though.
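A tiny made-up kernel to show the cost: if the 32 threads of a wave disagree on the condition, the hardware runs both sides one after the other, with the non-participating lanes masked off.

```
// Hypothetical kernel illustrating wave/warp divergence.
__global__ void divergent(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (data[i] < 0.0f)
        data[i] = -data[i];        // path A: only some lanes active
    else
        data[i] = data[i] * 2.0f;  // path B: the remaining lanes active

    // If a whole wave happens to agree on the condition, only that one path
    // is executed, which is one of the optimizations mentioned above.
}
```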

It sounds like you're discussing a 512-bit vector processing instruction set, which is very much specialized for certain tasks such as memcpy and not much else? That's just an example of a small SIMD unit on the CPU.

1

Zironic t1_j6nzase wrote

>It sounds like you're discussing a 512-bit vector processing instruction set, which is very much specialized for certain tasks such as memcpy and not much else? That's just an example of a small SIMD unit on the CPU.

The vector instruction set is primarily for floating-point math, but it also does integer math. It's only specialized for certain tasks insofar as those tasks are SIMD-shaped; it takes advantage of the fact that doing a math operation across an entire 512-bit register is about as fast as doing it on a single word.

In practice, most programs don't lend themselves to vectorisation, so it's mostly used for physics simulations and the like.
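Roughly speaking, it comes down to whether the loop iterations are independent (hypothetical host-side C++ examples):

```
// Vectorizes well: every iteration is independent and does the same math,
// which is the physics-simulation shape.
void damp_velocities(float* vel, float damping, int n)
{
    for (int i = 0; i < n; ++i)
        vel[i] *= damping;
}

// Does not vectorize: each step depends on the pointer loaded in the
// previous step, so there is nothing for the vector unit to do in parallel.
struct Node { float value; Node* next; };

float sum_list(const Node* head)
{
    float total = 0.0f;
    for (const Node* p = head; p != nullptr; p = p->next)
        total += p->value;
    return total;
}
```

The first loop is what compilers happily auto-vectorize; the second stays scalar no matter how wide the vector unit is.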

1