geneing t1_jd6016n wrote on March 22, 2023 at 1:55 AM

Is this a workaround for the weird Cerebras chip architecture? Would mainstream users who train on GPU benefit?

CS-fan-101 OP t1_jd649ko wrote on March 22, 2023 at 2:28 AM

I wouldn't call it a workaround but rather an advantage.

Neural network models are made up of layers of neurons and connections between them. When there are missing connections, represented as zeros in the weight matrices, we refer to the model as sparse.

Sparsity comes in different forms. It is common for sparsity to occur naturally in the model structure itself if the pattern of connections is designed to only connect a subset of the neurons. Often, models are constructed this way intentionally with a predefined pattern and we refer to this as structured sparsity.

It turns out that even fully dense models, such as GPT, can be made sparse by pruning: certain weights are set to zero, which effectively removes those connections from the model. When the pruning is done without a fixed pattern, we refer to this as unstructured sparsity.
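
As a rough illustration of what "setting certain weights to zero without a fixed pattern" can look like, here is a minimal NumPy sketch. It assumes magnitude pruning (dropping the smallest-magnitude weights), which is one common way to induce unstructured sparsity and not necessarily the method used in the blog post; the function name `magnitude_prune` is made up for this example.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the `sparsity` fraction of entries with the smallest magnitude.

    Assumption: magnitude is used as the pruning criterion. The zeros can land
    anywhere in the matrix, which is what makes the result unstructured.
    """
    k = int(round(sparsity * weights.size))  # number of weights to remove
    if k == 0:
        return weights.copy()
    flat = weights.copy().ravel()
    # Indices of the k smallest-magnitude entries, anywhere in the matrix.
    drop = np.argpartition(np.abs(flat), k - 1)[:k]
    flat[drop] = 0.0
    return flat.reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))
w_sparse = magnitude_prune(w, sparsity=0.75)
print("fraction of zeros:", np.mean(w_sparse == 0))
```

The pruned matrix keeps its original shape, so the layer definition itself does not change; only the values do.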

A key benefit of unstructured sparsity is that the model retains its original baseline structure, with no need to create a new model architecture. Additionally, the sparse model can provide a speedup in both training and inference.

The Cerebras CS-2 is designed to accelerate unstructured sparsity, whereas GPUs are not.

If you are interested in learning more, please check out our blog - https://www.cerebras.net/blog/harnessing-the-power-of-sparsity-for-large-gpt-ai-models

maizeq t1_jd6kpnj wrote on March 22, 2023 at 5:01 AM

> The Cerebras CS-2 is designed to accelerate unstructured sparsity, whereas GPUs are not.

Don’t modern NVIDIA GPUs (2000s+) have strong support for sparsity (maximum theoretical FLOPS are doubled when doing sparse computation)? From their documentation, the type of sparsity they support is also unstructured (e.g., randomly pruned values in tensors). Does the Cerebras chip have higher sparse FLOPS, or does the comparison not make sense?

artsybashev t1_jd6l85h wrote on March 22, 2023 at 5:07 AM

nvidia has structured sparsity

maizeq t1_jd6u4kb wrote on March 22, 2023 at 7:03 AM

The sparsity they describe in this link entails randomly pruning weights (i.e. not specific channels like depthwise convolutions), which is what Cerebras is calling "unstructured".

osdd_alt_123 t1_jd6ufjz wrote on March 22, 2023 at 7:07 AM

Nvidia has 2:4 structured sparsity in the Ampere architecture and, if memory serves, in the generations since.

So in a block of 4, you have to have 2 dropped and 2 retained. It's how they claim their 2x throughput at the hardware level.

You can, however, emulate sparsity in a variety of other ways that are higher than the hardware level. Hope this helps.
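
To make the 2:4 rule above concrete, here is a small NumPy sketch that emulates the pattern in software. Picking the two largest-magnitude values in each group is an assumption made for illustration, and `prune_2_of_4` is a hypothetical helper name; this only produces the mask pattern and gives no hardware speedup by itself.

```python
import numpy as np

def prune_2_of_4(weights: np.ndarray) -> np.ndarray:
    """Keep 2 values and zero 2 values in every consecutive group of 4 per row.

    Assumption: the 2 kept values are the ones with the largest magnitude.
    """
    rows, cols = weights.shape
    assert cols % 4 == 0, "row length must be a multiple of 4"
    groups = weights.reshape(rows, cols // 4, 4)
    # Positions of the 2 smallest magnitudes within each group of 4.
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]
    mask = np.ones(groups.shape, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return np.where(mask, groups, 0.0).reshape(rows, cols)

w = np.array([[0.9, -0.1, 0.3, -0.7, 0.2, 0.5, -0.4, 0.05]])
w_24 = prune_2_of_4(w)
print(w_24)
```

Unlike unstructured pruning, the zeros here cannot pile up in one place: every block of 4 ends up with exactly 2 zeros, no matter how the magnitudes are distributed.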

maizeq t1_jd76a7x wrote on March 22, 2023 at 9:57 AM

Ah I see, thank you for the clarification.

brownmamba94 t1_jd8lqry wrote on March 22, 2023 at 4:46 PM

Also, the N:M sparsity structure is much more constrained in terms of mask diversity compared to unstructured sparsity. In Table 1 of the N:M Transposable sparsity paper, they compare the mask diversity of different sparsity techniques (both unstructured and structured), and as expected unstructured sparsity achieves the highest mask diversity. I think this is especially important for dynamic sparse training, because the algorithm then has a much larger space of sparse subnetworks to explore. Also, imposing structured sparsity like N:M sparsity tends to reduce the expressivity of a weight matrix at higher sparsity levels, which can be a constraint if you want to get high compression ratios.
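
A toy calculation gives a feel for the gap. This assumes "mask diversity" means the number of distinct masks a constraint allows at a given sparsity level, which is my reading of the term; the numbers below are for a made-up 16-weight block at 50% sparsity and are not the values from the paper's table.

```python
from math import comb

def unstructured_mask_count(n: int, keep: int) -> int:
    """Number of ways to keep `keep` out of `n` weights with no pattern constraint."""
    return comb(n, keep)

def n_m_mask_count(n: int, keep_per_group: int, group_size: int) -> int:
    """Number of masks when every group of `group_size` keeps exactly `keep_per_group`."""
    assert n % group_size == 0, "n must be a multiple of the group size"
    return comb(group_size, keep_per_group) ** (n // group_size)

n = 16  # hypothetical block of 16 weights, 50% sparsity in both cases
unstructured = unstructured_mask_count(n, keep=8)
two_four = n_m_mask_count(n, keep_per_group=2, group_size=4)
print(unstructured)  # 12870
print(two_four)      # 1296 (6 choices per group, 4 groups)
```

Even at this tiny size the unstructured case allows roughly ten times as many masks as 2:4, and the ratio grows quickly as the block gets larger.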