Viewing a single comment thread. View all comments

_Arsenie_Boca_ t1_jd6u2my wrote

First time I hear sparse pretraining and dense finetuning. Usually its the other way around right? So that you get faster inference speeds. Is it correct that you are aiming for faster pretraining through sparsity here, while having normal dense inference speeds?

Also, could you provide an intuition on how cerebras is able to translate unstructured sparsity to speedups? Since you pretrained a 1.3B model, I assume it runs on GPU, unlike DeepSparse?

3

brownmamba94 t1_jd6xm1n wrote

Yes, that's right, usually it's the other way around and that's usually because for the average researcher its computationally expensive to pre-train the LLM from scratch. So, they often typically take existing pre-trained LLM checkpoints and perform fine-tuning on them on a domain specific task. The FLOPs required for pre-training is several orders of magnitude more FLOPs than fine-tuning.

In this work, like you said, we're aiming to show that thanks to the Cerebras CS-2, we can achieve faster pre-training with unstructured weight sparsity, and fine-tune dense to recover the performance on the downstream task. The ability to do faster pre-training opens up a lot of potential for new directions in LLM research. Note that an interesting extension of our work is to do sparse pre-training followed by parameter efficient fine-tuning using techniques like LoRA from Microsoft.

There's actually a couple really nice blogs from Sean Lie, our Co-founder and Chief Hardware Architect, discussing how the Cerebras CS-2 can translate unstructured sparsity to realized gains unlike traditional GPUs. All the experiments in our paper were done on the CS-2, including the 1.3B GPT-3 XL. There was no GPU training here. I encourage you to check out these blogs:

Harnessing the Power of Sparsity for Large GPT AI ModelsCerebras Architecture Deep Dive: First Look Inside the HW/SW Co-Design for Deep Learning

4