[deleted] t1_jacx9ai wrote on February 28, 2023 at 3:18 PM

That’s about 100x less than what I’d expected.

Beli_Mawrr t1_jad4r9n wrote on February 28, 2023 at 4:08 PM

That's almost in the realm of what my computer can run, no?

curiousshortguy t1_jad9s4t wrote on February 28, 2023 at 4:40 PM

It is. You can probably run 2 to 8 billion parameters on your average gaming PC, and 16 billion on a high-end one.

AnOnlineHandle t1_jaeshwf wrote on February 28, 2023 at 10:30 PM

Is there a way to convert parameter count into VRAM requirements? Presuming that's the main bottleneck?

metal079 t1_jaeuymi wrote on February 28, 2023 at 10:47 PM

Rule of thumb is VRAM needed = about 2 GB per billion parameters (2 bytes each at fp16), though I recall Pygmalion, which is 6B, says it needs 16 GB of RAM, so it depends.

curiousshortguy t1_jaf3aab wrote on February 28, 2023 at 11:47 PM

Yeah, about 2-3 GB per billion. You can easily offload layers of the network to disk and so load even larger models that don't fit in VRAM, BUT disk I/O will make inference painfully slow.
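To put a rough number on "painfully slow", here is a back-of-the-envelope sketch. It assumes that every weight that does not fit in VRAM has to be re-read from disk on each forward pass, and it ignores activations, RAM caching and compute time. The function name and all the numbers in the example (6B parameters, 8 GB of VRAM, 0.5 GB/s disk reads) are made up for illustration.

```python
def offload_seconds_per_pass(n_params, bytes_per_param, vram_bytes, disk_bytes_per_s):
    """Rough time spent reading offloaded weights from disk for one forward pass."""
    model_bytes = n_params * bytes_per_param
    # Assumption: weights that do not fit in VRAM are re-read from disk every pass.
    spilled_bytes = max(0, model_bytes - vram_bytes)
    return spilled_bytes / disk_bytes_per_s


GB = 10**9
# Hypothetical setup: 6B parameters in fp16, 8 GB of VRAM, 0.5 GB/s disk reads.
print(offload_seconds_per_pass(6 * 10**9, 2, 8 * GB, 0.5 * GB))  # 8.0 seconds
```

Under those assumptions each generated token would wait about 8 seconds on disk reads alone, which is why offloading is usually a last resort.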

new_name_who_dis_ t1_jaf4lmy wrote on February 28, 2023 at 11:56 PM

Each float32 is 4 bytes.
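As a minimal sketch of that conversion: the memory needed just to hold the weights is the parameter count times the bytes per parameter. The helper name below is hypothetical, and the estimate deliberately leaves out activations, the attention cache and framework overhead, which is one reason real requirements come out higher than the rule of thumb.

```python
# Bytes used to store one parameter in each format.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(n_params, dtype="float16"):
    """Memory needed to hold the weights alone, in GB (10^9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


# Example: a 6B-parameter model in each format.
for dtype in BYTES_PER_PARAM:
    print(dtype, weight_memory_gb(6e9, dtype), "GB")
```

For 6B parameters this gives 24 GB in float32, 12 GB in float16 and 6 GB in int8, which matches the "2 GB per billion" figure for fp16.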

[deleted] t1_jaeu7ev wrote on February 28, 2023 at 10:42 PM

[removed]

abnormal_human t1_jad6qae wrote on February 28, 2023 at 4:21 PM

Yeah, probably.

dancingnightly t1_jadj7fa wrote on February 28, 2023 at 5:40 PM

Edit: Seems like for this one, yes. They do take human instructions into account (similar to the goal of RLHF, which requires more RAM) by adding them directly to the text dataset, as mentioned in section 3.3, Language-Only Instruction Tuning.

For other models, like the upcoming OpenAssistant, one thing to note is that, although the generative model itself may be runnable locally, the reward model (the bit that "adds finishing touches" and ensures following instructions) can be much bigger. Even if the underlying GPT-J model is 6B params and 11GB in RAM, the RLHF reward model could seriously increase that.

This model is in the realm of the smaller T5, BART and GPT-2 models released 3 years ago, which were runnable then on decent gaming GPUs.

currentscurrents t1_jaetyg1 wrote on February 28, 2023 at 10:40 PM

Can't the reward model be discarded at inference time? I thought it was only used for fine-tuning.

[deleted] t1_jaejynm wrote on February 28, 2023 at 9:33 PM

[removed]

currentscurrents t1_jaetvbb wrote on February 28, 2023 at 10:39 PM

Definitely in the realm of running on your computer. Almost in the realm of running on high-end smartphones with TPUs.

[deleted] t1_jadkcqd wrote on February 28, 2023 at 5:47 PM

[deleted]

RetroPenguin_ t1_jad51qy wrote on February 28, 2023 at 4:10 PM

For the >10B closed source models, I’d be really curious how many of those weights are zero with fp16 precision.
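One way to make the question concrete: fp16 cannot represent magnitudes below about 3e-8 (2^-25), so any weight smaller than that rounds to exactly zero when stored in half precision. The sketch below uses Python's `struct` module, whose `e` format is IEEE 754 half precision, to count how many values underflow. The helper names and the sample values are made up and say nothing about any real model.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def zero_fraction_fp16(weights):
    """Fraction of weights that become exactly zero when stored as fp16."""
    zeros = sum(1 for w in weights if to_fp16(w) == 0.0)
    return zeros / len(weights)


# Made-up values, not taken from any real model.
sample = [0.5, -1e-3, 1e-8, -1e-9]
print(zero_fraction_fp16(sample))  # 0.5: the two tiny values underflow to zero
```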

7734128 t1_jaemc4b wrote on February 28, 2023 at 9:49 PM

Doesn't really change anything, does it? A zero still has an effect, so it has to be there, so I assume you mean that it could use less memory, right? But is that technically feasible to do in a practical manner? I can't imagine a practical way to have a tensor of split precision weights without ruinous reprocessing when trying to use the weights.

karius85 t1_jaeoyq7 wrote on February 28, 2023 at 10:06 PM

Sparse matrices, but you would need quite a lot of zeros.
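To see how many zeros "quite a lot" is, here is a rough size comparison between a dense fp16 matrix and the same matrix in CSR (compressed sparse row) form. It assumes 2-byte values and 4-byte indices; the function names and the 2048 x 2048 shape are only illustrative.

```python
def dense_bytes(rows, cols, value_bytes=2):
    """Storage for a dense matrix: one value per entry."""
    return rows * cols * value_bytes


def csr_bytes(rows, cols, nonzero_fraction, value_bytes=2, index_bytes=4):
    """Storage for the same matrix in CSR form."""
    nnz = int(rows * cols * nonzero_fraction)
    # CSR keeps one value and one column index per nonzero, plus rows + 1 row pointers.
    return nnz * (value_bytes + index_bytes) + (rows + 1) * index_bytes


rows = cols = 2048
for frac in (0.9, 0.5, 0.3, 0.1):
    ratio = csr_bytes(rows, cols, frac) / dense_bytes(rows, cols)
    print(f"{frac:.0%} nonzero: CSR is {ratio:.2f}x the dense size")
```

With these sizes CSR only starts to save memory once fewer than roughly a third of the entries are nonzero, because every stored value drags a 4-byte index along with it.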

pawsibility t1_jaep5s5 wrote on February 28, 2023 at 10:07 PM

> The MLLM component has 24 layers with 2,048 hidden dimensions, 8,192 FFN intermediate size, and 32 attention heads, resulting in about 1.3B parameters. We use Magneto’s initialization for optimization stability. For faster convergence, the image representation is obtained from a pretrained CLIP ViT-L/14 model with 1,024 feature dimensions. The images are preprocessed into 224×224 resolution during training. We freeze the parameters of the CLIP model except for the last layer during training. The total number of parameters of KOSMOS-1 is about 1.6B.

If they use CLIP to generate image representations/embeddings as input to their model, isn't that kind of cheating when reporting numbers of parameters? Or is CLIP sufficiently small, and that's how they jumped from 1.3B to 1.6B?
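A quick sanity check on those numbers, as a sketch only: counting just the attention projections (4 x hidden^2) and the feed-forward matrices (2 x hidden x FFN) per layer, and ignoring embeddings, biases and layer norms. The MLLM shape comes from the quote above. The ViT-L shape (24 layers, width 1,024, MLP size 4,096) is the standard ViT-Large configuration and is an assumption here, since the quote only gives the 1,024 feature dimension.

```python
def transformer_block_params(layers, hidden, ffn):
    """Approximate weight count of a transformer stack (no embeddings, biases or norms)."""
    attention = 4 * hidden * hidden   # Q, K, V and output projections
    feed_forward = 2 * hidden * ffn   # up and down projections
    return layers * (attention + feed_forward)


mllm = transformer_block_params(24, 2048, 8192)   # shape from the quoted paper text
vit_l = transformer_block_params(24, 1024, 4096)  # assumed ViT-L shape, not from the quote
print(f"MLLM blocks:  {mllm / 1e9:.2f}B")
print(f"ViT-L blocks: {vit_l / 1e9:.2f}B")
```

This gives about 1.21B for the language model blocks and about 0.30B for the image encoder, so the jump from 1.3B to 1.6B looks consistent with the CLIP image encoder being counted in the total, although the quote doesn't spell that out.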

AnOnlineHandle t1_jaesse4 wrote on February 28, 2023 at 10:32 PM

The CLIP model in the Stable Diffusion 1.5 package is 480 MB according to my directory where it was unpacked by diffusers, though I don't know how that translates into parameter count.
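A rough way to go from file size back to parameter count, assuming the file is almost entirely raw weights and that "480mb" means 480 x 10^6 bytes. The precision the file was saved in isn't stated, so the sketch below (with a hypothetical helper name) tries both common options.

```python
def params_from_file_size(size_bytes, bytes_per_param):
    """Approximate parameter count, assuming the file holds only raw weights."""
    return size_bytes / bytes_per_param


size = 480 * 10**6  # "480mb" from the comment, read as 480 * 10^6 bytes
for name, width in (("float32", 4), ("float16", 2)):
    print(name, params_from_file_size(size, width) / 1e6, "million parameters")
```

That works out to about 120 million parameters if the weights are float32, or about 240 million if they are float16.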

MysteryInc152 OP t1_jacjpk9 wrote on February 28, 2023 at 1:38 PM

Yeah

[R] Microsoft introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot)

abnormal_human t1_jacjmrj wrote on February 28, 2023 at 1:37 PM