pawsibility t1_jaep5s5 wrote

> The MLLM component has 24 layers with 2,048 hidden dimensions, 8,192 FFN intermediate size, and 32 attention heads, resulting in about 1.3B parameters. We use Magneto’s initialization for optimization stability. For faster convergence, the image representation is obtained from a pretrained CLIP ViT-L/14 model with 1,024 feature dimensions. The images are preprocessed into 224×224 resolution during training. We freeze the parameters of the CLIP model except for the last layer during training. The total number of parameters of KOSMOS-1 is about 1.6B.

If they use CLIP to generate image representations/embeddings as input to their model, isn't that kind of cheating when reporting numbers of parameters? Or is CLIP sufficiently small, and that's how they jumped from 1.3B to 1.6B?