akore654 t1_irsttia wrote

If we use the language analogy: say you have a sequence of 100 words. Each of those words comes from a vocabulary of a certain size (~50,000 words for English). So for each of the 100 positions in the sequence, you can choose any of those 50,000 words.

You can see how the number of unique combinations explodes. It is the same thing for the 16x16 grid with a vocabulary of 1024 possible discrete vectors.
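To put rough numbers on it, here is a quick back-of-the-envelope calculation in Python (the sizes are just the ones from the analogy above, not from any particular paper):

```python
# Back-of-the-envelope combination counts, using the illustrative sizes above.
vocab_size, seq_len = 50_000, 100              # text: ~50k-word vocabulary, 100-word sequence
codebook_size, grid_positions = 1024, 16 * 16  # image: 1024 codes over a 16x16 grid

text_combinations = vocab_size ** seq_len             # 50,000^100 possible word sequences
image_combinations = codebook_size ** grid_positions  # 1024^256 possible code grids

# Print the order of magnitude rather than the full numbers.
print(f"text:  ~10^{len(str(text_combinations)) - 1} possible sequences")
print(f"image: ~10^{len(str(image_combinations)) - 1} possible code grids")
```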

I'm not entirely sure what motivates it; I just know it's a fairly successful method for text generation. Hope that helps.


akore654 t1_irpd2ai wrote

The 1024 is the number of latent vectors in the codebook. So the 16x16 grid would be something like [[5, 24, 16, 850, 1002, ...]], i.e. a 16x16 grid where each entry is any one of the 1024 discrete codes.

Exactly, the codes are conditioned on each other. It's exactly the same setup as the way GPT-3 and other autoregressive LLMs are trained; in their case the discrete codes are the tokenized word sequences. For images, you just flatten the grid and predict the next discrete code.
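Here's a minimal sketch of that flatten-and-predict setup (the code grid is just random stand-in data, and the shapes are illustrative rather than taken from the paper):

```python
import numpy as np

# Stand-in for a VQ code grid: 16x16 entries, each an index into a 1024-code codebook.
codebook_size = 1024
grid = np.random.randint(0, codebook_size, size=(16, 16))

# Flatten the grid into a sequence of 256 discrete codes (raster order).
sequence = grid.flatten()

# Standard next-token objective: predict code t+1 from codes 0..t,
# exactly as an autoregressive LM predicts the next word token.
inputs, targets = sequence[:-1], sequence[1:]
print(inputs.shape, targets.shape)  # (255,) (255,)
```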

I guess that's the main intuition of this method: unify generative language modeling and image modeling by representing both as sequences of discrete codes, so that we can model them with the same methods.


akore654 t1_irosysj wrote

You're right, I think the only reason it works is that you have, for instance, a 16x16 grid of discrete latent vectors. With a 1024-way categorical distribution at each of the 256 positions, it is highly unlikely that two images map to the same grid of discrete latent vectors. The paper goes into more detail.

The advantage of the discretization is that you can train an autoregressive model as a prior over the categorical distribution. The new Parti text-to-image model is a recent example of this.
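As a rough illustration, here is a minimal sketch of sampling a code grid from such a prior; `next_code_probs` is a hypothetical stand-in for a trained autoregressive model that maps the codes generated so far to a categorical distribution over the next code:

```python
import numpy as np

codebook_size, grid_side = 1024, 16

def sample_code_grid(next_code_probs):
    codes = []
    for _ in range(grid_side * grid_side):
        probs = next_code_probs(codes)                         # 1024-way categorical over next code
        codes.append(np.random.choice(codebook_size, p=probs))
    return np.array(codes).reshape(grid_side, grid_side)       # reshape back into the 16x16 grid

# Dummy uniform prior so the sketch runs end to end; a real prior would be a trained model.
uniform_prior = lambda codes: np.full(codebook_size, 1.0 / codebook_size)
sampled_grid = sample_code_grid(uniform_prior)
print(sampled_grid.shape)  # (16, 16)
```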
