Comments


cerlestes t1_irm6pi8 wrote

> Consider an image encoding problem where two distinct images, image1 and image2, map to the same discrete embedding. Then decoding will produce a single image. What have we gained by discretizing?

That's not quite true because of the U-net approach used by most VQVAEs. The decoder does not rely on the quantized latents alone; it also adds the various levels of encoder activations back into the decoded image through skip connections. This means the same embedding can produce different outputs for different inputs.
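A minimal sketch of what such a skip connection can look like (a hypothetical PyTorch toy model; the names and shapes are assumptions, and the quantization step is passed in separately):

```python
import torch
import torch.nn as nn

class UNetStyleVQVAE(nn.Module):
    """Toy sketch: encoder activations are added back into the decoder,
    so the reconstruction depends on more than the quantized codes alone."""
    def __init__(self, channels=32):
        super().__init__()
        self.enc1 = nn.Conv2d(3, channels, 4, stride=2, padding=1)          # 1/2 resolution
        self.enc2 = nn.Conv2d(channels, channels, 4, stride=2, padding=1)   # 1/4 resolution
        self.dec2 = nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1)
        self.dec1 = nn.ConvTranspose2d(channels, 3, 4, stride=2, padding=1)

    def forward(self, x, quantize):
        h1 = torch.relu(self.enc1(x))
        h2 = torch.relu(self.enc2(h1))
        z_q = quantize(h2)                    # nearest-codebook lookup (see sketch below)
        d2 = torch.relu(self.dec2(z_q) + h1)  # skip connection: add encoder activation back in
        return self.dec1(d2)
```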

My team has found that, because of this, a VQVAE is very good at reproducing details it has never seen before and can even reproduce images from domains it wasn't trained on. We trained a VQVAE on metal parts and it was capable of reproducing the Doge meme to very recognisable precision.

The advantage of using fixed embedding vectors rather than continuous dimensions is that the decoder has a much easier time learning and reconstructing details present in the training data set, since it has those fixed activation values to work with. Because the encoder always "snaps" to the nearest embedding vector, there is a lot less noise in the latent space for the decoder to deal with. We found that a VQVAE has no problem learning certain detailed patterns (e.g. the noise present on metal surfaces) and reproducing them, whereas a regular VAE learns a more blurry function (e.g. it would just give the metal surface an average gray color without any noise). I think this is due to the noise in the continuous latent space versus the fixed vector embedding space.
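For reference, that "snapping" step is just a nearest-neighbour lookup into the codebook. A minimal sketch (shapes are assumptions; the straight-through gradient and commitment loss are omitted):

```python
import torch

def quantize(z_e, codebook):
    """Snap each encoder vector to its nearest codebook entry.

    z_e:      (batch, dim, H, W) continuous encoder output
    codebook: (K, dim) learned embedding vectors
    returns:  (batch, dim, H, W) quantized latents
    """
    b, d, h, w = z_e.shape
    flat = z_e.permute(0, 2, 3, 1).reshape(-1, d)   # (b*h*w, dim)
    dists = torch.cdist(flat, codebook)             # pairwise distances to all K codes
    idx = dists.argmin(dim=1)                       # nearest code per spatial position
    z_q = codebook[idx].reshape(b, h, w, d).permute(0, 3, 1, 2)
    return z_q
```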

Another advantage of using embedding vectors is that you can use autoregressive models (and similar) to repair parts of the image that are out of distribution. We use this for anomaly detection: with regular VAEs, we found that changes to the latent space dimensions can produce big, unwanted changes in the output, usually not localized to the anomalous region. With VQVAEs, switching some embeddings generally preserves the overall output and only has a localized effect, which is exactly what we want for anomaly detection. For example, we can input an image of a damaged metal part, repair the anomalous embedding vectors and then decode it, to retrieve an image showing how that part should look without the damage.
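A rough sketch of that repair idea, assuming a hypothetical autoregressive `prior` that returns per-position logits over the K codes (the threshold and replacement rule here are illustrative, not the actual method described above):

```python
import torch

@torch.no_grad()
def repair_codes(codes, prior, threshold=0.01):
    """Replace code indices the autoregressive prior finds unlikely
    with the prior's own most likely prediction.

    codes: (seq_len,) flattened grid of code indices
    prior: model mapping a (1, seq_len) index tensor to (1, seq_len, K) logits
    """
    logits = prior(codes.unsqueeze(0))[0]                    # (seq_len, K)
    probs = logits.softmax(dim=-1)
    p_observed = probs.gather(1, codes.unsqueeze(1)).squeeze(1)
    anomalous = p_observed < threshold                       # out-of-distribution positions
    repaired = codes.clone()
    repaired[anomalous] = probs[anomalous].argmax(dim=-1)    # swap in likely codes
    return repaired                                          # decode this for the "healthy" image
```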

5

elketefuka t1_irm6kzy wrote

Quantization is sometimes used to disentangle the relevant information from other information in the data that should not be encoded. An example: in speech processing it is used to separate the speech content (what is said) from the speaker information (who says it).

2

fenixfunkXMD5a t1_irlgz2c wrote

I am new to ML, but maybe it's just because simpler systems are better. If the autoencoder is forced to have a smaller output size, you're forcing it to be more decisive when learning.

1

akore654 t1_irosysj wrote

You're right, I think the only reason it works is that you have, for instance, a 16x16 grid of discrete latent vectors. With a 1024-way categorical distribution at each grid position, it is highly unlikely that two images map to the same grid of discrete latent vectors. The paper goes into more detail.

The advantage of the discretization is the ability to train an autoregressive prior over the categorical distribution. The new Parti text-to-image model is a recent example of this.
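A minimal sketch of such a prior over the flattened code indices (a small causal transformer here; the original VQ-VAE paper used a PixelCNN, and the layer sizes below are arbitrary):

```python
import torch
import torch.nn as nn

class CodePrior(nn.Module):
    """Toy autoregressive prior over a flattened grid of codebook indices."""
    def __init__(self, num_codes=1024, seq_len=256, dim=256):
        super().__init__()
        self.tok = nn.Embedding(num_codes, dim)
        self.pos = nn.Embedding(seq_len, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, num_codes)

    def forward(self, codes):                                  # codes: (batch, seq_len)
        b, t = codes.shape
        x = self.tok(codes) + self.pos(torch.arange(t, device=codes.device))
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(codes.device)
        x = self.blocks(x, mask=mask)                          # causal mask -> autoregressive
        return self.head(x)                                    # (batch, seq_len, num_codes)

# Training is the usual language-modeling setup with shifted targets, e.g.:
# loss = F.cross_entropy(logits[:, :-1].reshape(-1, 1024), codes[:, 1:].reshape(-1))
```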

1

[deleted] OP t1_irozi6g wrote

[deleted]

1

akore654 t1_irpd2ai wrote

The 1024 is the number of latent vectors in the codebook. So the 16x16 grid would look something like [[5, 24, 16, 850, 1002, ...], ...], i.e. a 16x16 grid where each cell can hold any of the 1024 discrete codes.

Exactly, the codes are conditioned on each other. It's exactly the same setup used to train GPT-3 and other autoregressive LLMs; in their case the discrete codes are the tokenized word sequences. For images, you just flatten the grid and predict the next discrete code.
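Concretely, the flatten-and-shift setup looks like this (toy NumPy example with random codes):

```python
import numpy as np

rng = np.random.default_rng(0)
grid = rng.integers(0, 1024, size=(16, 16))   # 16x16 grid of codebook indices

seq = grid.flatten()                          # (256,) sequence in raster-scan order
inputs, targets = seq[:-1], seq[1:]           # predict code t from codes < t
print(inputs[:5], targets[:5])                # same next-token setup as a language model
```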

I guess that's the main intuition of this method: unify generative language modeling and image modeling as sequences of discrete codes, so that we can model both with the same methods.

1

[deleted] OP t1_irpyqxr wrote

[deleted]

1

akore654 t1_irsttia wrote

To use the language analogy: say you have a sequence of 100 words. Each of those words comes from a vocabulary of a certain size (~50,000 for English). So for each position in the sequence you can choose any of those 50,000 words.

You can see how this explodes in terms of the number of unique combinations. It is the same thing for the 16x16 grid with a vocabulary of 1024 possible discrete vectors.
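A quick back-of-the-envelope check of those counts:

```python
# Rough counts of distinct sequences / grids (Python handles the big integers):
words = 50_000 ** 100          # 100-word sentence, ~50k-word vocabulary
grids = 1024 ** (16 * 16)      # 16x16 grid, 1024 codes per cell

print(len(str(words)), len(str(grids)))   # number of digits: 470 and 771
```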

I'm not entirely sure what motivates it, I just know it's a fairly successful method for text generation. Hope that helps.

1