cerlestes t1_irm6pi8 wrote

> Consider an image encoding problem where two distinct images, image1 and image2, map to a same discrete embedding. Then their decoding will produce a single image. What have we gained by discretizing?

That's not quite true, because of the U-net architecture most VQVAEs use. The decoder doesn't rely on the quantized latent embedding alone; skip connections add the various levels of encoder activations back into the decoded image. This means that the same embedding can produce different outputs for different inputs.
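A minimal numeric sketch of that point (toy functions, not a real network): if the decoder combines the quantized code with encoder activations, two inputs that snap to the same code can still decode differently.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = np.array([0.0, 0.5, 1.0])  # tiny 1-D codebook for illustration

def encode(x):
    latent = x.mean(keepdims=True)  # coarse part, this is what gets quantized
    skip = x - latent               # detail carried around the bottleneck
    return latent, skip

def quantize(latent):
    # snap the continuous latent to its nearest codebook entry
    return codebook[np.argmin(np.abs(codebook - latent))]

def decode(code, skip):
    # U-net style: add the encoder activations back into the output
    return code + skip

x1 = rng.normal(0.5, 0.05, size=8)  # two distinct "images"
x2 = rng.normal(0.5, 0.05, size=8)

(l1, s1), (l2, s2) = encode(x1), encode(x2)
q1, q2 = quantize(l1), quantize(l2)
out1, out2 = decode(q1, s1), decode(q2, s2)
# q1 == q2 (both snap to the same code), yet out1 != out2: the skips differ
```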

My team has found that, because of this, a VQVAE is very good at reproducing details it has never seen before and can even reproduce images from domains it wasn't trained on. We trained a VQVAE on metal parts and it was capable of reproducing the Doge meme with very recognisable precision.

The advantage of using fixed embedding vectors rather than continuous dimensions is that the decoder has a much easier time learning and reconstructing details present in the training data set, since it has those fixed activation values to work with. Because the encoder always "snaps" to the nearest embedding vector, there is far less noise in the latent space for the decoder to deal with. We found that a VQVAE has no problem learning certain detailed patterns (e.g. the noise present on metal surfaces) and reproducing them, whereas a regular VAE learns a blurrier function (e.g. it would just give the metal surface an average gray color without any noise). I think this is due to the noise on the continuous latent space vs. the fixed vector embedding space.
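Here's a hedged sketch of that snapping step (plain numpy nearest-neighbour lookup, not any particular implementation): each continuous latent vector is replaced by its nearest codebook row, so small latent perturbations map to the exact same discrete code and never reach the decoder.

```python
import numpy as np

def vq(z, codebook):
    """Snap each latent vector in z (N, D) to its nearest codebook row (K, D)."""
    # pairwise squared distances between latents and codebook entries
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    return codebook[idx], idx

codebook = np.array([[0.0, 0.0],
                     [1.0, 1.0]])
z_clean = np.array([[0.95, 1.05],
                    [0.02, -0.03]])
z_noisy = z_clean + 0.05  # small perturbation in the latent space

q_clean, i_clean = vq(z_clean, codebook)
q_noisy, i_noisy = vq(z_noisy, codebook)
# both versions snap to identical codes: the decoder sees none of the noise
```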

Another advantage of using embedding vectors is that you can use autoregressive models (and similar) to repair parts of the image that are out of distribution. We use this for anomaly detection: with regular VAEs, we found that changes to latent-space dimensions can produce big, unwanted changes in the output, usually not localized to the anomalous region. With VQVAEs, swapping some embeddings generally preserves the overall output and only has a localized effect, which is exactly what we want for anomaly detection. For example, we can input an image of a damaged metal part, repair the anomalous embedding vectors and then decode, retrieving an image of how that part should look without the damage.
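A toy sketch of the repair step (hypothetical helpers, not the commenter's actual pipeline): each position in a code grid holds one index, an out-of-distribution index is swapped for an in-distribution one, and decoding shows the change stays local to that patch. A real system would pick the replacement with an autoregressive prior over the code grid rather than a hard-coded rule.

```python
import numpy as np

codebook = np.array([0.2, 1.0])  # code 0 = normal surface, code 1 = defect

def decode(indices):
    # toy "decoder": each code index paints exactly one patch of the output
    return codebook[indices]

def repair(indices, anomalous_code=1, normal_code=0):
    # swap out-of-distribution codes for the expected one
    return np.where(indices == anomalous_code, normal_code, indices)

grid = np.zeros((4, 4), dtype=int)
grid[2, 2] = 1  # a single damaged patch in the code grid

img_damaged = decode(grid)
img_repaired = decode(repair(grid))
# only the patch at (2, 2) changes; every other patch is reproduced exactly
```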
