Submitted by grid_world t3_y45kyh in deeplearning

For a VAE architecture and a dataset, say CIFAR-10, if the hidden/latent space is intentionally made large, say 1000-d, I am assuming that the VAE will automatically not use the extra variables/dimensions in the latent space which it does not need. The unneeded dimensions don't learn anything meaningful and therefore remain a standard multivariate Gaussian distribution. This serves as a signal that such dimensions can safely be removed without significantly impacting the model's performance.
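Concretely, the "signal" I have in mind could be measured with the per-dimension KL term of the VAE objective. A minimal sketch, assuming a PyTorch encoder that outputs mu and logvar (the encoder call and the 0.01 cut-off below are just placeholders, not part of any specific implementation):

```python
import torch

def per_dim_kl(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Average KL(q(z|x) || N(0, I)) per latent dimension over a batch.

    mu, logvar: [batch, latent_dim] tensors from the encoder.
    Dimensions whose KL stays near 0 are candidates for the "unused"
    dimensions described above.
    """
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp())  # [batch, latent_dim]
    return kl.mean(dim=0)                                # [latent_dim]

# Hypothetical usage with an encoder returning (mu, logvar):
# mu, logvar = encoder(images)
# kl_per_dim = per_dim_kl(mu, logvar)
# inactive = (kl_per_dim < 0.01).nonzero().squeeze(-1)   # threshold is arbitrary
```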

I have implemented quite a few of these, which can be referred to here.

Am I right with my hypothesis? Is there any research paper substantiating my hand wavy hypothesis?

13

Comments


The_Sodomeister t1_iscenee wrote

*If* your hypothesis is true (and I don't have enough direct experience with VAEs to say for certain), then how would you distinguish the layers which are outputting approximately-Gaussian noise from the layers which are outputting meaningful signals? Who's to say that the meaningful signal doesn't also appear approximately Gaussian? Or at least sufficiently Gaussian that it's not easily distinguishable from the others.

While I wouldn't go so far as to say that your hypothesis "doesn't happen", I also know from personal experience with other networks that NN models will tend to naturally over-parameterize if you let them. Regularization methods don't usually prevent the model from utilizing extra dimensions when it is able to, and it's not always clear whether the model could achieve the same performance with fewer dimensions vs. whether the extra dimensions are truly adding more representation capacity.

If some latent dimensions truly aren't contributing at all to the meaningful encoding, then I would think you could more likely identify this by looking at the weights in the decoder layers (as they wouldn't be needed to reconstruct the encoded input). I don't think this is as easy as it sounds, but I find it more plausible than determining this information strictly from comparing the distributions of the latent dimensions.
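As a very rough first pass (just a sketch, assuming the decoder's first layer is a `torch.nn.Linear` acting on the latent vector; the function name is made up), you could look at how strongly each latent dimension is wired into the decoder:

```python
import torch

def latent_input_norms(decoder_first_layer: torch.nn.Linear) -> torch.Tensor:
    """Column norms of the first decoder layer's weight matrix.

    weight has shape [out_features, latent_dim]; column j multiplies latent
    dimension j, so a near-zero column norm suggests that dimension has
    little influence on the reconstruction.
    """
    return decoder_first_layer.weight.norm(dim=0)  # [latent_dim]
```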

4

grid_world OP t1_ischbjf wrote

I don’t think that the Gaussians are being output by a layer. In contrast with an Autoencoder, where a sample is encoded to a single point, in a VAE, due to the Gaussian prior, a sample is now encoded as a Gaussian distribution. This is the regularisation effect which enforces this distribution in the latent space. It cuts both ways, meaning that if the true manifold is not Gaussian, we still assume and therefore force it to be Gaussian.

A Gaussian signal being meaningful is something that I wouldn’t count on. Diffusion models are a stark contrast, but we aren’t talking about them. The farther a signal is away from a standard Gaussian, the more information it’s trying to smuggle through the bottleneck.

I didn't get your point about looking at the decoder weights to figure out whether they are contributing. Do you compare them to their randomly initialized values to infer this?

3

The_Sodomeister t1_isci8nh wrote

This is the part I'm referencing:

> The unneeded dimensions don't learn anything meaningful and therefore remain a standard, multivariate, Gaussian distribution. This serves as a signal that such dimensions can safely be removed without significantly impacting the model's performance.

How do you explicitly measure and use this "signal"? I don't think you'd get far by just measuring "farness away from Gaussian", as you'd almost certainly end up throwing away certain useful dimensions that may simply appear "Gaussian enough".

> I didn't get your point about looking at the decoder weights to figure out whether they are contributing. Do you compare them to their randomly initialized values to infer this?

If the model reaches this "optimal state" where certain dimensions aren't contributing to the decoder output, then you should be able to detect this with some form of sensitivity analysis - i.e. changing the values in those dimensions shouldn't affect the decoder output, if those dimensions aren't being used.
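Something along these lines (a rough sketch, not a definitive implementation; it assumes a decoder that maps a batch of latent vectors to reconstructions, and the perturbation size is arbitrary):

```python
import torch

@torch.no_grad()
def latent_sensitivity(decoder, z: torch.Tensor, eps: float = 1.0) -> torch.Tensor:
    """Mean change in decoder output when each latent dim is perturbed by eps.

    decoder: maps [batch, latent_dim] -> reconstruction.
    z:       [batch, latent_dim] latent codes (e.g. posterior means).
    Returns a [latent_dim] tensor; values near 0 suggest the decoder
    effectively ignores that dimension.
    """
    base = decoder(z)
    sens = torch.zeros(z.shape[1])
    for d in range(z.shape[1]):
        z_pert = z.clone()
        z_pert[:, d] += eps            # nudge one dimension at a time
        sens[d] = (decoder(z_pert) - base).abs().mean()
    return sens
```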

This assumes that the model would correctly learn to ignore unnecessary latent dimensions, but I'm not confident it would actually accomplish that.

3

grid_world OP t1_isduwtm wrote

Retraining the model with reduced dimensions would be a *rough* way of _proving_ this. But the stochastic behavior of neural networks makes this hard to achieve.

1

LuckyLuke87b t1_iswyslx wrote

I fully agree with your idea and observed similar behavior. I'm not aware of literature regarding VAEs, but I believe there was quite some fundamental work before deep learning on pruning Bayesian neural network weights based on the posterior entropy or "information length". Similarly, I would consider this latent dimension selection as a way of pruning, based on how much information is represented.

1

grid_world OP t1_isx22tm wrote

I have been running some experiments on toy datasets (MNIST, CIFAR-10), and for now it seems that very few of the latent variables z (measured via the mu and logvar vectors) are ever close to 0. Mathematically this makes sense, since all of the latent variables will learn at least some information which is not garbage (standard Gaussian). So deciding the optimal latent space dimensionality still eludes me.
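For reference, the statistic I've been checking is roughly the one below (a sketch; it is close in spirit to the "active units" measure from the importance weighted autoencoders paper by Burda et al., where a dimension counts as active if its posterior mean actually varies with the input; the encoder signature and the threshold are assumptions):

```python
import torch

@torch.no_grad()
def active_dimensions(encoder, data_loader, threshold: float = 1e-2):
    """Variance of the posterior mean mu across the dataset, per dimension.

    Dimensions where mu barely moves with the input (variance < threshold)
    are effectively collapsed to the prior and are pruning candidates.
    """
    mus = []
    for x, _ in data_loader:
        mu, _logvar = encoder(x)      # assumed encoder signature
        mus.append(mu)
    mus = torch.cat(mus, dim=0)       # [num_samples, latent_dim]
    var = mus.var(dim=0)              # [latent_dim]
    return var, (var > threshold)
```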

1

LuckyLuke87b t1_it19qkj wrote

Have you tried to generate samples by sampling from your latent space prior and feeding them to the decoder? In my experience it is often necessary to tune the weight of the KL loss such that the decoder is a proper generator. Once this is done, some of the latent dimensions get very close to the prior distribution, while others represent the relevant information. The next step is to compare whether these relevant latent dimensions are the same across various encoded samples. Finally, prune all dimensions which basically never differ from the prior, up to some tolerance.
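The last two steps could look roughly like this (only a sketch under my assumptions: mu and logvar collected from the encoder over many samples, and a tolerance you would have to tune for your model):

```python
import torch

@torch.no_grad()
def prune_mask(mu: torch.Tensor, logvar: torch.Tensor, tol: float = 0.1) -> torch.Tensor:
    """Boolean mask of latent dims that deviate from the N(0, I) prior.

    mu, logvar: [num_samples, latent_dim] collected over many encoded samples.
    A dimension is kept if, on average, its posterior differs from the prior
    (mean away from 0 or std away from 1) by more than tol.
    """
    mean_dev = mu.abs().mean(dim=0)                          # |mean - 0|
    std_dev = ((0.5 * logvar).exp() - 1).abs().mean(dim=0)   # |std - 1|
    return (mean_dev > tol) | (std_dev > tol)

# keep = prune_mask(all_mu, all_logvar)
# pruned_latent_dim = int(keep.sum())
```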

1