Submitted by PleaseKillMeNowOkay t3_xtadfd in deeplearning
I have a neural network whose outputs are the parameters of a probability distribution. I have a second neural network whose outputs are the parameters of a distribution with a more general covariance structure than the first (the first distribution is a special case of the second). Is my second network guaranteed to perform at least as well as my first?
Apologies for the vague description. I am not sure how much I'm allowed to talk about it. Literally, any help is appreciated.
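For concreteness, a common instance of this setup is a network head that outputs Gaussian parameters, either with a diagonal covariance or with a full covariance via a Cholesky factor. The post does not say which distribution is involved, so this is only a hedged sketch of the nesting relationship: setting the off-diagonal Cholesky entries to zero makes the full-covariance likelihood coincide with the diagonal one.

```python
import numpy as np

def nll_diag(x, mu, log_sigma):
    # Negative log-likelihood of a diagonal-covariance Gaussian,
    # parameterized by per-dimension mean and log standard deviation.
    var = np.exp(2.0 * log_sigma)
    return 0.5 * np.sum((x - mu) ** 2 / var + 2.0 * log_sigma + np.log(2.0 * np.pi))

def nll_full(x, mu, L):
    # Negative log-likelihood of a full-covariance Gaussian with
    # Sigma = L @ L.T, L lower-triangular with positive diagonal.
    d = len(x)
    z = np.linalg.solve(L, x - mu)          # whitened residual
    log_det = 2.0 * np.sum(np.log(np.diag(L)))  # log det(Sigma)
    return 0.5 * (z @ z + log_det + d * np.log(2.0 * np.pi))

# Nesting check: with zero off-diagonals, the full model reduces to the diagonal one.
x = np.array([0.3, -1.2])
mu = np.array([0.0, 0.5])
log_sigma = np.array([0.1, -0.4])
L_diag = np.diag(np.exp(log_sigma))
assert np.isclose(nll_diag(x, mu, log_sigma), nll_full(x, mu, L_diag))
```

Because the full-covariance family strictly contains the diagonal one, the second model can always represent the first model's best fit; whether it actually reaches it depends on optimization and regularization, not just capacity.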
thebear96 t1_iqqsaoe wrote
Assuming the same hyperparameters, the second network should theoretically converge to a solution faster. So you will need to tune the hyperparameters and maybe add some dropout so that the model doesn't overfit.