thebear96 t1_iqr04o9 wrote
Reply to comment by PleaseKillMeNowOkay in Neural network that models a probability distribution by PleaseKillMeNowOkay
Ideally it should. In that case the second architecture will perform worse, and you'll have to note that when you compare. But since it's expected that the second architecture won't perform as well as the first, I'm not sure there's much use in comparing. It's definitely doable, though.
thebear96 t1_iqqykoz wrote
Reply to comment by PleaseKillMeNowOkay in Neural network that models a probability distribution by PleaseKillMeNowOkay
That shouldn't make much of a difference, but yes, in that case the performance should be worse than the first network's; it's far easier to predict two outputs than four. You can try adding more linear layers and using a lower learning rate to see if the model improves (a rough sketch follows below).
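As a minimal sketch of what "more layers, lower learning rate" could look like, assuming PyTorch (the thread's talk of linear layers and dropout suggests it); the 30-feature input, four outputs, layer widths, and learning rate are all illustrative placeholders, not OP's actual setup:

```python
import torch
import torch.nn as nn

# Hypothetical feed-forward network with an extra hidden layer.
# 30 input features and 4 outputs are placeholders, not OP's real shapes.
model = nn.Sequential(
    nn.Linear(30, 64),
    nn.ReLU(),
    nn.Linear(64, 64),  # extra hidden layer added for capacity
    nn.ReLU(),
    nn.Linear(64, 4),   # four outputs instead of two
)

# A lower learning rate than Adam's usual 1e-3 default.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```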
thebear96 t1_iqqxkb4 wrote
Reply to comment by PleaseKillMeNowOkay in Neural network that models a probability distribution by PleaseKillMeNowOkay
That's strange. It could be a data quantity issue; bigger networks will typically need more data to perform well.
thebear96 t1_iqqwxur wrote
Reply to comment by PleaseKillMeNowOkay in Neural network that models a probability distribution by PleaseKillMeNowOkay
Is the loss decreasing enough after running for the specified number of epochs? Are you getting a flat tail after convergence?
thebear96 t1_iqqsaoe wrote
Assuming the same hyperparameters, the second network should theoretically converge to the solution quicker. So you'll need to adjust the hyperparameters and maybe add some dropout so that the model doesn't overfit (see the sketch below).
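A minimal sketch of adding dropout, again assuming PyTorch; the layer sizes and dropout probability are illustrative, not from the thread:

```python
import torch.nn as nn

# Hypothetical network with dropout between hidden layers to curb overfitting.
model = nn.Sequential(
    nn.Linear(30, 64),
    nn.ReLU(),
    nn.Dropout(p=0.2),  # zeroes 20% of activations at random during training
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(64, 2),
)
```

Note that dropout is only active in training mode; calling `model.eval()` disables it for evaluation.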
thebear96 t1_iqtxrnx wrote
Reply to comment by sydjashim in Neural network that models a probability distribution by PleaseKillMeNowOkay
Well, I assumed that the network had more layers and therefore more parameters. More parameters can represent the data better and faster. For example, if you had a dataset with 30 features, a linear layer with 64 neurons should be able to represent each data point more easily than, say, a linear layer with 16 neurons. That's why I think the model would converge quicker. But in OP's case the hidden layers are the same; only the output layer has more neurons. In that case we won't get the quicker convergence (a quick parameter count is sketched below).
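To make the 64-vs-16 example concrete, here's a small PyTorch sketch counting parameters for the two hypothetical first layers; the numbers follow directly from the 30-feature example above:

```python
import torch.nn as nn

# Two candidate first layers for a 30-feature input, as in the example above.
wide = nn.Linear(30, 64)    # 30*64 weights + 64 biases = 1984 parameters
narrow = nn.Linear(30, 16)  # 30*16 weights + 16 biases = 496 parameters

for name, layer in [("wide", wide), ("narrow", narrow)]:
    n_params = sum(p.numel() for p in layer.parameters())
    print(f"{name}: {n_params} parameters")
```

The wider layer has roughly 4x the parameters, which is the extra capacity behind the quicker-convergence intuition.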