
suflaj t1_j3bt2eq wrote

That learning rate is about 100 times higher than what you would typically give Adam for that batch size. That weight decay is also about 100 times higher than usual, and if you want to use weight decay with Adam, you should probably use the AdamW optimizer instead (it is more or less the same thing, but it fixes the interaction between Adam and weight decay).
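For reference, a minimal sketch of what that might look like in PyTorch (the model and the exact values here are placeholders, not taken from your post):

```python
import torch

# Placeholder model; swap in your own network.
model = torch.nn.Linear(128, 10)

# AdamW decouples weight decay from the gradient update,
# avoiding the Adam/L2 interaction mentioned above.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,            # something on the order of 1e-4 to 1e-3 is typical for Adam-family optimizers
    weight_decay=1e-2,  # AdamW's default; tune as needed
)
```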

Also, loss alone does not tell you how much a model has learned. You should look at validation F1, or whatever metrics are actually relevant to your model's performance.
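Something like this is enough to get a validation F1 number (a rough sketch assuming a classification setup where `model` and `val_loader` already exist):

```python
import torch
from sklearn.metrics import f1_score

# Hypothetical validation pass; `model` and `val_loader` are assumed.
model.eval()
all_preds, all_labels = [], []
with torch.no_grad():
    for inputs, labels in val_loader:
        logits = model(inputs)
        all_preds.extend(logits.argmax(dim=1).tolist())
        all_labels.extend(labels.tolist())

# Macro F1 averages per-class F1, which is more informative than loss
# (or plain accuracy) when classes are imbalanced.
print("validation F1:", f1_score(all_labels, all_preds, average="macro"))
```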

1

AKavun OP t1_j3btlem wrote

I also have a validation accuracy of around 50%, which is basically what you would expect from random guessing.

I removed the weight decay to keep things simpler and adjusted the learning rate to 0.0003. I will update this thread with the results.

Thank you for taking the time to help

1

suflaj t1_j3bubtm wrote

Another problem you will likely have is your very small convolutions. Output channels of 8 and 16 are probably only enough to solve MNIST. You should probably use something more like 32 and 64, and use larger kernels and strides to reduce reliance on the linear layers to do the work for you.

Finally, you are not using nonlinear activations between layers. Without them, your whole network essentially acts like one smaller convolutional layer followed by a flatten and a softmax.
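Something along these lines would address both points; this is purely illustrative, with the input size, channel counts, and number of classes assumed rather than taken from your code:

```python
import torch.nn as nn

# Rough sketch: wider channels, larger kernels/strides,
# and ReLU nonlinearities between layers.
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2),
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=5, stride=2, padding=2),
    nn.ReLU(),
    nn.Flatten(),
    # Assumes 64x64 RGB inputs and 2 classes; cross-entropy loss in
    # PyTorch applies the softmax for you, so no explicit softmax here.
    nn.Linear(64 * 16 * 16, 2),
)
```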

1