Submitted by perfopt t3_xyrolr in deeplearning

When I try to improve validation accuracy by using L2 regularization and Dropout, the model performs worse than without them. What am I doing wrong?

Also, please give me any suggestions to improve validation accuracy.

My classification problem has 100 categories.

Data: I have 6865 samples, which is an average of 68.65 samples per category. The category with the fewest samples has 52 and the one with the most has 75.

Below is my model summary.

_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 flatten (Flatten)           (None, 22399)             0         
                                                                 
 dense (Dense)               (None, 1024)              22937600  
                                                                 
 dense_1 (Dense)             (None, 1024)              1049600   
                                                                 
 dense_2 (Dense)             (None, 512)               524800    
                                                                 
 dense_3 (Dense)             (None, 512)               262656    
                                                                 
 dense_4 (Dense)             (None, 100)               51300     
                                                                 
=================================================================
Total params: 24,825,956
Trainable params: 24,825,956
Non-trainable params: 0
                    Train Accuracy   At Epoch   Test/Val Accuracy   At Epoch
Baseline            0.9981           180        0.7074              258
With L2 + Dropout   0.7725           297        0.5196              244

Edit: I am using these hyperparameters: L2 0.001 and Dropout 0.1.
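For reference, a minimal Keras sketch with those settings (layer sizes are taken from the summary above; the activations, optimizer, and loss here are assumed, not necessarily what was actually used):

from tensorflow import keras
from tensorflow.keras import layers, regularizers

l2 = regularizers.l2(1e-3)                                         # the L2 = 0.001 setting
model = keras.Sequential([
    layers.Input(shape=(1723, 13)),
    layers.Flatten(),                                              # -> 22399 features
    layers.Dense(1024, activation="relu", kernel_regularizer=l2),
    layers.Dropout(0.1),
    layers.Dense(1024, activation="relu", kernel_regularizer=l2),
    layers.Dropout(0.1),
    layers.Dense(512, activation="relu", kernel_regularizer=l2),
    layers.Dropout(0.1),
    layers.Dense(512, activation="relu", kernel_regularizer=l2),
    layers.Dropout(0.1),
    layers.Dense(100, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])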

3

Comments


jellyfishwhisperer t1_iriddyc wrote

Regularization and dropout help with overfitting. They will almost always reduce your training accuracy. What you need is a test dataset, and to compare performance there.

11

perfopt OP t1_iridohw wrote

Thank you for the response. I am splitting my data into train and validation sets. Do you mean another set for testing?

The baseline is overfitting: train accuracy is really high and val accuracy is much lower. That is why I added L2 + Dropout.

Since the validation accuracy is still very low (52%), shouldn't I focus on improving that?

1

manuLearning t1_irie761 wrote

I have always had good experiences with dropout. Try putting a dropout layer of around 0.75 after your first layer and one dropout layer before your last layer. You can also put a light 0.15 dropout layer before your first layer.
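For example, roughly like this in Keras (the layer widths are just illustrative, and the rate before the last layer is a placeholder):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(22399,)),
    layers.Dropout(0.15),                  # light dropout before the first layer
    layers.Dense(1024, activation="relu"),
    layers.Dropout(0.75),                  # heavy dropout after the first layer
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.5),                   # dropout before the last layer (pick a rate)
    layers.Dense(100, activation="softmax"),
])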

How similar are the test and val sets?

2

perfopt OP t1_iriens4 wrote

For creating the test and val sets I used train_test_split from sklearn.

I'll manually examine it.

But in general, shouldn't the distribution be OK?

from sklearn.model_selection import train_test_split
inputs_train, inputs_test, targets_train, targets_test = train_test_split(inputs, targets, test_size=0.1)
1

manuLearning t1_irij2hl wrote

A rule of thumb is to take around 30% as the val set.

1

perfopt OP t1_irij7vh wrote

I tried that as well, with similar results when adding L2 + Dropout.

0

chatterbox272 t1_irifzyq wrote

Your model is a teeny-tiny MLP and your dataset is relatively small; it's entirely possible that you're unable to extract rich enough information to do better than 70% on the val set.

You also haven't mentioned how much L2 or Dropout you're using, nor how they do on their own. Both of those methods come with their own hyperparameters which need to be tuned.

4

perfopt OP t1_irig9zc wrote

I see. I'll try increasing the amount of data used. My fear is that it may lead to some categories having much less data than others.

L2 0.001 and Dropout 0.1

1

kingfung1120 t1_iriks76 wrote

What type of data are you inputting into the model?

2

perfopt OP t1_iril212 wrote

The data is MFCCs created from audio files. Sort of like this - https://www.youtube.com/watch?v=szyGiObZymo

1

kingfung1120 t1_iripkba wrote

I haven't handled audio data before, but it seems like you are flattening [1723, 13]-shaped data into a vector (correct me if I am wrong), which is definitely going to affect the information the model can learn, since the data is sequential and 2-D.

Unfortunately, I haven't studied or read anything related to deep learning on audio data, so I can't give you a more in-depth opinion, but based on my understanding, using a CNN or anything recurrent should improve performance more than fine-tuning an MLP.
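As a very rough illustration (I haven't tested this on audio, and all the sizes are guesses), a small 1-D CNN over the MFCC frames could look something like this in Keras:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(1723, 13)),                       # MFCC frames over time, 13 coefficients each
    layers.Conv1D(32, kernel_size=5, activation="relu"),
    layers.MaxPooling1D(4),
    layers.Conv1D(64, kernel_size=5, activation="relu"),
    layers.MaxPooling1D(4),
    layers.GlobalAveragePooling1D(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(100, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])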

2

perfopt OP t1_iriqicz wrote

Yes, you are correct. I am flattening (1723, 13)-shaped data.

I will try out a CNN as well.

2

kingfung1120 t1_irirqor wrote

Look forward to receiving updates from you ;)

1

perfopt OP t1_iritkyy wrote

Certainly. I've got to travel for a couple of days, but Tuesday after work I'll be back on this.

2

perfopt OP t1_is5q78g wrote

Got back to a totally crazy week at work. Finally got time to spend on my project. I think I need to simplify my inputs and give MFCCs another try before jumping into CNNs.

2

DrXaos t1_irjz674 wrote

log2(100) is about 6.64, and with 6865 samples that's about 45.5K bits needed to fully encode/memorize the labels. You have far more than that in the effective number of bits in the free parameters. 25 million parameters? I train binary classification models with 5000 params and a million observations.

You need some feature engineering and simplification of the model.

Are you doing something like this? https://en.wikipedia.org/wiki/Mel-frequency_cepstrum

Your frequency grid might be far too fine, and you may need some windowing/filtering first. What's the structure of the (1723, 13) input?

Given this is there some sort of informed unsupervised transformation to lower dimensionality you could use before the supervised classifier?

What you're seeing is the limits of purely blind statistical modeling, and since your dataset size isn't so big you'll have to build in some priors about the underlying 'physics' somehow through processing or structuring your model.

2

perfopt OP t1_irkuz3j wrote

I don't follow the computation. How do you get 45.5k bits?

I tried a model with [512, 512, 512] (units in each layer) and that performed very poorly: < 0.2 accuracy.

1

DrXaos t1_irl3dr7 wrote

It's information theory. If the prior is uniform across the 100 classes (i.e. 1/100, the worst case), it hypothetically takes -log2(p) = log2(100) bits to specify one actual label. Imagine it were 64 labels; then the explicit encoding is obvious: 6 bits. Information theory still works without an explicit physical encoding in the appropriate limit. If the priors are non-uniform it's even lower. There are 6865 examples. That's all the independent information about the labels which exists.

If you were to write out all the labels in a file, it could be compressed to no less than 45.5k bits if their probability distribution were uniform. So with hypothetically 45.5k bits in arbitrary free params you could memorize the labels. Of course in modeling there are practical constraints and regularization so this doesn't happen at that level but it should give you some pause. I know there are non-classical statistical behaviors with big models like double descent but I'm not sure we're there in this problem.
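Spelled out, the arithmetic is just:

import math

bits_per_label = math.log2(100)        # ~6.64 bits to pick one of 100 equally likely labels
total_bits = 6865 * bits_per_label     # ~45.6k bits, i.e. the ~45.5K figure above
print(bits_per_label, total_bits)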

I think you may be trying to do too much blind modeling without thinking. If you had to classify or cluster the signals by eyeball, what would you look at? Can you start with a linear model? What features would you put in for that? If you're doing something like the MFCC from 'librosa' (as in the youtube video), there are all sorts of complex time-domain and frequency-domain signal processing parameters in there that will strongly influence the results; I would concentrate on those foremost.

As a first cut, instead of going directly to a high-parameter classifier which requires iterative stochastic training, I would use a preliminary but fast-to-compute and (almost) deterministically optimizable criterion to help choose your input space and signal processing parameters. What about clustering? If you had to do simple clustering in a Euclidean input space, what space would you use? You could literally program this and measure performance: how many observations are closer to their own class centroid than to some other class's centroid? Or just measure the distance to the correct centroid. Can you optimize to get good performance on that? Once you do, a high-effort complex classifier like a deep net would have a good head start and would help push performance further.
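A minimal sketch of that centroid check with scikit-learn, reusing the `inputs` and `targets` arrays from earlier in the thread (the plain flattening here is just one candidate feature space to evaluate):

from sklearn.neighbors import NearestCentroid
from sklearn.model_selection import train_test_split

X = inputs.reshape(len(inputs), -1)        # flatten each (1723, 13) sample into one vector
X_train, X_val, y_train, y_val = train_test_split(X, targets, test_size=0.3, stratify=targets)
clf = NearestCentroid()                    # classifies each sample by its nearest class centroid
clf.fit(X_train, y_train)
print("nearest-centroid accuracy:", clf.score(X_val, y_val))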

Or even what would a Naive Bayes model look like? Can you make/select features for that?

Also, one big consideration: often in audio classification there is time-translation invariance, in that the exact starting moment isn't a physically important parameter, akin to image classification with 2-D x-y spatial translation invariance. If that's true, then you could do lots of augmentation for your train set and make more signals of the same class by applying translation operators.
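For example, a simple hypothetical shift along the time axis, reusing the OP's train split names (the max_shift value is arbitrary):

import numpy as np

def time_shift(mfcc, max_shift=100):
    # randomly shift a (frames, coeffs) MFCC matrix along the time axis (wraps around)
    shift = np.random.randint(-max_shift, max_shift + 1)
    return np.roll(mfcc, shift, axis=0)

# double the training set with shifted copies; the labels stay the same
augmented = np.stack([time_shift(x) for x in inputs_train])
inputs_aug = np.concatenate([inputs_train, augmented])
targets_aug = np.concatenate([targets_train, targets_train])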

Also consider performance measures other than 0/1 accuracy. Is that 'top 1' accuracy? And if the background accuracy is 0.01 (a 1/100 chance to get it right), then 0.2 might be considered good.

The no-information background performance is a score proportional to the prior probabilities, or maybe the log-odds thereof. Measure lift above that.
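For instance, with scikit-learn (here `y_val` are the true labels and `probs` the model's predicted class probabilities, both placeholders):

from sklearn.metrics import top_k_accuracy_score

# probs: (n_samples, 100) predicted probabilities; y_val: true labels
top1 = top_k_accuracy_score(y_val, probs, k=1)
top5 = top_k_accuracy_score(y_val, probs, k=5)
print("top-1:", top1, "top-5:", top5)
print("lift over the 1/100 chance baseline:", top1 / 0.01)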

2

kingfung1120 t1_irmd87k wrote

Hi, I am still quite new to data science; this is the first time I've seen someone use information theory to gauge whether a neural network has a suitable number of parameters.

Do you mind sharing more, like a reference or some examples? I would love to know more about this. Thank you!

1