Submitted by **perfopt** t3_xyrolr
in **deeplearning**

#
**DrXaos**
t1_irjz674 wrote

log2(100) is about 6.64, and with 6865 samples that's roughly 45.5K bits needed to fully encode/memorize the labels. You have way more than that in the effective number of bits in the free parameters. 25 million parameters? I train binary classification models with 5000 params and a million observations.

You need some feature engineering and simplification of the model.

Are you doing something like this? https://en.wikipedia.org/wiki/Mel-frequency_cepstrum

Your frequency grid might be far too fine, and you may need some windowing/filtering preprocessing first. What's the structure of the (1723, 13) input?

Given this, is there some sort of informed unsupervised transformation to lower dimensionality that you could use before the supervised classifier?
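For concreteness, here is a minimal Python sketch of that kind of preprocessing, assuming the (1723, 13) array is MFCC frames x coefficients from something like librosa (the filename and the pooling choice are purely illustrative):

```python
import librosa
import numpy as np

# Hypothetical example: 13 MFCCs from one clip, then collapse the time axis
# so the classifier sees a small fixed-length vector instead of ~1723 frames.
y, sr = librosa.load("clip.wav", sr=None)            # "clip.wav" is a placeholder path
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)

# Simple unsupervised reduction: per-coefficient statistics over time.
features = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])  # shape: (26,)
```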

What you're seeing are the limits of purely blind statistical modeling, and since your dataset isn't that big, you'll have to build in some priors about the underlying 'physics', either through preprocessing or by structuring your model.

#
**perfopt**
OP
t1_irkuz3j wrote

I don't follow the computation. How do you get 45.5k bits?

I tried a model with [512, 512, 512] (perceptrons in each layer) and it performed very poorly: < 0.2 accuracy.

#
**DrXaos**
t1_irl3dr7 wrote

It's information theory. If the prior is uniform across the 100 classes (i.e. 1/100, the worst case), it takes -log2(p) = log2(100) bits to specify one actual label. Imagine there were 64 labels; then the explicit encoding is obvious: 6 bits. Information theory still works without an explicit physical encoding in the appropriate limit. If the priors are non-uniform it's even lower. There are 6865 examples, and that's all the independent information about the labels that exists.

If you were to write out all the labels in a file, it could be compressed to no fewer than 45.5k bits if their distribution were uniform. So with 45.5k bits of arbitrary free parameters you could, hypothetically, memorize the labels. Of course, in practice there are constraints and regularization, so it doesn't happen at exactly that level, but it should give you some pause. I know there are non-classical statistical behaviors with big models, like double descent, but I'm not sure we're there in this problem.
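In Python, the back-of-the-envelope version (the 25 million figure is the parameter count mentioned above):

```python
import math

n_classes = 100
n_samples = 6865
n_params = 25_000_000                          # rough parameter count mentioned above

bits_per_label = math.log2(n_classes)          # ~6.64 bits under a uniform prior
bits_to_memorize = bits_per_label * n_samples  # the ~45.5k bits quoted above

print(f"{bits_to_memorize:,.0f} bits to encode all labels")
print(f"parameters per label-bit: {n_params / bits_to_memorize:,.0f}")
```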

I think you may be trying to do too much blind modeling without thinking. If you had to classify or cluster the signals by eyeball, what would you look at? Can you start with a linear model? What features would you put into that? If you're doing something like the MFCC from 'librosa' (as in the YouTube video), there are all sorts of complex time-domain and frequency-domain signal processing parameters in there that will strongly influence the results; I would concentrate on those foremost.

As a first cut, instead of going directly to a high-parameter classifier that requires iterative stochastic training, I would use a preliminary but fast-to-compute and (almost) deterministically optimizable criterion to help choose your input space and signal processing parameters. What about clustering? If you had to do simple clustering in a Euclidean input space, what space would you use? You could literally program this and measure performance: how many observations are closer to their own class centroid than to some other class's centroid? (Or just measure the distances when the nearest centroid isn't the correct one.) Can you optimize to get good performance on that? See the sketch below. Once you do, a high-effort complex classifier like a deep net starts with a good head start and can push performance further.
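Here is a rough sketch of that centroid check in Python (X is whatever (n_samples, n_features) feature matrix you're considering and y the integer labels; the random arrays at the bottom are placeholders for your real data):

```python
import numpy as np

def centroid_accuracy(X, y):
    """Fraction of observations closer to their own class centroid than to
    any other class centroid, using plain Euclidean distance."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    # distance from every observation to every class centroid
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
    nearest = classes[d.argmin(axis=1)]
    return (nearest == y).mean()

# Placeholder data just so this runs; swap in your real features and labels.
X = np.random.randn(6865, 26)
y = np.random.randint(0, 100, size=6865)
print(centroid_accuracy(X, y))
```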

Or even what would a Naive Bayes model look like? Can you make/select features for that?
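A Naive Bayes baseline is only a few lines with scikit-learn; again X and y here are random placeholders for whatever features you settle on:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# X: (n_samples, n_features) feature matrix, y: integer labels; placeholders here.
X = np.random.randn(6865, 26)
y = np.random.randint(0, 100, size=6865)

scores = cross_val_score(GaussianNB(), X, y, cv=5)
print("5-fold accuracy:", scores.mean())
```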

Also, one big consideration: in audio classification there is often a time-translation invariance, in that the exact moment the signal starts isn't a physically important parameter, akin to image classification with 2-D x-y spatial translation invariance. If that's true, you could do lots of augmentation, making more signals of the same class by applying translation operators to your training set.
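A sketch of that kind of augmentation (a random circular shift of the raw waveform before feature extraction; the shift range is arbitrary and this only makes sense if the label really is start-time invariant):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_time_shift(signal, max_shift, rng=rng):
    """Return a copy of a 1-D signal circularly shifted by a random number
    of samples; valid only if the class label is start-time invariant."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(signal, shift)

waveform = np.random.randn(16000)    # placeholder; use your real clips
augmented = [random_time_shift(waveform, max_shift=2000) for _ in range(5)]
```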

Also consider performance measures other than 0/1 accuracy. Is that 'top 1' accuracy? And if the background accuracy is 0.01 (a 1/100 chance of getting it right), then 0.2 might be considered good.

The no-information background performance is a score proportional to the prior probabilities (or maybe the log-odds thereof). Measure lift above that.
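For example (predicted probabilities and labels are random placeholders; the point is just the bookkeeping):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_classes = 6865, 100
y_true = rng.integers(0, n_classes, size=n_samples)         # placeholder labels
probs = rng.dirichlet(np.ones(n_classes), size=n_samples)   # placeholder model output

top1 = (probs.argmax(axis=1) == y_true).mean()
top5 = np.mean([t in row for t, row in zip(y_true, probs.argsort(axis=1)[:, -5:])])

# No-information baseline: the best you can do from the priors alone is to
# always predict the most common class.
priors = np.bincount(y_true, minlength=n_classes) / n_samples
baseline = priors.max()
print(f"top-1 {top1:.3f}, top-5 {top5:.3f}, lift over baseline {top1 / baseline:.1f}x")
```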

#
**kingfung1120**
t1_irmd87k wrote

Hi, I am still quite new to data science, and this is the first time I've seen someone use information theory to gauge whether a neural network has a suitable number of parameters.

Do you mind sharing more, like a reference or some examples? I would love to know more about this. Thank you!
