
Buddy77777 t1_j4uvmzx wrote

If it’s converged on validation very flatly, it’s likely converged at a local minimum, possibly for the reasons I mentioned above… but you can also try adjusting hyperparameters, choosing curated weight initializations (not pretrained), data augmentation, and the plethora of techniques that fall into the broad category of adversarial training.
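Rough sketch of what the first couple of those could look like, assuming PyTorch/torchvision, a ResNet-18 chosen just for illustration, and a placeholder dataset path:

```python
import torch.nn as nn
from torchvision import datasets, transforms, models

# Data augmentation: random crops, flips, and mild color jitter
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
])
train_ds = datasets.ImageFolder("data/train", transform=train_tf)  # placeholder path

def init_weights(m):
    # Kaiming init: a common "curated" (non-pretrained) initialization for ReLU nets
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
        if m.bias is not None:
            nn.init.zeros_(m.bias)

model = models.resnet18(weights=None)  # training from scratch, not pretrained
model.apply(init_weights)
```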

3

Buddy77777 t1_j4utcmg wrote

Assuming no bugs, consider that there are features and representations that generalize to varying degrees, at varying receptive fields corresponding to depth in the network.

If a pretrained CNN has gone through extensive training, the representations its kernels have learned across millions of images suggest it has already picked up many features that generalize very well to your dataset.

These could range from Gabor-like filters from the get-go at low receptive fields near the input of the CNN, to more complex but still generalizable features deeper within.
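You can eyeball this yourself. A minimal sketch, assuming torchvision and matplotlib, that plots the first conv layer of an ImageNet-pretrained ResNet-18 (many of those 7x7 kernels look like Gabor-style edge/color detectors):

```python
import matplotlib.pyplot as plt
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
kernels = model.conv1.weight.detach()                   # shape: (64, 3, 7, 7)
kernels = (kernels - kernels.min()) / (kernels.max() - kernels.min())  # rescale to [0, 1]

fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for ax, k in zip(axes.flat, kernels):
    ax.imshow(k.permute(1, 2, 0).numpy())               # channels last for imshow
    ax.axis("off")
plt.show()
```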

It’s possible, and likely, that pretraining went through extensive hyperparameter tuning (including curated weight initialization) that gave it routes to relatively better optima than yours.

It’s possible that, with enough time, your implementation would reach that accuracy as well… but consider how long it takes to train effectively on millions of images! Even starting from the same weights, the pretrained model has a significant training advantage.

You have the right intuition that you’d expect a model trained on a restricted domain to do better on that domain… but oftentimes intuition in ML is backwards. Restricting the domain can let a network get away with weaker representations (especially true for classification tasks, which, compared to something like segmentation, require much less representational granularity).

Pretraining on many images, however, enforces a more robust model: it requires further differentiation between many competing features and classes, which demands stronger representations. Those stronger representations can make the difference on the edge-case samples that take you from 88% to 95%. Conversely, if weak representations can already get away with solving the restricted task, there are no competing features or classes pushing the network to explore better optima, so it’s really easy to fall into a high (i.e., poor) local minimum.

I’m sure there are more possibilities we could theorize… and I’m quite possibly wrong about some of the ones I suggested… but, as you’ll discover with ML, things are more empirical than theoretical. To some (however small) extent, the question “why does it work better?” can be answered with: it just does lol.

Rereading your post: the key point is your observation about quickly reaching 95%. Again, the pretrained model has already learned features it can exploit and perhaps just has to reweight its linear classifier, for example.
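That “just reweight the linear classifier” idea, as a minimal sketch (assuming torchvision; NUM_CLASSES is a placeholder for your dataset): freeze the pretrained backbone and train only a new final layer, i.e., linear probing.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 10  # placeholder: your dataset's number of classes

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for p in model.parameters():
    p.requires_grad = False                    # freeze all pretrained features

model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # fresh head, trainable by default

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)  # only the head gets updated
```

Since only the final layer is optimized, this usually converges in a handful of epochs, which is consistent with hitting high accuracy quickly.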

Anyways, my main point is that, generally, the outcome you witnessed is not surprising, for the reasons I gave and possibly others as well.

Oh, and keep in mind that not all hyperparameters are equal, which is to say not all training procedures are equal. Their training setup is very likely an important factor and an edge, even if all else were equal.
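For a sense of what “training procedure” covers here, a minimal sketch with illustrative values only (not the actual recipe used by any specific pretrained model): optimizer choice, weight decay, an LR schedule, and label smoothing all tend to matter.

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=None)

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)        # soften the targets a bit
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=90)  # decay LR over ~90 epochs
```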

Model performance is predicted by 1/3 data quality/quantity, 1/3 parameter count, 1/6 neural architecture, 1/6 training procedures/hyperparameters, and of course 99/100 words of encouragement via the terminal.

17