Submitted by tsgiannis t3_10f5lnc in deeplearning

Well, this is a question I never thought I'd have to ask.

We have a well-known pretrained model; we remove the top layer and replace it with a "small" CNN head.

Training returns around 88% accuracy.

Now we take the exact same architecture, I implement it from scratch, and I add the same "small" CNN on top.

Training so far returns only 85% accuracy.

Shouldn't the "from scratch" implementation return higher accuracy no matter what?

I know the "pretrained" weights have adjusted to a "million" different images... but they are a "million" different images, not the specific images I want to train on.

The strange thing is that the "from scratch" version struggles to get its training accuracy over 86%, while the pretrained one jumps very quickly to around 95% training accuracy.

Any clues/ideas?
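To make the setup concrete, here is a rough TF/Keras sketch of what I mean. The backbone (EfficientNetB0), the 224x224 input size, and the 10-class head are only placeholders for illustration, not the actual model or dataset:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def small_head(num_classes=10):  # placeholder class count
    return models.Sequential([
        layers.GlobalAveragePooling2D(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(num_classes, activation="softmax"),
    ])

# Setup A: well-known backbone with ImageNet weights, frozen, plus the small head
base = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))
base.trainable = False  # only the new head is trained
pretrained_model = models.Sequential([base, small_head()])

# Setup B: the exact same architecture, but every weight starts from random init
scratch_base = tf.keras.applications.EfficientNetB0(
    include_top=False, weights=None, input_shape=(224, 224, 3))
scratch_model = models.Sequential([scratch_base, small_head()])

for m in (pretrained_model, scratch_model):
    m.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```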

21

Comments


Buddy77777 t1_j4utcmg wrote

Assuming no bugs, consider that there are features and representations that generalize to varying degrees and at varying receptive fields, corresponding to depth in the network.

If a pretrained CNN has gone through extensive training, the representations its kernels have learned across millions of images suggest it has already picked up many generalizable features, and those seem to generalize to your dataset very well.

These could range from Gabor-like filters with small receptive fields near the surface of the CNN to more complex but still generalizable features deeper within.

It's possible, and likely, that pretraining went through extensive hyperparameter tuning (including curated weight initialization) that gave it routes to relatively better optima than yours.

It's possible that, with enough time, your implementation would reach that accuracy as well... but consider how long it takes to train effectively on millions of images! Even from the same starting weights, the pretrained model likely has a significant training advantage.

You have the right intuition that you'd expect a model trained on a restricted domain to do better on that domain... but oftentimes intuition in ML is backwards. Restricting the domain can let a network get away with weaker representations (especially in classification, which requires much less representational granularity than something like segmentation).

Pretraining on many images, however, enforces a more robust model: having to differentiate between many more classes demands stronger representations. Those stronger representations can make the difference on the edge-case samples that take you from 88% to 95%. Conversely, if weak representations already get you most of the way and there are no competing features and classes to force exploration of better optima, it's easy to fall into a poor local minimum.

I'm sure there are more possibilities we could theorize about... and I'm quite possibly wrong about some of the ones I suggested... but, as you'll discover with ML, things are more empirical than theoretical. To some (no matter how small) extent, the question "why does it work better?" can be answered with "it just does," lol.

Rereading your post: the key point is your observation about quickly reaching 95%. Again, the pretrained model has already learned features it can exploit and perhaps just has to reweight its linear classifier, for example.

Anyway, my main point is that, generally, the outcome you witnessed is not surprising, for the reasons I gave and possibly other reasons as well.

Oh, and keep in mind that not all hyperparameters are equal, which is to say not all training procedures are equal. Their training setup is very likely an important factor and an edge, even if all else were equal.

Model performance is predicted by 1/3 data quality/quantity, 1/3 parameter count, 1/6 neural architecture, 1/6 training procedures/hyperparameters, and of course 99/100 words of encouragement via the terminal.

17

tsgiannis OP t1_j4uu9q4 wrote

Thanks for the reply and I agree with you but...

Right now I am watching my model train... it has simply found a convergence point and is stuck around 86%+ training accuracy and 85%+ validation accuracy... and I have observed this behavior more than once... so I am just curious.

Anyway, probably the best answer is that it doesn't get enough features and it's stuck... because it's unable to make some crucial separations.

1

Buddy77777 t1_j4uvmzx wrote

If it has converged very flatly on validation, it has likely converged to a local minimum, possibly for the reasons I mentioned above... but you can also try adjusting hyperparameters, choosing curated weight initializations (not pretrained), data augmentation, and the plethora of techniques that fall into the broad category of adversarial training.
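As a rough illustration of two of those suggestions, here is a TF/Keras sketch of a curated (He) initialization and a simple on-the-fly augmentation pipeline; the layer sizes and augmentation strengths are made-up values, not anything from the thread:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Curated (but not pretrained) initialization: He-normal pairs well with ReLU.
conv = layers.Conv2D(64, 3, padding="same", activation="relu",
                     kernel_initializer="he_normal")

# Light augmentation applied on the fly; only active while training.
augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
    layers.RandomContrast(0.1),
])

inputs = layers.Input(shape=(224, 224, 3))
x = augment(inputs)
x = conv(x)
# ... the rest of the from-scratch architecture would follow here
```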

3

nibbajenkem t1_j4wii8d wrote

It's pretty simple. Deep neural networks are extremely underspecified by the data they train on (https://arxiv.org/abs/2011.03395). Less data means more underspecification and thus the model more readily gets stuck in local minima. More data means you can more easily avoid certain local minima. So the question then boils down to the transferability of the learned features across datasets. ImageNet pretraining generally works well because it's a diverse, large-scale dataset, which means models trained on it will by default avoid learning a lot of "silly" features.

14

tsgiannis OP t1_j4wk889 wrote

>Less data means more underspecification and thus the model more readily gets stuck in local minima

Probably this is the answer to my "why".

1

I_will_delete_myself t1_j4ylmkp wrote

He just said why. It's because the data you are training on isn't diverse and large. ImageNet covers many different kinds of objects (over a million images), while your toy dataset probably has only 50-100k.

2

ContributionWild5778 t1_j5g6sio wrote

This! I would just add that you can never pin down the exact reason why your training from scratch gives lower accuracy. Do you have enough data for all the neurons to learn the features? Can you compare the validation loss of your from-scratch model against the pretrained one? Did you try removing/adding a dense layer to check how the performance changes?
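That last experiment (varying head depth on the same frozen backbone) might look something like this sketch; the backbone choice, layer widths, and class count are all placeholders:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(num_dense_layers, num_classes=10):  # placeholder class count
    base = tf.keras.applications.EfficientNetB0(
        include_top=False, weights="imagenet", input_shape=(224, 224, 3))
    base.trainable = False  # same frozen backbone for every run
    head = [layers.GlobalAveragePooling2D()]
    for _ in range(num_dense_layers):
        head += [layers.Dense(128, activation="relu"), layers.Dropout(0.3)]
    head.append(layers.Dense(num_classes, activation="softmax"))
    return models.Sequential([base, *head])

# Train heads of different depths and compare the validation curves.
for depth in (0, 1, 2):
    model = build_model(depth)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(train_ds, validation_data=val_ds, epochs=10)
```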

1

loopuleasa t1_j4v0ve4 wrote

The pretrained model was trained with more resources and with better tuning than you were able to provide.

3

tsgiannis OP t1_j4v3ysi wrote

Now that's something to discuss..

>more resources

Now this is something well known...so skip it for now

>better tuning

This is the interesting info

What exactly do you mean by this... is it right to assume that all the papers that explain the architecture lack some fine details, or is it something else?

1

Present-Ad-8531 t1_j4vvb1l wrote

Transfer learning.

3

tsgiannis OP t1_j4vw83q wrote

I know and this is what I use but ....

Just picture this in your mind:
You want to classify, for example, sports cars... wouldn't it be reasonable to feed a model images of sports cars and let it learn from those, compared to images of frogs, butterflies, etc. (ImageNet)?

2

XecutionStyle t1_j4uu8r4 wrote

Are you fixing the weights of the earlier layers?

2

tsgiannis OP t1_j4uukd9 wrote

What exactly do you mean by "fixing weights"?

The pretrained model carries the weights from ImageNet, and that's all... if I unfreeze some layers it will gain some more accuracy.

But the "from scratch" model starts empty.

1

XecutionStyle t1_j4v0dui wrote

When you replace the top layer and train the model, are the previous layers allowed to change?

3

tsgiannis OP t1_j4v447x wrote

No changes to the pretrained model besides removing the top layer.

I am aware that unfreezing can cause either good or bad results.
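For reference, partial unfreezing is usually paired with a much smaller learning rate; here is a minimal sketch, again assuming a hypothetical EfficientNetB0 backbone and 10-class head rather than the actual setup:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Hypothetical setup: an ImageNet-pretrained backbone plus a small head.
base = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(10, activation="softmax"),  # illustrative 10-class head
])

# Partial unfreezing: keep most of the backbone frozen, let only the last
# ~20 layers adapt, and fine-tune with a small learning rate so the
# pretrained features are not destroyed. All numbers are illustrative.
base.trainable = True
for layer in base.layers[:-20]:
    layer.trainable = False

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)
```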

1

ruphan t1_j4winzi wrote

It is definitely possible. Let me give an analogy first. In the context of education, let's assume our pretrained model is a person with multiple STEM degrees in fields like neuroscience, math, etc., and let your model that's trained from scratch be someone with no degree yet. We have a limited amount of resources, like a couple of textbooks on deep learning. It's intuitive that the first person should not only pick up deep learning faster but also end up better than the latter, given their stronger fundamentals and experience.

To extend this analogy to your case, I believe the pretrained model must be quite big relative to the limited amount of new data that you have. The pretrained model would have developed a better set of filters that just couldn't be learned by a big model trained from scratch on a relatively small dataset. This is just like the analogy: it doesn't matter that neuroscience and math are not exactly deep learning; having strong fundamentals from pretraining on millions of images lets that model achieve better accuracy.

Maybe if you had a bigger fine-tuning dataset, this gap in accuracy would eventually diminish.

2

jsxgd t1_j4vruhl wrote

Are you saying that in the "from scratch" implementation you are training only on your own data? Or are you training the same architecture on the data used in pretraining plus your own data?

1

tsgiannis OP t1_j4vvxt3 wrote

By "from scratch" I mean I take the implementation of a model (pick any) from articles and GitHub pages, copy-paste it, and feed it my data.

There is always a big accuracy difference no matter what... at first I thought it was my mistake, because I always tinker with what I copy, but...

1

DrXaos t1_j4w9vav wrote

The dataset the pretrained model was trained on was likely enormously larger than yours, and that overcomes the distribution shift.

1

junetwentyfirst2020 t1_j4wrzut wrote

The way I like to think about this is that the algorithm has to model many things. If you’re trying to learn whether the image contains a dog or not, first you have to model natural imagery, correlations between features, and maybe even a little 2D-to-3D to simplify invariances. I’m speaking hypothetically here, because the underlying model is quite latent and hard to inspect.

If you train from scratch, you need to learn all of these tasks on a dataset that is likely much smaller than what's required to do so without overfitting. If you use a pretrained model, instead of learning all of those tasks, you have a model that only has to learn one additional thing from the same amount of data.

1