Submitted by Thijs-vW t3_yta05n in deeplearning

I have a neural network which was trained on some data. Now I am receiving additional samples, on which I would like to train the network. At the same time, I want to use the current model as a starting point, since training a new model from scratch may result in a drastically different weight matrix. What is the best approach to do this? Here are some of my thoughts:

  • Concatenate old and new data and train one epoch.
  • Train one epoch on new data only.

No matter which of these approaches I choose, the following problems will remain difficult to avoid:

  • Catastrophic forgetting.
  • Overfitting on new data.

What are some things I can do to avoid these problems? Is decreasing the learning rate enough?
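For concreteness, the first idea (concatenate old and new data, continue from the existing weights, and drop the learning rate) can be sketched framework-free with a toy linear model. Everything below is hypothetical illustration, not the actual network or data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "old" and "new" data drawn from the same underlying line y = 2x + 1.
X_old = rng.uniform(-1, 1, size=(200, 1))
y_old = 2 * X_old[:, 0] + 1 + rng.normal(0, 0.05, 200)
X_new = rng.uniform(-1, 1, size=(50, 1))
y_new = 2 * X_new[:, 0] + 1 + rng.normal(0, 0.05, 50)

def train(X, y, w, b, lr, epochs):
    """Plain gradient descent on mean squared error."""
    for _ in range(epochs):
        err = X[:, 0] * w + b - y
        w -= lr * 2 * np.mean(err * X[:, 0])
        b -= lr * 2 * np.mean(err)
    return w, b

# Initial training run on the old data only.
w, b = train(X_old, y_old, 0.0, 0.0, lr=0.5, epochs=200)

# Continued training: start from the existing weights, concatenate
# old + new data (rehearsal), and use a much smaller learning rate.
X_all = np.vstack([X_old, X_new])
y_all = np.concatenate([y_old, y_new])
w2, b2 = train(X_all, y_all, w, b, lr=0.05, epochs=20)

# Performance on the old data is preserved because it stays in the mix.
mse_old = np.mean((X_old[:, 0] * w2 + b2 - y_old) ** 2)
```

Keeping the old samples in the batch is what guards against catastrophic forgetting here; the reduced learning rate keeps the weights near their previous values.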

15

Comments


alcome1614 t1_iw35t9r wrote

First thing is to keep a copy of the already-trained neural network, so you can try whatever you want.
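In Keras this would just be `model.save(...)` before experimenting; the idea in framework-free form (weights stood in for by a plain dict, names hypothetical):

```python
import copy

# Hypothetical stand-in for a trained network: its weights in a dict.
trained_weights = {"layer1": [0.5, -1.2], "layer2": [0.3]}

# Snapshot before experimenting, so any experiment can be rolled back.
backup = copy.deepcopy(trained_weights)

# ...experiment freely, possibly ruining the weights...
trained_weights["layer1"][0] = 999.0

# Roll back if the experiment hurt performance.
trained_weights = copy.deepcopy(backup)
```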

10

RichardBJ1 t1_iw43b43 wrote

Well, transfer learning would be the thing I would expect people to say: freeze the top and bottom layers, re-load the old model weights and continue training….. but for me the best thing to do has always been to throw the old weights away, mix up the old and new training data sets and start again…. Sorry!!
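The freeze-and-continue idea can be sketched without any framework: re-load the old feature weights, keep them frozen, and take gradient steps only on the trainable head. Everything here is a toy stand-in (random features, made-up targets), not a real Keras model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy two-layer network: frozen feature layer W1, trainable head w2.
W1 = rng.normal(size=(4, 4))          # "re-loaded old weights", kept frozen
w2 = np.zeros(4)                      # head, trained on the data

X = rng.normal(size=(200, 4))
# Made-up targets that are expressible by the frozen features.
y = np.tanh(X @ W1) @ np.array([1.0, -0.5, 0.3, 0.8])

lr = 0.1
for _ in range(500):
    h = np.tanh(X @ W1)               # frozen features: W1 never updated
    err = h @ w2 - y
    w2 -= lr * 2 * (h.T @ err) / len(X)   # gradient step on the head only

mse = np.mean((np.tanh(X @ W1) @ w2 - y) ** 2)
```

In Keras the equivalent is setting `layer.trainable = False` on the layers you want to keep, then compiling and fitting as usual.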

3

MyHomeworkAteMyDog t1_iw48fy1 wrote

How about you mix old and new samples together, and only back-propagate the error on new samples while tracking the error on old samples? Observe whether training on new samples is hurting performance on old samples.
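A minimal sketch of that monitoring scheme, with a toy linear model and a deliberately shifted "new" task so the forgetting is visible (all data and numbers here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Old task: y = 2x. New samples come from a slightly shifted task y = 2x + 0.3.
X_old = rng.uniform(-1, 1, size=(200, 1))
y_old = 2 * X_old[:, 0]
X_new = rng.uniform(-1, 1, size=(50, 1))
y_new = 2 * X_new[:, 0] + 0.3

w, b = 2.0, 0.0        # pretend these came from the earlier training run
lr = 0.05
old_mse_history = []

for _ in range(100):
    # Update only on the new samples...
    err_new = X_new[:, 0] * w + b - y_new
    w -= lr * 2 * np.mean(err_new * X_new[:, 0])
    b -= lr * 2 * np.mean(err_new)
    # ...while monitoring (not training on) the old samples.
    old_mse_history.append(np.mean((X_old[:, 0] * w + b - y_old) ** 2))

# A rising curve means the new data is overwriting old behaviour.
forgetting = old_mse_history[-1] - old_mse_history[0]
```

Watching `old_mse_history` climb is exactly the signal that training on new samples is hurting old-sample performance.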

8

scitech_boom t1_iw4ck9z wrote

>Concatenate old and new data and train one epoch.

This is what I did in the past and it worked reasonably well for my cases. But is that the best? I don't know.

Anyhow, you cannot do this:

>Simultaneously, I do want to use this model as starting point,

Instead, pick the weights from 2 or 3 epochs before the best-performing one in the previous training run. That should be the starting point.

Training on top of something that has already hit the bottom won't help, even if we add more data.
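The bookkeeping for "restart from 2 epochs before the best" is simple if you checkpoint every epoch. A stdlib-only sketch (the training loop and U-shaped validation curve are simulated, not real):

```python
# Record per-epoch weights and validation losses during training.
checkpoints = []          # weights saved after every epoch
val_losses = []

w = 0.0
for epoch in range(30):
    w += 0.1                          # stand-in for one epoch of training
    val_loss = (w - 1.5) ** 2 + 0.01  # simulated U-shaped validation curve
    checkpoints.append(w)
    val_losses.append(val_loss)

best = val_losses.index(min(val_losses))
# Restart continued training from 2 epochs *before* the best epoch,
# rather than from the fully converged weights.
start_epoch = max(best - 2, 0)
w_start = checkpoints[start_epoch]
```

In Keras, `ModelCheckpoint(save_freq='epoch')` plus the recorded `history.history['val_loss']` gives you the same two lists.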

5

BugSlayerJohn t1_iw5kxgs wrote

If retraining the entire model on the complete data set is possible with nominal cost in less than a few days, do that. If not, it's worth trying transfer learning: https://www.tensorflow.org/guide/keras/transfer_learning

Note that transfer learning is a shortcut, you are almost certainly sacrificing some accuracy to avoid a prohibitive amount of retraining. You'll also still need to train the new layers against a data set that completely represents the results you want. I.E. if you train only on the new data, that's all it will know how to predict.

If you don't have the original data set, but do have abundant training resources and time, you could try a Siamese-like approach, where a suitable percentage of the training data fed to the new network is generated data with target values provided based on predictions from the current network, and the remaining data is the new data you would like the network to learn. This will probably work better when the new data is entirely novel.
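A toy version of that generated-data idea, with the "current network" stood in for by a known linear function and the new model fit in closed form (everything here is a hypothetical stand-in for the real networks):

```python
import numpy as np

rng = np.random.default_rng(4)

# The "current network": a toy model standing in for the old network.
def old_model(X):
    return 2.0 * X[:, 0] + 1.0

# New labelled data we actually want the network to learn.
X_new = rng.uniform(-1, 1, size=(50, 1))
y_new = 2.0 * X_new[:, 0] + 1.0 + rng.normal(0, 0.05, 50)

# Generated data: sample inputs and label them with the old model's
# predictions, anchoring the new network to the old behaviour.
X_gen = rng.uniform(-1, 1, size=(150, 1))
y_gen = old_model(X_gen)

# Mix: here 75% generated (old-behaviour) data, 25% genuinely new data.
X_mix = np.vstack([X_gen, X_new])
y_mix = np.concatenate([y_gen, y_new])

# Train a fresh model on the mixture (closed-form least squares here,
# where a real setup would run gradient descent on the new network).
A = np.hstack([X_mix, np.ones((len(X_mix), 1))])
w_fit, b_fit = np.linalg.lstsq(A, y_mix, rcond=None)[0]
```

The mixing percentage is the knob: more generated data preserves old behaviour, more new data lets the new patterns dominate.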

3

Thijs-vW OP t1_iw6ryeq wrote

Thanks for the advice. Unfortunately I do not think transfer learning is the best thing for me to do, considering:

>if you train only on the new data, that's all it will know how to predict.

Anyhow,

>If retraining the entire model on the complete data set is possible with nominal cost in less than a few days, do that.

This is indeed the case. However, if I retrain my entire model, it is very likely that the new model will make entirely different predictions due to its weight matrix not being identical. This is the problem I would like to avoid. Do you have any advice on that?

1

scitech_boom t1_iw6zpa9 wrote

There are multiple reasons. The main issue has to do with validation error. It usually follows a U-curve, with a minimum at some epoch. This is the point at which we usually stop the training (`early stopping`). Any further training, with or without new data, is only going to make the performance worse (I don't have a paper to cite for that).

I also started with the best model and that did not work. But when I took the model 2 epochs before the best model, it worked well. In my case(speech recognition), it was a nice balance between improvement and training time.

1

jobeta t1_iw6zxwa wrote

I don’t have much experience with that specific problem, but I would tend to think it’s hard to generalize like this to “models that hit the bottom” without knowing what the validation loss actually looked like and what the new data looks like. Chances are, this data is not perfectly sampled from the first dataset and the features have some idiosyncratic/new statistical properties. In that case, by feeding them in some way to your pre-trained model, the model loss is mechanically no longer in the minimum it supposedly reached in the first training run.

1

RichardBJ1 t1_iw71rpv wrote

Good question; I do not have a source for that, I have just heard colleagues saying it. Obviously the reason for freezing layers is that we are trying to avoid losing all the information we have already gained, and it should speed up further training by reducing parameter numbers etc. As to WHICH layers are actually best preserved, I don’t know; when I have read on it, people typically say “it depends”. But actually my point was that I have never found transfer learning to be terribly effective (apart from years ago when I ran a specific transfer-learning tutorial!). My models only take a few days to train from scratch, and so that is what I do! Transfer learning obviously makes enormous sense if you are working with someone else’s extravagantly trained model and you maybe don’t even have the data. But in my case I always do have all the data…

1

jobeta t1_iw7228e wrote

It seems intuitive that, if possible, fully retraining will yield the best results, but it can be costly. I just find it surprising to arbitrarily freeze two layers. What if your model only has two layers anyway? Again, I don’t have experience, so just guessing.

2

RichardBJ1 t1_iw733qt wrote

Yes …obviously freezing the only two layers would be asinine! There is a Keras blog on it; I do not know why it picks particular layers (TL;DR). It doesn’t say top and bottom, that’s for sure. …I agree it would be nice to have a method in the choice of layers to freeze rather than doing it arbitrarily. I guess visualising layer output might help choose if it’s a small model, but I’ve never tried that. So I do have experience of trying transfer learning, but (apart from tutorials) no experience of success with it!

1

ContributionWild5778 t1_iw97xid wrote

I believe that it is an iterative process when doing transfer learning. First, you will always freeze the top layers, because low-level feature extraction is done there (extracting lines and contours). Then unfreeze the last layers, where high-level features are extracted, and try to train only those. At the same time, it also depends on how different the new dataset you are training the model on is. If it has similar characteristics/features, freezing the top layers would be my choice.

1

ContributionWild5778 t1_iw98mo2 wrote

If you want to re-train the whole model on the mixed dataset, the only option I can think of is transfer-learning style: initialise all the parameters with the values learned on the old dataset and re-train from the 0th epoch.

1

BugSlayerJohn t1_iwa32dc wrote

First of all, you don't want an identical or nearly identical weight matrix. You won't achieve that and you don't need to. In principle a well designed model should NOT make radically different predictions when retrained, particularly with the same data, even though the weight matrices will certainly differ at least a little and possibly a lot. The same model trained two different times on the same data with the same hyperparameters will generally converge to nearly identical behaviors, right down to which types of inputs the final model struggles with. If you have the original model, original data, and original hyperparameters, definitely don't be frightened to retrain a model.

If your use case requires you to be able to strongly reason about similarity of inference, you could filter your holdout set for the inputs that both models should accurately predict, run inference for that set against both models, and prepare a small report indicating the similarity of predictions. This should ordinarily be unnecessary, but since it sounds like achieving this similarity is a point of concern, this would allow you to measure it, if for no other purpose than to assuage fears. You should likely expect SOME drift in similarity, the different versions won't be identical, so if the similarity is not as high as you like consider manually reviewing a list of inputs that the two models gave different predictions for to confirm the rate at which the difference really is undesirable.
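A sketch of such a similarity report, with the two models represented only by their predictions on a shared holdout set (the predictions and the 3% disagreement rate are fabricated for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)

# Two hypothetical retrained classifiers, represented by their
# predictions on the same 1000-example holdout set.
holdout_labels = rng.integers(0, 2, size=1000)
preds_a = holdout_labels.copy()
preds_b = holdout_labels.copy()
preds_b[:30] = 1 - preds_b[:30]       # model B disagrees on 3% of inputs

# Measure how often the two models agree, and collect the
# disagreements for manual review.
agreement = np.mean(preds_a == preds_b)
disagree_idx = np.flatnonzero(preds_a != preds_b)
```

Reviewing the inputs at `disagree_idx` by hand tells you how often the drift between versions is actually undesirable, rather than two equally acceptable answers.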

1

PredictorX1 t1_iwiflod wrote

Assuming that you will not change the network architecture, I suggest concatenating both data sets, starting training from the existing weights, and training as long as necessary. I would suggest re-examining the size of the hidden layer, though (which implies starting training over from scratch).

1