Submitted by twocupv60 t3_zbkvd2 in MachineLearning

I want to train an ensemble of 50 networks where each network has the same architecture. The input is an image and the output is a scalar; a simple binary classifier. Are the following mathematically equivalent:

  1. Train 50 models independently and average their results for the final ensemble model to use during inference. Logistically, I train 50 models.
  2. Create a super model composed of the 50 models where the top neuron is the average of all the individual models' outputs. Thus, I train all 50 models at once implicitly. Logistically, I train one model.

My initial thought is that these are equivalent since I am taking the mean of the prediction probabilities, so the backpropagation isn't aware of the other models. However, I could see the credit assignment in scheme 2 essentially changing the learning rate, because instead of all the error going to a single model as in scheme 1, it is now distributed over all 50 models.
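
To make the comparison concrete, here is a minimal PyTorch sketch. The tiny model, random data, and squared-error loss on the averaged probability are placeholders for illustration, not my actual setup; it just computes each member's gradients on one batch under both schemes:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    N = 5                                    # 50 in the real ensemble; 5 keeps the demo fast
    x = torch.randn(8, 16)                   # stand-in for a batch of image features
    y = torch.randint(0, 2, (8, 1)).float()  # binary labels

    def make_model():
        # placeholder architecture, not the actual image classifier
        return nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())

    models = [make_model() for _ in range(N)]

    # Scheme 1: each model has its own loss and receives its own full gradient
    for m in models:
        ((y - m(x)) ** 2).mean().backward()
    grads_separate = [m[0].weight.grad.clone() for m in models]
    for m in models:
        m.zero_grad()

    # Scheme 2: one loss on the averaged output of the "super model"
    y_hat = torch.stack([m(x) for m in models]).mean(dim=0)
    ((y - y_hat) ** 2).mean().backward()
    grads_joint = [m[0].weight.grad.clone() for m in models]

    # In scheme 2 each member's gradient is scaled by roughly 1/N and is driven by
    # the ensemble residual (y - y_hat) rather than that member's own residual.
    for g1, g2 in zip(grads_separate, grads_joint):
        print(g1.norm().item(), g2.norm().item())

With identical initializations the scheme-2 gradients come out exactly 1/N of the scheme-1 ones; with independent initializations they also point in somewhat different directions.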

5

Comments


Thakshu t1_iysghrm wrote

If I understand correctly, the question is whether training N classifiers independently and averaging their outputs is mathematically equivalent to training N classifiers jointly with a mean output.

To me they do not appear to be mathematically equivalent. (Edited a wrong statement here.)

In the second case, the gradient at each backprop step is computed from the mean output of all classifiers. So the loss values will be smoother than in the first case, assuming the models are independently initialized.

Do I have a thinking mistake? I can't identify it yet.

4

twocupv60 OP t1_iysmgyv wrote

The loss is (y - y_hat)^2 where y_hat is mean(y_1, ..., y_n). So the error is divided up among y_1, ..., y_n based on how much each contributes to the error in y_hat. If the models are trained separately, the full error for y is backpropagated through each model. If the models are trained together, one model might carry a lot of the error, which changes the share assigned to the rest and, I believe, effectively lowers the learning rate. Is this what you mean by "loss values will be smoother"?
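
Writing the derivative out explicitly (same notation as above, to check my reasoning):

    # joint training: one loss on the averaged output
    L = (y - y_hat)^2,  where y_hat = (y_1 + ... + y_n) / n
    dL/dy_i = -2 * (y - y_hat) * (1/n)

    # separate training: each model has its own loss
    L_i = (y - y_i)^2
    dL_i/dy_i = -2 * (y - y_i)

So in the joint setup each model sees the shared residual (y - y_hat) scaled by 1/n, instead of its own residual (y - y_i) at full strength, which is both the effective learning-rate change and the reason the two schemes would not be strictly equivalent.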

Is there a mistake here?

1

Thakshu t1_iysr30r wrote

I think you are right here. But mathematical equivalence bothers me. Since they end up with dissimilar parameters, are they equivalent?

1

MrsBotHigh t1_iyt03uj wrote

It is not the same because of the nonlinearity.

1