ResponsibilityNo7189 t1_j1ulsd1 wrote on December 27, 2022 at 2:56 PM

That is why you have hundreds of millions of parameters in a network. There are so many ways for the weights to move that it's not a zero-sum game: some directions will not be so detrimental to other examples. It's precisely for this reason that self-supervised methods tend to work best on very deep networks. See "Scaling Vision Transformers".
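
A toy sketch of that intuition, not a real training run. It assumes random Gaussian vectors as stand-ins for the gradients of two different examples (real per-example gradients are correlated, so treat this as illustrative only). The helper name `mean_abs_cosine` is made up for this example. It measures how much two such "gradients" overlap as the number of dimensions grows:

```python
import math
import random

def mean_abs_cosine(dim, pairs=200, seed=0):
    """Average |cosine similarity| between pairs of random Gaussian vectors.

    Assumption: each random vector stands in for one example's gradient.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(pairs):
        a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        b = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        total += abs(dot / (norm_a * norm_b))
    return total / pairs

low = mean_abs_cosine(dim=10)     # few "parameters"
high = mean_abs_cosine(dim=2000)  # many "parameters"
print(f"dim=10: {low:.3f}, dim=2000: {high:.3f}")
```

The overlap shrinks roughly like 1/sqrt(dim), so in this toy setting a step that helps one example barely moves the loss on another once there are enough directions to choose from.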

derpderp3200 OP t1_j1vgi23 wrote on December 27, 2022 at 6:26 PM

I assume this is the case early in training, but eventually the training process needs to "compress" information so that a given parameter handles more than one very specific case, at which point it'll be subject to this phenomenon again: any dog example will want "not dog" neurons inactive, and any dog example will want neurons contributing to the classification of other classes inactive.

Sure, statistically you're still descending the slope toward a network that's good at each class, but this is only the case when your classes, and thus the "pull effects", are balanced; it is not an intrinsic ability of the network to extract differentiating features.
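
To make the "pull effects" concrete, here is a minimal sketch that looks only at the output logits. It assumes softmax cross-entropy, where the gradient with respect to the logits is `p - onehot(y)`, and a network that currently predicts every class with equal probability. The function name `mean_logit_gradient` and the batch compositions are made up for illustration, and it says nothing about what the hidden layers learn:

```python
def mean_logit_gradient(labels, num_classes):
    """Batch-mean gradient of softmax cross-entropy w.r.t. the logits.

    Assumption: the model outputs a uniform distribution, p_k = 1 / num_classes,
    so each example contributes p - onehot(y).
    """
    p = 1.0 / num_classes
    grad = [0.0] * num_classes
    for y in labels:
        for k in range(num_classes):
            grad[k] += p - (1.0 if k == y else 0.0)
    return [g / len(labels) for g in grad]

# 10 examples of each class: the per-example pulls cancel out.
balanced = mean_logit_gradient([0, 1, 2] * 10, num_classes=3)

# 24 examples of class 0, 3 each of classes 1 and 2: class 0 dominates.
imbalanced = mean_logit_gradient([0] * 24 + [1] * 3 + [2] * 3, num_classes=3)

print("balanced:  ", [round(g, 3) for g in balanced])
print("imbalanced:", [round(g, 3) for g in imbalanced])
```

With the balanced batch the mean gradient on every logit is essentially zero, while the imbalanced batch leaves a negative gradient on class 0 (gradient descent raises that logit) and positive gradients on the others (their logits get pushed down).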