mgwizdala t1_j5tyf1f wrote

If you are willing to trade time for batch size, you can try gradient accumulation.

8
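
For anyone unfamiliar with the trick, here is a minimal PyTorch sketch of plain gradient accumulation; the model, data, and hyperparameters are placeholders, not anything from the thread:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy data and model just to make the sketch runnable; swap in your own.
data = TensorDataset(torch.randn(256, 128), torch.randint(0, 10, (256,)))
loader = DataLoader(data, batch_size=8)          # micro-batches that fit in memory
model = nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()

accum_steps = 8                                  # effective batch size = 8 * 8 = 64

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = criterion(model(x), y) / accum_steps  # scale so summed grads match a big-batch mean
    loss.backward()                              # gradients accumulate in param.grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                         # one update per "virtual" large batch
        optimizer.zero_grad()
```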

RaptorDotCpp t1_j5u0yxq wrote

Gradient accumulation is tricky for contrastive methods that rely on having lots of negatives in a batch.

13

altmly t1_j5uglpx wrote

I'm confused. Gradient accumulation is exactly equivalent to batching as long as the data is the same, unless you use things like batch norm (you shouldn't).

1

Paedor t1_j5ur6tx wrote

The trouble is that contrastive methods often compare elements from the same batch, instead of treating elements as independent like pretty much all other ML (except batchnorm).

As a simple example with a really weird version of contrastive learning: with a batch of 2N, contrastive learning might use the 4N^2 distances between batch elements to calculate a loss, while with two accumulated batches of N, contrastive learning could only use 2N^2 pairs for loss.

11
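
A toy sketch of that counting argument, using cosine similarities as a stand-in for whatever distance the contrastive loss actually uses:

```python
import torch
import torch.nn.functional as F

N = 64
emb = F.normalize(torch.randn(2 * N, 256), dim=-1)   # one batch of 2N embeddings

# Full batch: every element is compared with every element.
full_sim = emb @ emb.T                                # (2N, 2N) -> 4N^2 similarities
print(full_sim.numel())                               # 16384 = 4 * 64**2

# Two accumulated micro-batches of N: comparisons only happen within each micro-batch.
first, second = emb[:N], emb[N:]
accum_sims = (first @ first.T, second @ second.T)     # two (N, N) blocks -> 2N^2 similarities
print(sum(s.numel() for s in accum_sims))             # 8192 = 2 * 64**2
```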

satireplusplus t1_j5v24u2 wrote

If you don't have 8 GPUs, you can always run the same computation 8x in series on one GPU and then merge the results the same way the parallel implementation would. In most cases that ends up being a form of gradient accumulation. Think of it this way: you compute your distances on a subset of the batch, but since there are far fewer pairs of distances, the gradient is noisier. So you run it a few times and average the results to get an approximation of the real thing. Very likely that's what the parallel implementation does too.

1
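
To make the "noisy approximation" point concrete, here is a toy comparison with a made-up InfoNCE-style loss and random embeddings; the averaged sub-batch loss only sees a fraction of the negatives, so it approximates the full-batch loss rather than reproducing it:

```python
import torch
import torch.nn.functional as F

def info_nce(q, k, temperature=0.1):
    # Toy InfoNCE: the i-th key is the i-th query's positive; all other keys in the batch are negatives.
    logits = q @ k.T / temperature
    return F.cross_entropy(logits, torch.arange(q.size(0)))

q = F.normalize(torch.randn(64, 128), dim=-1)
k = F.normalize(torch.randn(64, 128), dim=-1)

full_loss = info_nce(q, k)                                  # 63 negatives per query

# "Serial" version: 8 sub-batches of 8, then average -- only 7 negatives per query.
sub_losses = [info_nce(q[i:i + 8], k[i:i + 8]) for i in range(0, 64, 8)]
approx_loss = torch.stack(sub_losses).mean()

print(full_loss.item(), approx_loss.item())                 # generally not equal
```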

koolaidman123 t1_j5ujfpv wrote

Contrastive methods require in-batch negatives; you can't replicate that with grad accumulation.

7

shingekichan1996 OP t1_j5u22zn wrote

Curious about this, I have not read any related paper. What is its effect on performance (accuracy, etc.)?

1

mgwizdala t1_j5u2mgr wrote

It depends on the implementation. Naive gradient accumulation will probably give better results than small batches, but as u/RaptorDotCpp mentioned, if you rely on many negative samples inside one batch, it will still be worse than large-batch training.

There is also a cool paper about gradient caching, which somehow solves this issue, but again with an additional penalty on training speed. https://arxiv.org/pdf/2101.06983v2.pdf

1
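
For the mechanics, here is a stripped-down sketch of the gradient-caching idea from that paper; the encoder, loss, and sizes are placeholders, and the real implementation handles more detail than this:

```python
import torch
import torch.nn.functional as F
from torch import nn

def info_nce(q, k, temperature=0.1):
    # Full-batch contrastive loss over all cached embeddings.
    logits = q @ k.T / temperature
    return F.cross_entropy(logits, torch.arange(q.size(0), device=q.device))

encoder = nn.Linear(128, 64)                    # placeholder encoder
optimizer = torch.optim.SGD(encoder.parameters(), lr=1e-2)

big_q = torch.randn(256, 128)                   # a "large batch" of query/key inputs
big_k = torch.randn(256, 128)
chunk = 32                                      # micro-batch size that fits in memory

# 1) Cheap forward pass with no graph, just to collect every embedding.
with torch.no_grad():
    zq = torch.cat([encoder(c) for c in big_q.split(chunk)])
    zk = torch.cat([encoder(c) for c in big_k.split(chunk)])

# 2) Full-batch loss on the cached embeddings; cache d(loss)/d(embedding).
zq, zk = zq.requires_grad_(), zk.requires_grad_()
loss = info_nce(F.normalize(zq, dim=-1), F.normalize(zk, dim=-1))
grad_zq, grad_zk = torch.autograd.grad(loss, (zq, zk))

# 3) Re-encode chunk by chunk with the graph on, injecting the cached gradients,
#    so encoder gradients accumulate as if the whole batch were backpropped at once.
optimizer.zero_grad()
for cq, ck, gq, gk in zip(big_q.split(chunk), big_k.split(chunk),
                          grad_zq.split(chunk), grad_zk.split(chunk)):
    encoder(cq).backward(gq)
    encoder(ck).backward(gk)
optimizer.step()
```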