mgwizdala t1_j5u2mgr wrote

It depends on the implementation. Naive gradient accumulation will probably give better results than small batches, but as u/RaptorDotCpp mentioned, if you rely on many negative samples inside one batch, it will still be worse than true large-batch training.
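
For reference, naive accumulation looks something like this (a rough PyTorch sketch; `model`, `loss_fn`, `optimizer` and the micro-batch loader are just placeholders):

```python
import torch

accumulation_steps = 8  # e.g. 8 micro-batches stand in for one large batch

optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batch_loader):
    # scale so the accumulated sum matches the large-batch mean
    loss = loss_fn(model(x), y) / accumulation_steps
    loss.backward()  # gradients accumulate in param.grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

The catch for contrastive losses is that each micro-batch's loss only contrasts against the negatives inside that micro-batch, so accumulating gradients does not recover the large-batch objective.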

There is also a cool paper about gradient caching, which addresses this issue, although at the cost of extra training time: https://arxiv.org/pdf/2101.06983v2.pdf
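
The core idea, very roughly (my own simplification, not the authors' code; `encoder`, `contrastive_loss` and the chunking are placeholders):

```python
import torch

def grad_cache_step(encoder, chunks, contrastive_loss, optimizer):
    # 1) Representation pass without building the encoder graph.
    with torch.no_grad():
        reps = [encoder(c) for c in chunks]

    # 2) Full-batch loss over all representations; cache d(loss)/d(rep).
    reps_leaf = [r.detach().requires_grad_(True) for r in reps]
    loss = contrastive_loss(torch.cat(reps_leaf))
    loss.backward()
    cached_grads = [r.grad for r in reps_leaf]

    # 3) Re-encode chunk by chunk with grad enabled and inject the cached
    #    gradients, so only one chunk's activations live in memory at a time.
    optimizer.zero_grad()
    for c, g in zip(chunks, cached_grads):
        r = encoder(c)
        r.backward(gradient=g)
    optimizer.step()
    return loss.item()
```

The second encoder pass per chunk is where the extra training time comes from.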
