Submitted by shingekichan1996 t3_10ky2oh in MachineLearning
shingekichan1996 OP t1_j5u22zn wrote
Reply to comment by mgwizdala in [D] Self-Supervised Contrastive Approaches that don’t use large batch size. by shingekichan1996
Curious about this; I haven't read any related papers. What is its effect on performance (accuracy, etc.)?
mgwizdala t1_j5u2mgr wrote
It depends on the implementation. Naive gradient accumulation will probably give better results than small batches, but as u/RaptorDotCpp mentioned, if you rely on many negative samples inside one batch, it will still be worse than large-batch training.
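To make the limitation concrete, here is a minimal sketch (my own, not from any paper) of naive gradient accumulation with an InfoNCE-style loss in PyTorch. The gradients add up across micro-batches, but each loss term only ever sees the negatives inside its own micro-batch, so the effective pool of negatives stays small. Names like `info_nce` and `train_step` are just illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    # z1, z2: (B, D) embeddings of two augmented views; negatives are in-batch only.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)

def train_step(model, optimizer, micro_batches):
    # micro_batches: list of (view1, view2) tensors, each a small micro-batch.
    optimizer.zero_grad()
    for view1, view2 in micro_batches:
        loss = info_nce(model(view1), model(view2)) / len(micro_batches)
        loss.backward()   # gradients accumulate across micro-batches,
                          # but each loss only saw its own micro-batch's negatives
    optimizer.step()
```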
There is also a cool paper about gradient caching, which largely solves this issue, but again at an extra cost in training speed (see the sketch below). https://arxiv.org/pdf/2101.06983v2.pdf
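Here is a minimal sketch of the gradient-caching idea as I understand it, not the authors' code: embed the whole batch without building the autograd graph, compute the full-batch contrastive loss on detached embeddings to cache the gradients with respect to each embedding, then re-encode chunk by chunk and replay those cached gradients through the encoder. The function name `grad_cache_step`, the chunk size, and the single-direction loss are my simplifications.

```python
import torch
import torch.nn.functional as F

def grad_cache_step(encoder, optimizer, views1, views2, chunk_size=32, temperature=0.1):
    optimizer.zero_grad()

    # Pass 1: embed the whole batch in small chunks without storing activations.
    with torch.no_grad():
        z1 = torch.cat([encoder(c) for c in views1.split(chunk_size)])
        z2 = torch.cat([encoder(c) for c in views2.split(chunk_size)])

    # Full-batch InfoNCE on leaf copies of the embeddings; cache d(loss)/d(embedding).
    z1_leaf = z1.clone().requires_grad_()
    z2_leaf = z2.clone().requires_grad_()
    logits = F.normalize(z1_leaf, dim=1) @ F.normalize(z2_leaf, dim=1).t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    loss = F.cross_entropy(logits, labels)
    loss.backward()
    g1, g2 = z1_leaf.grad, z2_leaf.grad

    # Pass 2: re-encode each chunk with grad enabled and backprop the cached
    # embedding gradients through the encoder parameters.
    for views, grads in ((views1, g1), (views2, g2)):
        for chunk, grad_chunk in zip(views.split(chunk_size), grads.split(chunk_size)):
            encoder(chunk).backward(gradient=grad_chunk)

    optimizer.step()
    return loss.item()
```

The point is that the loss still sees every negative in the large batch, while only one small chunk's activations live in memory at a time; the price is encoding everything twice.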
shingekichan1996 OP t1_j5u40dx wrote
Exactly the paper I need to read, thanks!