
koolaidman123 t1_j5wbk37 wrote

That's not the same thing...

Gradient accumulation computes the loss on each micro-batch separately. That doesn't work with in-batch negatives, because you need to compare inputs from batch 1 against inputs from batch 2; hence offloading and caching the predictions, then calculating the loss over all of them as one batch.

That's why gradient accumulation doesn't work to simulate large batch sizes for contrastive learning, if you're familiar with it. A rough sketch below.
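A minimal sketch of the difference, assuming a toy linear encoder and an InfoNCE-style in-batch-negatives loss (the names `info_nce`, the shapes, and the cached second pass are illustrative, not any particular library's API):

```python
import torch
import torch.nn.functional as F

def info_nce(q, d, temperature=0.05):
    """In-batch negatives: each query's positive is the same-index doc,
    every other doc in the batch acts as a negative."""
    logits = q @ d.t() / temperature                  # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

encoder = torch.nn.Linear(16, 8)                      # stand-in for a real encoder
# 4 micro-batches of (query, doc) pairs, 4 pairs each -> effective batch of 16
micro_batches = [(torch.randn(4, 16), torch.randn(4, 16)) for _ in range(4)]

# 1) Naive gradient accumulation: each micro-batch only sees 3 negatives,
#    so summing these losses is NOT a true batch of 16.
loss_accum = 0.0
for xq, xd in micro_batches:
    q = F.normalize(encoder(xq), dim=-1)
    d = F.normalize(encoder(xd), dim=-1)
    loss_accum = loss_accum + info_nce(q, d) / len(micro_batches)

# 2) Cache the embeddings first, then compute one loss over all 16 pairs so
#    every query is scored against 15 negatives. (To actually train this way
#    you need a second pass to push gradients back through the encoder,
#    GradCache-style; omitted here.)
with torch.no_grad():
    cached_q = torch.cat([F.normalize(encoder(xq), dim=-1) for xq, _ in micro_batches])
    cached_d = torch.cat([F.normalize(encoder(xd), dim=-1) for _, xd in micro_batches])
loss_full = info_nce(cached_q, cached_d)

print(loss_accum.item(), loss_full.item())            # the two losses differ
```

The point is just that the accumulated loss and the full-batch loss are different objectives: accumulation changes how many negatives each example sees, which is exactly what you don't want in contrastive training.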
