Submitted by shingekichan1996 t3_10ky2oh in MachineLearning
koolaidman123 t1_j5wbk37 wrote
Reply to comment by [deleted] in [D] Self-Supervised Contrastive Approaches that don’t use large batch size. by shingekichan1996
That's not the same thing...
Gradient accumulation computes the loss on each micro-batch independently, so it doesn't work with in-batch negatives: you need to compare inputs from batch 1 against inputs from batch 2. That's why you have to offload and cache the predictions, then calculate the loss as if it were one big batch.

That's why gradient accumulation doesn't simulate large batch sizes for contrastive learning, if you're familiar with it.
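Here's a minimal sketch of the cache-then-one-big-loss idea (in the spirit of GradCache, Gao et al. 2021), assuming a plain PyTorch encoder and a symmetric InfoNCE loss over paired views. `encoder`, `micro_batches`, and the temperature are hypothetical placeholders, not anything from a specific library:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    # In-batch negatives: every other sample in the FULL batch is a negative.
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)

def cached_contrastive_step(encoder, optimizer, micro_batches):
    # Pass 1: embed every micro-batch WITHOUT building a graph, and cache.
    with torch.no_grad():
        cache1 = [encoder(x1) for x1, _ in micro_batches]
        cache2 = [encoder(x2) for _, x2 in micro_batches]

    # Compute the loss over the FULL batch, so every sample sees B-1 negatives
    # instead of only the (B/K - 1) it would see inside one micro-batch.
    z1 = torch.cat(cache1).requires_grad_()
    z2 = torch.cat(cache2).requires_grad_()
    loss = info_nce(z1, z2)
    loss.backward()                              # grads w.r.t. embeddings only

    # Pass 2: re-encode each micro-batch WITH a graph and push the cached
    # embedding gradients back through the encoder, one chunk at a time.
    g1 = z1.grad.split([c.size(0) for c in cache1])
    g2 = z2.grad.split([c.size(0) for c in cache2])
    optimizer.zero_grad()
    for (x1, x2), grad1, grad2 in zip(micro_batches, g1, g2):
        encoder(x1).backward(gradient=grad1)
        encoder(x2).backward(gradient=grad2)
    optimizer.step()
    return loss.item()
```

The point of the two passes: the loss still sees all B samples (full set of negatives), but activations for only one micro-batch live in memory at a time, at the cost of a second forward pass. Plain gradient accumulation can't do this because each micro-batch's loss never sees the other micro-batches' embeddings.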