Submitted by shingekichan1996 t3_10ky2oh in MachineLearning
koolaidman123 t1_j5uk2ai wrote
Cache your predictions from each smaller batch (together with their labels) until you've collected the target batch size, then run your loss function.
So instead of calculating a loss per small batch and accumulating, as in gradient accumulation, you only calculate the loss once you've reached the target batch size.
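A minimal sketch of what that could look like in PyTorch for a dual-encoder with in-batch negatives (the `encoder`, `small_batches`, InfoNCE-style loss, and all names here are illustrative assumptions, not something spelled out in the comment). Note that, written this naively, it still keeps every chunk's activations alive until `backward()`, which is exactly what the follow-up question below raises.

```python
import torch
import torch.nn.functional as F

def cached_contrastive_step(encoder, optimizer, small_batches, temperature=0.07):
    """small_batches: a list of (queries, positives) chunks that together
    form one large 'virtual' batch."""
    q_emb, p_emb = [], []
    for q, p in small_batches:
        # Forward each small chunk; keep only the output embeddings.
        q_emb.append(encoder(q))
        p_emb.append(encoder(p))
    Q = torch.cat(q_emb)   # (N, d), N = full virtual batch size
    P = torch.cat(p_emb)   # (N, d)

    # In-batch negatives over the *full* virtual batch: query i's positive is
    # passage i; every other passage in the big batch acts as a negative.
    logits = Q @ P.T / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    loss = F.cross_entropy(logits, labels)

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```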
rapist1 t1_j5xmv9n wrote
How do you implement the caching? You have to cache all the activations to do the backward pass.
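The thread doesn't answer this directly. One known way around it is a two-pass "gradient cache" trick (as in Gao et al.'s GradCache): forward each chunk without building a graph, compute the big-batch loss on the detached embeddings to get d(loss)/d(embedding), then re-run each chunk with grad enabled and backprop those cached gradients. A sketch under the same illustrative assumptions as above:

```python
import torch
import torch.nn.functional as F

def grad_cache_step(encoder, optimizer, small_batches, temperature=0.07):
    # Pass 1: embeddings only, no autograd graph, so activation memory stays
    # bounded by a single chunk.
    with torch.no_grad():
        q_emb = [encoder(q) for q, _ in small_batches]
        p_emb = [encoder(p) for _, p in small_batches]
    q_emb = [e.requires_grad_() for e in q_emb]   # leaf tensors we can grad w.r.t.
    p_emb = [e.requires_grad_() for e in p_emb]

    # Loss over the full virtual batch of in-batch negatives.
    Q, P = torch.cat(q_emb), torch.cat(p_emb)
    logits = Q @ P.T / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    loss = F.cross_entropy(logits, labels)
    loss.backward()   # fills .grad on the cached embeddings; encoder is untouched

    # Pass 2: recompute each chunk with a graph and push the cached embedding
    # gradients through the encoder (parameter gradients accumulate across chunks).
    for (q, p), qe, pe in zip(small_batches, q_emb, p_emb):
        encoder(q).backward(gradient=qe.grad)
        encoder(p).backward(gradient=pe.grad)

    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

One practical caveat: stochastic layers like dropout have to behave identically in both passes (e.g. by reusing the RNG state), otherwise the cached embedding gradients no longer correspond to the recomputed forward.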
[deleted] t1_j5w9rbv wrote
[deleted]
koolaidman123 t1_j5wbk37 wrote
That's not the same thing...
Gradient accumulation calculates the loss on each small batch. That doesn't work with in-batch negatives, because you need to compare inputs from batch 1 against inputs from batch 2; hence offloading and caching the predictions, then calculating the loss as a single large batch.
That's why gradient accumulation doesn't simulate large batch sizes for contrastive learning, if you're familiar with it.
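For contrast, here is what plain gradient accumulation on the same pair data would look like (same illustrative names as the sketches above): each chunk computes its own loss, so a query is only ever contrasted against the positives inside its own chunk, and the negative pool never grows past the chunk size no matter how many chunks are accumulated, which is the point being made.

```python
import torch
import torch.nn.functional as F

def grad_accumulation_step(encoder, optimizer, small_batches, temperature=0.07):
    # Each chunk gets its own in-batch-negative loss: query i only sees the
    # negatives inside its own chunk, regardless of how many chunks accumulate.
    for q, p in small_batches:
        logits = encoder(q) @ encoder(p).T / temperature   # (chunk, chunk)
        labels = torch.arange(logits.size(0), device=logits.device)
        (F.cross_entropy(logits, labels) / len(small_batches)).backward()
    optimizer.step()
    optimizer.zero_grad()
```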