
koolaidman123 t1_j5uk2ai wrote

cache your predictions from each smaller batch along with their labels until you reach the target batch size, then run your loss function

so instead of calculating the loss and accumulating gradients like gradient accumulation does, you only calculate the loss once you reach the target batch size
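
A minimal sketch of what this looks like, assuming PyTorch, an `encoder` model, and an InfoNCE-style in-batch-negatives loss (the names, chunk size, and temperature are illustrative, not from the comment):

```python
import torch
import torch.nn.functional as F

def embed_in_chunks(encoder, big_batch, chunk_size):
    # Forward each small chunk and cache its embeddings.
    cached = [encoder(chunk) for chunk in big_batch.split(chunk_size)]
    return torch.cat(cached, dim=0)  # embeddings for the full target batch

def in_batch_contrastive_loss(query_emb, doc_emb, temperature=0.05):
    # Every other document in the big batch acts as a negative.
    logits = query_emb @ doc_emb.T / temperature
    labels = torch.arange(query_emb.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# The loss is computed once over the cached full batch, not per small chunk:
# q_emb = embed_in_chunks(encoder, queries, chunk_size=32)
# d_emb = embed_in_chunks(encoder, docs, chunk_size=32)
# loss = in_batch_contrastive_loss(q_emb, d_emb)
# loss.backward()  # keeps every chunk's activations alive (see the question below)
```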

10

rapist1 t1_j5xmv9n wrote

How do you implement the caching? You have to cache all the activations to do the backwards pass

3

[deleted] t1_j5w9rbv wrote

[deleted]

−8

koolaidman123 t1_j5wbk37 wrote

That's not the same thing...

Gradient accumulation calculates the loss on each sub-batch, so it doesn't work with in-batch negatives: you need to compare inputs from batch 1 against inputs from batch 2, hence offloading and caching the predictions, then calculating the loss over the one big batch

That's why gradient accumulation doesn't work to simulate large batch sizes for contrastive learning, if you're familiar with it
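
A rough sketch of the difference being described, with illustrative names and sizes assumed (PyTorch, InfoNCE-style loss, chunk size 32, target batch 128):

```python
import torch
import torch.nn.functional as F

def info_nce(q, d, temperature=0.05):
    logits = q @ d.T / temperature
    return F.cross_entropy(logits, torch.arange(q.size(0), device=q.device))

# Gradient accumulation: each sub-batch's loss only sees that sub-batch's
# negatives, so 4 chunks of 32 is NOT equivalent to one batch of 128.
# for q_chunk, d_chunk in zip(queries.split(32), docs.split(32)):
#     (info_nce(encoder(q_chunk), encoder(d_chunk)) / 4).backward()

# Caching predictions: the single loss compares every query against all 128
# cached docs, which is what a genuine batch of 128 would do.
# q_emb = torch.cat([encoder(c) for c in queries.split(32)])
# d_emb = torch.cat([encoder(c) for c in docs.split(32)])
# info_nce(q_emb, d_emb).backward()
```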

8