
koolaidman123 t1_j5uk2ai wrote

cache your predictions from each smaller batch along with their labels until you reach the target batch size, then run your loss function

so instead of calculating the loss and accumulating gradients like gradient accumulation does, you only calculate the loss once you reach the target batch size
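
A minimal sketch of what this looks like, assuming PyTorch, an `encoder` model, and an InfoNCE-style in-batch-negatives loss (the names, chunk size, and temperature are illustrative, not from the comment):

```python
import torch
import torch.nn.functional as F

def embed_in_chunks(encoder, big_batch, chunk_size):
    # Forward each small chunk and cache its embeddings.
    cached = [encoder(chunk) for chunk in big_batch.split(chunk_size)]
    return torch.cat(cached, dim=0)  # embeddings for the full target batch

def in_batch_contrastive_loss(query_emb, doc_emb, temperature=0.05):
    # Every other document in the big batch acts as a negative.
    logits = query_emb @ doc_emb.T / temperature
    labels = torch.arange(query_emb.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# The loss is computed once over the cached full batch, not per small chunk:
# q_emb = embed_in_chunks(encoder, queries, chunk_size=32)
# d_emb = embed_in_chunks(encoder, docs, chunk_size=32)
# loss = in_batch_contrastive_loss(q_emb, d_emb)
# loss.backward()  # keeps every chunk's activations alive (see the question below)
```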

10

rapist1 t1_j5xmv9n wrote

How do you implement the caching? You have to cache all the activations to do the backwards pass

3

[deleted] t1_j5w9rbv wrote

[deleted]

−8

koolaidman123 t1_j5wbk37 wrote

That's not the same thing...

Gradient accumulation calculates the loss on each sub-batch, so it doesn't work with in-batch negatives: you need to compare inputs from batch 1 against inputs from batch 2, hence offloading and caching the predictions, then calculating the loss over the one big batch

That's why gradient accumulation doesn't work to simulate large batch sizes for contrastive learning, if you're familiar with it
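
A rough sketch of the difference being described, with illustrative names and sizes assumed (PyTorch, InfoNCE-style loss, chunk size 32, target batch 128):

```python
import torch
import torch.nn.functional as F

def info_nce(q, d, temperature=0.05):
    logits = q @ d.T / temperature
    return F.cross_entropy(logits, torch.arange(q.size(0), device=q.device))

# Gradient accumulation: each sub-batch's loss only sees that sub-batch's
# negatives, so 4 chunks of 32 is NOT equivalent to one batch of 128.
# for q_chunk, d_chunk in zip(queries.split(32), docs.split(32)):
#     (info_nce(encoder(q_chunk), encoder(d_chunk)) / 4).backward()

# Caching predictions: the single loss compares every query against all 128
# cached docs, which is what a genuine batch of 128 would do.
# q_emb = torch.cat([encoder(c) for c in queries.split(32)])
# d_emb = torch.cat([encoder(c) for c in docs.split(32)])
# info_nce(q_emb, d_emb).backward()
```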

8