melgor89
melgor89 t1_j5u766t wrote
Reply to [D] Self-Supervised Contrastive Approaches that don’t use large batch size. by shingekichan1996
There is a great paper analyzing the correlation between batch size and accuracy. The authors propose a loss function that can train SimCLR with a batch size of 256 instead of 4k, so there is active research in this domain. https://arxiv.org/abs/2110.06848
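The core idea of the linked paper, as I understand it, is to "decouple" the positive pair from the denominator of the standard InfoNCE loss. A minimal sketch of that variant (function name, shapes, and the single-view simplification are my own illustrative assumptions, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def decoupled_contrastive_loss(z1, z2, temperature=0.1):
    """Hedged sketch of a decoupled InfoNCE-style loss.

    z1, z2: (N, D) embeddings of two augmented views of the same N images.
    Unlike plain InfoNCE, the positive-pair similarity is removed from the
    denominator, which is the mechanism said to reduce batch-size sensitivity.
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature      # (N, N) similarity matrix
    pos = logits.diag()                     # positive-pair similarities
    # Keep only off-diagonal (negative) terms in the denominator.
    neg_mask = ~torch.eye(len(z1), dtype=torch.bool)
    neg = torch.logsumexp(logits.masked_fill(~neg_mask, float("-inf")), dim=1)
    return (neg - pos).mean()
```

With the positive term out of the log-sum-exp, the gradient on a positive pair no longer shrinks as the batch of negatives grows, which is the intuition behind training at bs=256.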
melgor89 t1_j5u6pdr wrote
Reply to comment by mgwizdala in [D] Self-Supervised Contrastive Approaches that don’t use large batch size. by shingekichan1996
As stated in the topic, gradient accumulation is not a solution. However, gradient checkpointing could be. https://paperswithcode.com/method/gradient-checkpointing It recomputes some of the feature maps during the backward pass so that they are not stored in memory, which lets you fit a bigger batch size.
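A minimal sketch of this in PyTorch, using `torch.utils.checkpoint.checkpoint_sequential` (the toy model and sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A deep stack of blocks; without checkpointing, every intermediate
# activation is kept in memory for the backward pass.
model = nn.Sequential(
    *[nn.Sequential(nn.Linear(128, 128), nn.ReLU()) for _ in range(8)]
)
x = torch.randn(32, 128, requires_grad=True)

# Split the stack into 4 segments: only segment-boundary activations are
# stored, the rest are recomputed during backward, trading extra compute
# for memory so a larger batch fits on the GPU.
out = checkpoint_sequential(model, 4, x)
out.sum().backward()
```

The memory saving roughly scales with the number of segments, at the cost of one extra forward pass over the checkpointed blocks.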
melgor89 t1_iqrxupz wrote
Reply to comment by bernhard-lehner in [D] Things to do for effective ML teamwork at an early stage startup by coinfelix
I would say that a lack of documentation is one of the key issues in startups. For example, nobody knows why something was built a certain way, what the scores of the previous version were, or what the main issue with the last model was. Everything lives in somebody's mind, and when that researcher leaves the company, recreating the whole pipeline takes a lot of time.
So I would say proper documentation plus simple code, without unnecessary abstraction, is the key to moving a startup forward.
People say there is no time for writing documentation because there are more important tasks. From my perspective, that is short-term thinking, since you will later spend 3x more time figuring out why something was done a certain way. These are just my thoughts, based on 5 years in startups and 4 in corporations.
melgor89 t1_j6xufba wrote
Reply to [D] ImageNet normalization vs [-1, 1] normalization by netw0rkf10w
From my experience, they are equivalent nowadays, especially since models use BatchNorm or LayerNorm. Those layers also subtract a mean and divide by a standard deviation, which makes it largely irrelevant which input normalization you use. Given that, I prefer the TensorFlow convention ([-1, 1]) as it is the simpler one.
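For concreteness, a small sketch of the two schemes side by side (the mean/std constants are torchvision's standard ImageNet values; the [-1, 1] scheme is the TF-style one):

```python
import torch

img = torch.rand(3, 224, 224)  # pixel values already scaled to [0, 1]

# ImageNet per-channel normalization (torchvision convention).
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)
imagenet_norm = (img - mean) / std

# TF-style [-1, 1] normalization: a single global affine map.
tf_norm = img * 2.0 - 1.0
```

Both are just affine maps of the same input, so a BatchNorm/LayerNorm layer early in the network re-standardizes the activations either way; that is why the choice tends not to matter in practice.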