melgor89 t1_j6xufba wrote

From my experience, they are equivalent now, especially since we use BatchNorm or LayerNorm everywhere. Both normalization layers operate on mean and std values anyway, which makes it largely irrelevant which input normalization scheme you pick. That is why I prefer the TensorFlow approach: it is the simpler one.
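For context (this is my reading of the thread, not something stated above): the PyTorch convention normalizes inputs with per-channel dataset statistics, while the TensorFlow convention just rescales pixels to a fixed range. A minimal NumPy sketch of both, with the usual ImageNet statistics assumed:

```python
import numpy as np

# PyTorch-style: per-channel dataset statistics (ImageNet values here).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize_pytorch_style(img_uint8: np.ndarray) -> np.ndarray:
    """Scale to [0, 1], then subtract the per-channel mean and divide by std."""
    x = img_uint8.astype(np.float32) / 255.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD

def normalize_tf_style(img_uint8: np.ndarray) -> np.ndarray:
    """Simply rescale to [-1, 1]; no dataset statistics needed."""
    return img_uint8.astype(np.float32) / 127.5 - 1.0
```

With a BatchNorm or LayerNorm layer right after the input, either choice gets re-standardized anyway, which is the point above.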

3

melgor89 t1_j5u6pdr wrote

As said in the topic, gradient accumulation is not a solution. However, gradient checkpointing could be: https://paperswithcode.com/method/gradient-checkpointing It recomputes some of the feature maps during the backward pass so that they are not stored in memory, which lets you fit a bigger batch size.
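A minimal PyTorch sketch of the idea, using torch.utils.checkpoint.checkpoint_sequential (the toy model and layer sizes are made up for illustration):

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint_sequential

# Toy deep stack; with many layers, stored activations dominate GPU memory.
model = nn.Sequential(
    *[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(16)]
)

x = torch.randn(64, 1024, requires_grad=True)

# Split the stack into 4 segments: only the segment-boundary activations
# are kept; everything inside a segment is recomputed during the backward
# pass, trading extra compute for lower memory use.
out = checkpoint_sequential(model, 4, x)
out.sum().backward()
```

The memory saved on activations is what frees up room for a larger batch; the cost is roughly one extra forward pass through the checkpointed segments.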

1

melgor89 t1_iqrxupz wrote

I would say that lack of documentation is one of the key issues in startups. For example, nobody knows why something was built a certain way, what the scores of the previous version were, or what the main issue with the last model was. Everything lives in somebody's head, and when that researcher leaves the company, recreating the whole pipeline takes a lot of time.

So I would say proper documentation + simple code, without unnecessary abstraction, is the key to moving a startup forward.

People say there is no time for writing documentation because there are always more important tasks. From my perspective, that is short-term thinking: you will later spend 3x more time figuring out why something was done a certain way. These are just my thoughts, based on 5 years in startups and 4 in corporations.

2