
BrotherAmazing t1_iw18e2b wrote

Usually, if they share their dataset and problem with you, and you spend just a few hours on it with extensive experience designing and training deep NNs from scratch, you can find something incredibly simple (like plain learning rate decay) that works just as well as an alternative to gradient clipping — showing that clipping was only "crucial" for their particular setup, not "crucial" in general.
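
Purely as an illustrative sketch of what I mean (assuming a PyTorch setup — the model, data, and schedule below are toy placeholders, not anyone's actual training code): a plain exponential learning-rate decay with no clipping call at all.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in data so the sketch runs end to end.
xs = torch.randn(256, 128)
ys = torch.randint(0, 10, (256,))
train_loader = DataLoader(TensorDataset(xs, ys), batch_size=32, shuffle=True)

model = torch.nn.Linear(128, 10)            # stand-in for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Plain exponential decay: lr is multiplied by gamma once per epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(20):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        # Note: no torch.nn.utils.clip_grad_norm_ call here -- the decaying
        # learning rate is the only thing keeping update sizes in check.
        optimizer.step()
    scheduler.step()
```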

Often you can analyze the dataset to see which mini-batches produced gradients exceeding various thresholds, understand which training examples led to the large gradients and why, and then pre-process the data so clipping isn't needed at all. And since the whole thing is nonlinear, "cleaning up" the training set that way might completely invalidate their other hyperparameters.
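
Something like the following sketch is what I have in mind (again assuming PyTorch; the threshold, model, and data are hypothetical): log each mini-batch's gradient norm and record which sample indices pushed it over the would-be clipping threshold, so you can inspect and fix those examples instead of clipping.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

xs = torch.randn(256, 128)
ys = torch.randint(0, 10, (256,))
# Carry the sample index in each batch so offending examples can be traced.
dataset = TensorDataset(xs, ys, torch.arange(256))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

model = torch.nn.Linear(128, 10)
loss_fn = torch.nn.CrossEntropyLoss()
THRESHOLD = 5.0                 # hypothetical clipping threshold to audit

suspicious = []                 # (batch gradient norm, sample indices)
for x, y, idx in loader:
    model.zero_grad()
    loss_fn(model(x), y).backward()
    grad_norm = torch.norm(
        torch.stack([p.grad.norm() for p in model.parameters()])
    ).item()
    if grad_norm > THRESHOLD:
        suspicious.append((grad_norm, idx.tolist()))

# Inspect `suspicious` to see which raw examples (bad labels, outliers,
# scaling issues) drive the large gradients, then fix the data, not the grads.
for norm, indices in suspicious:
    print(f"grad norm {norm:.2f} from samples {indices}")
```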

Not saying this is what is going on here with this research group, but you'd be amazed how often it is: complex trial-and-error being done just to avoid debugging and understanding why the simpler approach that should have worked didn't.

3