Submitted by **hardmaru** t3_ys36do
in **MachineLearning**

## Comments

#
**maybelator**
t1_ivxgacq wrote

> Is the derivative of ReLU at 0.0 equal to NaN, 0 or 1?

The derivative of ReLU is not defined at 0, but its subdifferential is: it is the set [0,1].

You can pick any value in this set, and you end up with (stochastic) subgradient descent, which converges for small enough learning rates (to a critical point).

For ReLU, the non-differentiable points have measure 0 and are not "attractive", i.e. there is no reason for the iterate to end up exactly at 0, so they can be safely ignored. This is not the case for the L1 norm, for example, whose subdifferential at 0 is [-1,1]. It presents a "kink" at 0, as the subdifferential contains a neighborhood of 0, and hence is attractive: your iterate will get stuck there. In these cases, it is recommended to use proximal algorithms, typically forward-backward schemes.
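To make the contrast concrete, here is a minimal sketch in plain Python. It assumes a toy 1-D objective `0.5*(x - a)**2 + lam*abs(x)` that is not from the thread, and the names `soft_threshold`, `subgradient_step` and `forward_backward_step` are illustrative. With the values chosen, the minimizer is exactly 0.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |x|: shrink v toward 0 by t, clamping at 0."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def subgradient_step(x, a, lam, lr):
    # Subgradient of lam * |x| is lam * sign(x); we pick 0 at x == 0.
    sign = (x > 0) - (x < 0)
    return x - lr * ((x - a) + lam * sign)


def forward_backward_step(x, a, lam, lr):
    # Forward: gradient step on the smooth part 0.5 * (x - a)^2.
    # Backward: proximal step on the non-smooth part lam * |x|.
    return soft_threshold(x - lr * (x - a), lr * lam)


# Toy values (assumptions): since |a| <= lam, the minimizer is exactly 0.
a, lam, lr = 0.3, 1.0, 0.1
x_sub = x_prox = 1.0
for _ in range(200):
    x_sub = subgradient_step(x_sub, a, lam, lr)
    x_prox = forward_backward_step(x_prox, a, lam, lr)

print("subgradient descent:", x_sub)
print("forward-backward:   ", x_prox)
```

In this toy setting the forward-backward iterate reaches 0 and stays there, while the fixed-step subgradient iterate keeps hopping from one side of 0 to the other within a small band.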

#
**Phoneaccount25732**
t1_ivydmgs wrote

I want more comments like this.

#
**9182763498761234**
t1_ivy1mud wrote

Cool, thanks for sharing :-)

#
**robbsc**
t1_ivypqg0 wrote

Thanks for taking the time to type this out

#
**samloveshummus**
t1_iw1o1jg wrote

This has to be one of the most useful comments I've read in nearly ten years on Reddit! You must be a gifted teacher.

#
**[deleted]**
t1_iw2kgbt wrote

[deleted]

#
**zimonitrome**
t1_iwbmzoq wrote

Huber loss let's go.

#
**maybelator**
t1_iwbpkjo wrote

Not if you want true sparsity!

#
**zimonitrome**
t1_iwbst8p wrote

Can you elaborate?

#
**maybelator**
t1_iwbxutj wrote

The Huber loss encourages the regularized variable to be close to 0. However, this loss is also smooth: the amplitude of the gradient decreases as the variable nears its stationary point. As a consequence, it will have many coordinates close to 0 but not exactly 0. Achieving true sparsity requires thresholding, which adds a lot of other complications.

In contrast, the amplitude of the gradient of the L1 norm (absolute value in dim 1) remains the same no matter how close the variable gets to 0. The functional has a kink (the subdifferential contains a neighborhood of 0). As a consequence, if you use a well-suited optimization algorithm, the variable will have true sparsity, i.e. a lot of exact zeros.
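A small sketch of this difference, under assumptions that are mine rather than the thread's: each coordinate minimizes `0.5*(x - a)**2 + lam*penalty(x)`, the Huber penalty uses a hypothetical width `delta`, and the helper names are illustrative. The L1 case uses its closed-form proximal solution (soft-thresholding); the Huber case uses plain gradient descent.

```python
def soft_threshold(v, t):
    """Exact minimizer of 0.5 * (x - v)^2 + t * |x|."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def huber_grad(x, delta):
    # Derivative of the Huber penalty: x / delta near 0, sign(x) outside.
    return max(-1.0, min(1.0, x / delta))


def minimize_huber(a, lam, delta, lr=0.1, steps=500):
    # Plain gradient descent; the objective is smooth, so this converges,
    # but the stationary point is only 0 when a itself is 0.
    x = a
    for _ in range(steps):
        x -= lr * ((x - a) + lam * huber_grad(x, delta))
    return x


# Toy data (assumption): three small entries and two large ones.
targets = [0.05, -0.2, 1.5, 0.3, -2.0]
lam, delta = 0.5, 0.1

l1_solution = [soft_threshold(a, lam) for a in targets]
huber_solution = [minimize_huber(a, lam, delta) for a in targets]

l1_zeros = sum(1 for x in l1_solution if x == 0.0)
huber_zeros = sum(1 for x in huber_solution if x == 0.0)
print("L1:   ", l1_solution, "exact zeros:", l1_zeros)
print("Huber:", huber_solution, "exact zeros:", huber_zeros)
```

On this toy data the small coordinates come out exactly 0 under L1, while under Huber they shrink to small nonzero values (roughly `a / (1 + lam / delta)`).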

#
**zimonitrome**
t1_iwc14i5 wrote

Wow thanks for the explanation, it does make sense.

I had a pre-conception that all optimizers dealing with any linear functions (kinda like L1 norm) still produce values close to 0.

I can see someone disregarding tiny values when using said sparsity (pruning, quantization) but didn't think that it would be exactly 0.

#
**ThisIsMyStonerAcount**
t1_ivy34sr wrote

Knowing about subgradients (see other answers) is nice and all, but in the real world what matters is what your framework does. Last time I checked, both PyTorch and JAX say that the derivative of `max(x, 0)` is 0 when x=0.

#
**samloveshummus**
t1_iw1ofup wrote

Good point. But it's not the end of the world; those frameworks are open source, after all!

#
**Bot-69912020**
t1_ivxbxml wrote

I don't know about each specific implementation, but via the definition of subgradients you can get 'derivatives' of convex but non-differentiable functions (which ReLU is).

More formally: A subgradient at a point x of a convex function f is any x' such that f(y) >= f(x) + < x', y - x > for all y. The set of all possible subgradients at a point x is called the subdifferential of f at x.

For more details, see here.
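As a rough numerical illustration of the definition above, the following sketch checks the inequality f(y) >= f(x) + g*(y - x) for ReLU at x = 0 on a finite grid of test points. The helper name `is_subgradient` and the grid are assumptions for illustration; passing a finite grid is a necessary check, not a proof.

```python
def relu(x):
    return max(x, 0.0)


def is_subgradient(f, g, x, test_points, tol=1e-12):
    """Check f(y) >= f(x) + g * (y - x) on a finite set of points y."""
    return all(f(y) >= f(x) + g * (y - x) - tol for y in test_points)


# Grid of test points y in [-5, 5] (an arbitrary choice).
ys = [i / 10 for i in range(-50, 51)]

# Candidates inside [0, 1] should pass; candidates outside should fail.
inside = [g for g in (0.0, 0.25, 0.5, 1.0) if is_subgradient(relu, g, 0.0, ys)]
outside = [g for g in (-0.1, 1.1) if is_subgradient(relu, g, 0.0, ys)]

print("accepted:", inside)
print("accepted from outside [0, 1]:", outside)
```

This agrees with the subdifferential of ReLU at 0 being the interval [0, 1].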

#
**[deleted]**
t1_ivxslaq wrote

[deleted]

#
**elcric_krej**
t1_ivy6jf4 wrote

This is awesome in that it potentially removes a lot of random variance from the process of training. I think the rest of the benefits are comparatively small and safely ignorable.

I would love if it were picked up as a standard, it seems like the kind of thing that might get rid of a lot of the worst seed hacking out there.

But I'm an idiot, so I'm curious what well-informed people think about it.

#
**master3243**
t1_iw1h2h7 wrote

> potentially removes a lot of random variance from the process of training

You don't need the results of this paper for that.

One of my teams had a pipeline where every single script would initialize the seed of all random number generators (numpy, torch, Python's random) to 42.

This essentially removed non-machine-precision stochasticity between different training iterations with the same inputs.
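A minimal sketch of what such a seeding helper could look like. The function name `seed_everything` is hypothetical and this is not the team's actual script; numpy and torch are seeded only if they happen to be installed.

```python
import random


def seed_everything(seed=42):
    """Seed Python's random, and numpy / torch when they are available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass


# Re-seeding reproduces the same stream of random numbers.
seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
second = [random.random() for _ in range(3)]
print(first == second)
```

Seeding alone does not promise identical results across different hardware, library versions, or nondeterministic kernels, so a pinned environment is still needed for that.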

#
**bluevase1029**
t1_iw1khv8 wrote

I believe it's still difficult to be absolutely certain you have the same initialisation across multiple machines, versions of pytorch etc. I could be wrong though.

#
**master3243**
t1_iw1mpgb wrote

Definitely if each person has a completely different setup.

But that's why we containerize our setups and use a shared environment setup.

#
**elcric_krej**
t1_iw7hss0 wrote

I guess so, but that doesn't scale to more than one team (we did something similar), and arguably you want to test across multiple seeds, on the assumption that some init + model combinations are just very odd minima.

This seems to yield higher uniformity without constraining us on the rng.

But see /u/DrXaos for why not really

#
**DrXaos**
t1_iw7o3ef wrote

In my typical use, I’ve found that changing random init seeds (and also random seeds for shuffling examples during training, don’t forget that one) in many cases induces a larger variance on performance than many algorithmic or hyperparameter changes. Most prominently with imbalanced classification, which is often the reality of the valuable problem.

I guess it’s better to be lucky than smart.

Avoiding looking at the results of random init can make you think you’re smarter than you are and lead you to tell yourself false stories.

#
**DrXaos**
t1_iw03k6k wrote

I’m not entirely convinced it eliminates every random choice. There is usually a permutation symmetry on tabular inputs, and among hidden nodes.

If I’m reading it correctly, then for a single scalar output of a regressor or classifier coming from hiddens or inputs directly (logreg), it would set the coefficient of the first node to 1 and all others to 0, that being a truncated identity.

But what’s so special about that first element? Nothing. Same applies to the Hadamard matrices, it’s making one choice from an arbitrary ordering.

In my opinion, there still could/should be a random permutation of columns on interior weights and I might init the final linear layer of the classifier to equal but nonzero values like 1/sqrt(Nh), and with random sign if hidden activations are nonnegative like relu or sigmoid, instead of symmetric like tanh.

Maybe also random +1/-1 signs times random permutation times identity?

For that matter, any orthogonal rotation also preserves dynamical isometry, and so a random orthogonal before truncated identity should also work as init, and we’re back to an already existing suggested init method.
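A rough sketch of the variants suggested above, in plain Python. All function names are hypothetical and this is not the paper's initialization: it builds a truncated identity, applies a random column permutation with random +1/-1 signs, and checks that the rows stay orthonormal. It also builds equal-magnitude final-layer weights of size 1/sqrt(Nh) with random signs.

```python
import math
import random


def truncated_identity(rows, cols):
    return [[1.0 if i == j else 0.0 for j in range(cols)] for i in range(rows)]


def signed_permuted_identity(rows, cols, rng):
    """Truncated identity with randomly permuted, randomly signed columns."""
    perm = list(range(cols))
    rng.shuffle(perm)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(cols)]
    base = truncated_identity(rows, cols)
    # Column j of the result is signs[j] times column perm[j] of the base.
    return [[signs[j] * base[i][perm[j]] for j in range(cols)]
            for i in range(rows)]


def equal_magnitude_output_weights(n_hidden, rng):
    """Final-layer weights of magnitude 1/sqrt(n_hidden) with random sign."""
    scale = 1.0 / math.sqrt(n_hidden)
    return [rng.choice((-1.0, 1.0)) * scale for _ in range(n_hidden)]


def gram(m):
    """Compute M @ M^T for a list-of-lists matrix."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in m] for r1 in m]


rng = random.Random(0)
w = signed_permuted_identity(3, 5, rng)
g = gram(w)  # identity if the rows are orthonormal
v = equal_magnitude_output_weights(16, rng)
print(w)
print(g)
```

Permuting and sign-flipping columns leaves `M @ M^T` equal to the identity when rows <= cols, which is the sense in which the arbitrary "first element" choice can be randomized without losing orthonormal rows.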

Training for enhanced sparsity is interesting, though.

#
**samloveshummus**
t1_iw1oyhf wrote

>I would love if it were picked up as a standard, it seems like the kind of thing that might get rid of a lot of the worst seed hacking out there.

I don't want to be facetious, but what's wrong with "seed hacking"? Maybe that's a fundamental part of making a good model.

If we took someone other than Albert Einstein, and gave them the same education, the same career, the same influences and stresses, would that other person be equally likely to realise how to explain the photoelectric effect, Brownian motion, blackbody radiation, general relativity and E=mc^2? Or was there something special about Einstein's genes, meaning we need those initial conditions *and* that training schedule for it to work?

#
**machinelearner77**
t1_iw21k83 wrote

I guess the problem with "seed hacking" is just that it reduces trust in the proposed *method*. People want to build on methods that aren't brittle, and if the reported model performance depends (too) much on the random seed, it lowers trust in the *method* and makes people less likely to want to build on it.

#
**samloveshummus**
t1_iwhwh8y wrote

Sure, but maybe it's inescapable.

When we recruit for a job, we first select a candidate from CVs and interviews, and only once we've chosen a candidate do we begin training them.

Do you think it makes sense to strive for a recruitment process that will get perfect results from any candidate, so we can stop wasting time on interviews and just hire whoever? Or is it inevitable that we have to select among candidates before we begin the training? Why should it be different for computers?

#
**canbooo**
t1_ivx9yjn wrote

Very interesting stuff, just skimmed through and will definitely read more in depth but how does this break symmetry?

#
**jimmiebtlr**
t1_ivy59yw wrote

Haven’t read it yet, but wouldn’t symmetry only exist for 2 nodes if the input and output weights have the same 1s and 0s?

#
**canbooo**
t1_ivydtlt wrote

You are right and what I ask may be practically irrelevant and I really should rtfp. However, think about the edge case of 1 layer with 1 input and 1 output. Each node having 1 as input weight sees the same gradient, similar to the nodes having 0. Increasing the number of inputs makes it combinatorially improbable to have the same configuration, but increasing the number of nodes in a layer makes it likelier. So, it *could* be relevant for low dimensions or models with a narrow bottleneck. I am sure that the authors already thought about this problem and either discarded it as it is quite unlikely in their tested settings or they already have a solution/analysis somewhere in the paper, hence my question.
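The edge case can be sketched with a toy network, under assumptions that are mine: one input, one output, four ReLU hidden nodes, 0/1 weights in both layers, squared error, and full-batch gradient descent. This is not the paper's initialization, only an illustration of tied nodes.

```python
def relu(z):
    return max(z, 0.0)


def train(w, v, data, lr=0.02, steps=20):
    """Gradient descent on y = sum_k v[k] * relu(w[k] * x), squared error."""
    w, v = list(w), list(v)
    for _ in range(steps):
        grad_w = [0.0] * len(w)
        grad_v = [0.0] * len(v)
        for x, t in data:
            h = [relu(wk * x) for wk in w]
            err = sum(vk * hk for vk, hk in zip(v, h)) - t
            for k in range(len(w)):
                # ReLU derivative, taken as 0 at the kink.
                active = 1.0 if w[k] * x > 0 else 0.0
                grad_w[k] += err * v[k] * active * x
                grad_v[k] += err * h[k]
        w = [wk - lr * g for wk, g in zip(w, grad_w)]
        v = [vk - lr * g for vk, g in zip(v, grad_v)]
    return w, v


# Toy data and a 0/1 initialization (both assumptions for illustration).
data = [(1.0, 2.0), (2.0, 3.0), (-1.0, 0.5)]
w, v = train([1.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0], data)
print("input weights: ", w)
print("output weights:", v)
```

In this toy setting the two nodes that start with weight 1 receive identical gradients and remain copies of each other, and the two nodes that start at 0 receive zero gradient and never move.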

#
**vjb_reddit_scrap**
t1_ivymo0p wrote

IIRC Hinton et al had a paper about initializing RNNs with identity and it solved many problems that LSTM solves.

#
**DrXaos**
t1_iw04agd wrote

That’s a different scenario and clearly dynamically justified.

Any recurrent neural network is like a nonlinear dynamical system. Learning happens best on the boundary of dissipation vs chaos (exploding or vanishing gradients).

The additive incorporation of new info in LSTM/GRU greatly ameliorates the usual problem of RNNs with random transition matrices, where perturbations evolve multiplicatively. Initializing an RNN to a zero Lyapunov exponent through the identity is helpful.
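A toy illustration of the multiplicative evolution, assuming a purely linear recurrence h_{t+1} = W h_t with W a scaled identity (a simplification; real RNNs are nonlinear and use full matrices). The function name `perturbation_growth` is illustrative.

```python
import math


def perturbation_growth(scale, steps=100):
    """Evolve a small perturbation under h_{t+1} = (scale * I) h_t.

    Returns the norm growth ratio and the estimated Lyapunov exponent,
    log(ratio) / steps.
    """
    delta = [1e-3, -2e-3]
    start = math.hypot(*delta)
    for _ in range(steps):
        delta = [scale * d for d in delta]
    ratio = math.hypot(*delta) / start
    return ratio, math.log(ratio) / steps


for scale in (0.9, 1.0, 1.1):
    ratio, lyapunov = perturbation_growth(scale)
    print(f"scale={scale}: growth={ratio:.3e}, exponent={lyapunov:+.4f}")
```

With the identity (scale 1.0) the perturbation neither shrinks nor grows, so the estimated exponent is zero; a slightly smaller or larger scale makes it vanish or explode over 100 steps.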

#
**AnimaAnandkumar**
t1_iwe93vq wrote

Thank you for posting our paper. These slides sum up our work and how it removes degeneracy arising from identity initialization https://twitter.com/AnimaAnandkumar/status/1590963759954423810?s=20&t=8V3J8VOrbn1w-rZY_Lplqg


#
**martinkunev**
t1_ivxtknz wrote

The abstract looks very promising. I'm wondering why there is just 1 citation in 4 months. Is there a caveat?

#
**new_name_who_dis_**
t1_ivy2et6 wrote

Getting lots of citations a few months after your paper comes out only happens with papers written by famous researchers. Normal people need to work to get people to notice their research (which is why they are sharing it here now).

And usually a paper starts getting citations after it’s already been presented at a conference, where you can most easily promote it.

#
**terranop**
t1_ivyafa1 wrote

While what you are saying here is true, it doesn't really apply in this case because Anima Anandkumar *is* a famous researcher.

#
**new_name_who_dis_**
t1_ivybdq7 wrote

Oh, I didn’t know them. Still, if it’s only been out a few months, then for it to be cited it would have needed to be noticed by someone who was writing their next research paper and who has already published that paper.

Unless preprints on arxiv count. But even then it takes weeks if not months to do research and write a paper. So that leaves such a small window for possible citations at this point.

#
**samloveshummus**
t1_iw1qaft wrote

As well as what the other commenters are saying, sometimes deeper stuff takes longer to have an impact. If you look through the history of science (and human endeavor more generally), there are many famous examples of people whose work revolutionized our modern world, but who weren't recognized in their lifetime - society needed time to catch up.

Now I think we can do a lot better than that. We're a global civilization that communicates at lightspeed. However, we are still also big hairless apes with CPUs made of electric jelly, so we take a while to process things. The more unexpected, the more processing we need.

#
**lynnharry**
t1_iw9rfek wrote

Multiple reviewers pointed out that the empirical study is limited to a modified ResNet and two datasets.

#
**mikeful**
t1_ivxqhgi wrote

Neat. You could try to initialize them to 0.1 or 0.9 as it's unlikely that weights will stay at zero or one after training anyway.

#
**VinnyVeritas**
t1_iw0w76i wrote

Seems useless, why not simply fix the seed of the random generator for reproducibility?

#
**master3243**
t1_iw1hggt wrote

The problem is not random variance between trained models.

Check out the abstract, it answers why this work is useful.

#
**VinnyVeritas**
t1_iw1s40n wrote

Like what? Training ultra-deep neural networks without batchnorm? But in their experiments the accuracy gets worse with deeper networks, what's the point of going deeper to get worse results?

#
**master3243**
t1_iw1x35r wrote

> They theoretically show that, different from naive identity mapping, their initialization methods can avoid training degeneracy when the network dimension increases. In addition, they empirically show that they can achieve better performance than random initializations on image classification tasks, such as CIFAR-10 and ImageNet. They also show some nice properties of the model trained by their initialization methods, such as low-rank and sparse solutions.

#
**VinnyVeritas**
t1_iw9ajwe wrote

The performance is not better: the results are the same within the margin of error for standard (not super-deep) networks. Here I copied from their table:

| Dataset | ZerO Init | Kaiming Init |
|---|---|---|
| CIFAR-10 | 5.13 ± 0.08 | 5.15 ± 0.13 |
| ImageNet | 23.43 ± 0.04 | 23.46 ± 0.07 |
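A quick way to sanity-check the "within the margin of error" reading is to test whether the mean ± spread intervals overlap. This is only a crude check on the numbers quoted above, not a significance test, and the helper name `intervals_overlap` is illustrative.

```python
# Numbers copied from the comment above: (mean, spread) per init.
results = {
    "CIFAR-10": {"ZerO": (5.13, 0.08), "Kaiming": (5.15, 0.13)},
    "ImageNet": {"ZerO": (23.43, 0.04), "Kaiming": (23.46, 0.07)},
}


def intervals_overlap(a, b):
    """True if [mean_a - s_a, mean_a + s_a] and [mean_b - s_b, mean_b + s_b] overlap."""
    (mean_a, s_a), (mean_b, s_b) = a, b
    return abs(mean_a - mean_b) <= s_a + s_b


overlaps = {
    name: intervals_overlap(r["ZerO"], r["Kaiming"])
    for name, r in results.items()
}
print(overlaps)
```

On both datasets the intervals overlap, which matches the reading that the two initializations are indistinguishable at this precision.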

#
**PredictorX1**
t1_iw2dy5h wrote

How does this compare to Murray Smith's weight initialization (1993)?

#
**starfries**
t1_iw90r3p wrote

What is that? I can't find a copy online.

#
**finitearth**
t1_ivylqpb wrote

Guess who's back

#
**jrkirby**
t1_ivx9xjl wrote

What happens when all the weights to a ReLU neuron are 0? The ReLU function's derivative is discontinuous at zero. I figure in most practical situations this doesn't matter because the odds of many floating point numbers adding up to exactly 0.0 floating point is negligible. But this paper begs the question of what that would do. Is the derivative of ReLU at 0.0 equal to NaN, 0 or 1?