Submitted by hardmaru t3_ys36do in MachineLearning
Comments
Bot-69912020 t1_ivxbxml wrote
I don't know about each specific implementation, but via the definition of subgradients you can get 'derivatives' of convex but non-differentiable functions (which ReLU is).
More formally: a subgradient at a point x of a convex function f is any x' such that f(y) >= f(x) + <x', y - x> for all y. The set of all possible subgradients at a point x is called the subdifferential of f at x.
For more details, see here.
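As a quick numerical sanity check of that definition (my own toy snippet, not part of the comment above), every g in [0, 1] satisfies the subgradient inequality for ReLU at x = 0:

```python
# Check that any g in [0, 1] is a subgradient of ReLU at x = 0,
# i.e. relu(y) >= relu(0) + g * (y - 0) for all y.
def relu(x):
    return max(x, 0.0)

test_points = [-2.0, -0.5, 0.0, 0.5, 2.0]
for g in [0.0, 0.25, 0.5, 1.0]:
    ok = all(relu(y) >= relu(0.0) + g * y for y in test_points)
    print(g, ok)  # True for every g in [0, 1]
```

A value outside [0, 1] (say g = 1.5) violates the inequality at y = 0.5, which is why the subdifferential at 0 is exactly [0, 1].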
maybelator t1_ivxgacq wrote
> Is the derivative of ReLU at 0.0 equal to NaN, 0 or 1?
The derivative of ReLU is not defined at 0, but its subdifferential is, and it is the set [0,1].
You can pick any value in this set, and you end up with (stochastic) subgradient descent, which converges for small enough learning rates (to a critical point).
For ReLU, the discontinuities have mass 0 and are not "attractive", i.e. there is no reason for the iterates to end up exactly at 0, so it can be safely ignored. This is not the case for the L1 norm, for example, whose subgradient at 0 is [-1,1]. It presents a "kink" at 0, as the subdifferential contains a neighborhood of 0, and is hence attractive: your iterates will get stuck there. In these cases, it is recommended to use proximal algorithms, typically forward-backward schemes.
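A toy illustration of that difference (my own sketch, not from the comment): minimizing 0.5*(x - 0.3)^2 + |x|, whose exact minimizer is 0, with plain subgradient descent versus a forward-backward (proximal) scheme.

```python
# Minimize f(x) = 0.5*(x - 0.3)**2 + |x|; the exact minimizer is x = 0.
# Subgradient descent with a fixed step hovers near 0 without landing on it;
# a forward-backward (proximal) step snaps small values to exactly 0.
def soft_threshold(x, t):
    # proximal operator of t*|.|
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

lr = 0.1
x_sub = x_prox = 1.0
for _ in range(200):
    sub = 1.0 if x_sub > 0 else (-1.0 if x_sub < 0 else 0.0)  # a subgradient of |x|
    x_sub -= lr * ((x_sub - 0.3) + sub)                        # subgradient step
    x_prox = soft_threshold(x_prox - lr * (x_prox - 0.3), lr)  # forward-backward step
print(x_sub, x_prox)  # x_sub only hovers near 0; x_prox is exactly 0.0
```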
[deleted] t1_ivxslaq wrote
[deleted]
martinkunev t1_ivxtknz wrote
The abstract looks very promising. I'm wondering why there is just 1 citation in 4 months. Is there a caveat?
9182763498761234 t1_ivy1mud wrote
Cool, thanks for sharing :-)
new_name_who_dis_ t1_ivy2et6 wrote
Getting lots of citations a few months after your paper comes out only happens with papers written by famous researchers. Normal people need to work to get people to notice their research (which is why they are sharing it here now).
And usually a paper starts getting citations after it’s already been presented at a conference, where promoting it is easiest.
ThisIsMyStonerAcount t1_ivy34sr wrote
Knowing about subgradients (see other answers) is nice and all, but in the real world what matters is what your framework does. Last time I checked, both pytorch and jax say that the derivative of max(x, 0)
is 0 when x=0.
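That convention can be written down by hand (a sketch of the usual autodiff choice, not the actual framework source):

```python
# The usual framework convention for d/dx max(x, 0): pick 0 at the kink.
def relu_grad(x):
    # 1 for x > 0, 0 for x <= 0; the value at x == 0 is a convention,
    # since any number in [0, 1] would be a valid subgradient there.
    return 1.0 if x > 0 else 0.0

print(relu_grad(-1.0), relu_grad(0.0), relu_grad(2.0))  # 0.0 0.0 1.0
```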
jimmiebtlr t1_ivy59yw wrote
Haven’t read it yet, but wouldn’t the symmetry only exist for 2 nodes if the input and output weights have the same 1s and 0s?
elcric_krej t1_ivy6jf4 wrote
This is awesome in that it potentially removes a lot of random variance from the process of training, I think the rest of the benefits are comparatively small and safely ignorable.
I would love if it were picked up as a standard, it seems like the kind of thing that might get rid of a lot of the worst seed hacking out there.
But I'm an idiot, so I'm curious what well-informed people think about it.
terranop t1_ivyafa1 wrote
While what you are saying here is true, it doesn't really apply in this case because Anima Anandkumar is a famous researcher.
new_name_who_dis_ t1_ivybdq7 wrote
Oh, I didn’t know them. Still, if it’s only been out a few months, then for it to be cited it would need to have been noticed by someone writing their next research paper, and that paper would need to be published already.
Unless preprints on arXiv count. But even then, it takes weeks if not months to do research and write a paper. So that leaves only a small window for possible citations at this point.
Phoneaccount25732 t1_ivydmgs wrote
I want more comments like this.
canbooo t1_ivydtlt wrote
You are right, and what I ask may be practically irrelevant, and I really should rtfp. However, think about the edge case of one layer with one input and one output. Each node having 1 as its input weight sees the same gradient, and similarly for the nodes having 0. Increasing the number of inputs makes it combinatorially improbable to have the same configuration, but increasing the number of nodes in a layer makes it likelier. So it could be relevant for low dimensions or models with a narrow bottleneck. I am sure the authors already thought about this problem and either discarded it as quite unlikely in their tested settings, or they already have a solution/analysis somewhere in the paper, hence my question.
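The worry can be made concrete with a tiny hand-derived example (my own sketch, not from the paper): two hidden units that start with identical weights receive identical gradients, so gradient descent alone never breaks the tie.

```python
import numpy as np

# 1-input, 2-hidden-unit, 1-output ReLU net with both hidden units
# initialized identically; backprop by hand for one (x, y) pair.
x, y = 1.0, 0.5
w1 = np.array([1.0, 1.0])              # input weights, identical for both units
w2 = np.array([1.0, 1.0])              # output weights, identical for both units
h = np.maximum(w1 * x, 0.0)            # hidden activations
err = w2 @ h - y                       # prediction error
grad_w2 = err * h                      # dL/dw2
grad_w1 = err * w2 * (w1 * x > 0) * x  # dL/dw1 through the ReLU
print(grad_w1, grad_w2)                # each gradient has two equal components
```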
finitearth t1_ivylqpb wrote
Guess who's back
vjb_reddit_scrap t1_ivymo0p wrote
IIRC Hinton et al. had a paper about initializing RNNs with the identity, and it solved many of the problems that LSTMs solve.
robbsc t1_ivypqg0 wrote
Thanks for taking the time to type this out
DrXaos t1_iw03k6k wrote
I’m not entirely convinced it eliminates every random choice. There is usually a permutation symmetry on tabular inputs, and among hidden nodes.
If I’m reading it correctly, then for a single scalar output of a regressor or classifier coming from hiddens or inputs directly (logreg), it would set the coefficient of the first node to 1 and those of all others to 0, i.e. a truncated identity.
But what’s so special about that first element? Nothing. The same applies to the Hadamard matrices: it’s making one choice from an arbitrary ordering.
In my opinion, there still could/should be a random permutation of columns on interior weights and I might init the final linear layer of the classifier to equal but nonzero values like 1/sqrt(Nh), and with random sign if hidden activations are nonnegative like relu or sigmoid, instead of symmetric like tanh.
Maybe also random +1/-1 signs times random permutation times identity?
By that matter, any orthogonal rotation also preserves dynamical isometry, and so a random orthogonal before truncated identity should also work as init, and we’re back to an already existing suggested init method.
Training for enhanced sparsity is interesting, though.
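For what it's worth, the signed-permutation variant suggested above is easy to check (my own sketch; the dimensions are made up): composing a random sign flip and a random permutation with a truncated identity keeps the columns orthonormal, so the isometry argument goes through.

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_in = 8, 5
I_trunc = np.eye(n_out, n_in)                 # truncated identity init
P = np.diag(rng.choice([-1.0, 1.0], size=n_out))[rng.permutation(n_out)]
W = P @ I_trunc                               # random signs * random permutation * identity
print(np.allclose(W.T @ W, np.eye(n_in)))     # columns stay orthonormal: True
```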
DrXaos t1_iw04agd wrote
That’s a different scenario and clearly dynamically justified.
Any recurrent neural network is like a nonlinear dynamical system. Learning happens best on the boundary between dissipation and chaos (vanishing or exploding gradients).
The additive incorporation of new info in LSTM/GRU greatly ameliorates that usual problem of RNNs with random transition matrices, where perturbations evolve multiplicatively. An RNN initted to a zero Lyapunov exponent through the identity is helpful.
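A quick toy check of that boundary (mine, not from any paper): iterating a linear RNN step h <- W h with a raw Gaussian transition matrix blows up the state norm multiplicatively, while the identity preserves it exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
h0 = rng.normal(size=n)
W = rng.normal(size=(n, n))          # raw random transition matrix
h_rand, h_iden = h0.copy(), h0.copy()
for _ in range(50):
    h_rand = W @ h_rand              # perturbations grow multiplicatively
    h_iden = np.eye(n) @ h_iden      # identity: zero Lyapunov exponent
print(np.linalg.norm(h_rand) / np.linalg.norm(h0))  # astronomically large
print(np.linalg.norm(h_iden) / np.linalg.norm(h0))  # exactly 1.0
```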
VinnyVeritas t1_iw0w76i wrote
Seems useless, why not simply fix the seed of the random generator for reproducibility?
master3243 t1_iw1h2h7 wrote
> potentially removes a lot of random variance from the process of training
You don't need the results of this paper for that.
One of my teams had a pipeline where every single script would initialize the seed of all random number generators (numpy, torch, Python's random) to 42.
This essentially removed non-machine-precision stochasticity between different training iterations with the same inputs.
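A minimal version of such a seeding helper (a sketch along those lines, not their actual pipeline code; the torch calls are guarded since torch may not be installed):

```python
import random

import numpy as np

def seed_everything(seed: int = 42) -> None:
    # Seed every RNG the pipeline touches so repeated runs match
    # up to machine precision (nondeterministic GPU kernels aside).
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # torch not installed in this environment

seed_everything(42)
a = np.random.rand(3)
seed_everything(42)
b = np.random.rand(3)
print(np.array_equal(a, b))  # True: reseeding reproduces the draws exactly
```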
master3243 t1_iw1hggt wrote
The problem is not random variance between trained models.
Check out the abstract, it answers why this work is useful.
bluevase1029 t1_iw1khv8 wrote
I believe it's still difficult to be absolutely certain you have the same initialisation across multiple machines, versions of PyTorch, etc. I could be wrong though.
master3243 t1_iw1mpgb wrote
Definitely if each person has a completely different setup.
But that's why we containerize our setups and use a shared environment.
samloveshummus t1_iw1o1jg wrote
This has to be one of the most useful comments I've read in nearly ten years on Reddit! You must be a gifted teacher.
samloveshummus t1_iw1ofup wrote
Good point. But it's not the end of the world; those frameworks are open source, after all!
samloveshummus t1_iw1oyhf wrote
>I would love if it were picked up as a standard, it seems like the kind of thing that might get rid of a lot of the worst seed hacking out there.
I don't want to be facetious, but what's wrong with "seed hacking"? Maybe that's a fundamental part of making a good model.
If we took someone other than Albert Einstein, and gave them the same education, the same career, the same influences and stresses, would that other person be equally as likely to realise how to explain the photoelectric effect, Brownian motion, blackbody radiation, general relativity and E=mc^(2)? Or was there something special about Einstein's genes meaning we need those initial conditions and that training schedule for it to work.
samloveshummus t1_iw1qaft wrote
As well as what the other commenters are saying, sometimes deeper stuff takes longer to have an impact. If you look through the history of science (and human endeavor more generally), there are many famous examples of people whose work revolutionized our modern world, but who weren't recognized in their lifetime - society needed time to catch up.
Now I think we can do a lot better than that. We're a global civilization that communicates at lightspeed. However, we are still also big hairless apes with CPUs made of electric jelly, so we take a while to process things. The more unexpected, the more processing we need.
VinnyVeritas t1_iw1s40n wrote
Like what? Training ultra-deep neural networks without batchnorm? But in their experiments the accuracy gets worse with deeper networks, what's the point of going deeper to get worse results?
master3243 t1_iw1x35r wrote
> They theoretically show that, different from naive identity mapping, their initialization methods can avoid training degeneracy when the network dimension increases. In addition, they empirically show that they can achieve better performance than random initializations on image classification tasks, such as CIFAR-10 and ImageNet. They also show some nice properties of the model trained by their initialization methods, such as low-rank and sparse solutions.
machinelearner77 t1_iw21k83 wrote
I guess the problem with "seed hacking" is just that it reduces trust in the proposed method. People want to build on methods that aren't brittle, and if the presented model's performance depends (too) much on the random seed, that lowers trust in the method and makes people less likely to want to build on it.
PredictorX1 t1_iw2dy5h wrote
How does this compare to Murray Smith's weight initialization (1993)?
[deleted] t1_iw2kgbt wrote
[deleted]
elcric_krej t1_iw7hss0 wrote
I guess so, but that doesn't scale to more than one team (we did something similar), and arguably you want to test across multiple seeds anyway, in case some init + model pair just lands in a very odd minimum.
This seems to yield higher uniformity without constraining us on the rng.
But see /u/DrXaos for why not really
DrXaos t1_iw7o3ef wrote
In my typical use, I’ve found that changing random init seeds (and also random seeds for shuffling examples during training, don’t forget that one) in many cases induces a larger variance in performance than many algorithmic or hyperparameter changes. Most prominently with imbalanced classification, which is often the reality of the valuable problem.
I guess it’s better to be lucky than smart.
Avoiding looking at the results of random inits can make you think you’re smarter than you are, and you’ll tell yourself false stories.
starfries t1_iw90r3p wrote
What is that? I can't find a copy online.
VinnyVeritas t1_iw9ajwe wrote
The performance is not better: the results are the same within the margin of error for standard (not super-deep) networks. Here I copied from their table:

CIFAR-10:
ZerO Init: 5.13 ± 0.08
Kaiming Init: 5.15 ± 0.13

ImageNet:
ZerO Init: 23.43 ± 0.04
Kaiming Init: 23.46 ± 0.07
lynnharry t1_iw9rfek wrote
Multiple reviewers pointed out that the empirical study is only limited to a modified ResNet and two datasets.
zimonitrome t1_iwbmzoq wrote
Huber loss let's go.
maybelator t1_iwbpkjo wrote
Not if you want true sparsity!
zimonitrome t1_iwbst8p wrote
Can you elaborate?
maybelator t1_iwbxutj wrote
The Huber loss encourages the regularized variable to be close to 0. However, this loss is also smooth: the amplitude of the gradient decreases as the variable nears its stationary point. As a consequence, it will have many coordinates close to 0, but not exactly 0. Achieving true sparsity then requires thresholding, which adds a lot of other complications.
In contrast, the amplitude of the gradient of the L1 norm (the absolute value in dimension 1) remains the same no matter how close the variable gets to 0. The functional has a kink (the subgradient contains a neighborhood of 0). As a consequence, if you use a well-suited optimization algorithm, the variable will have true sparsity, i.e. a lot of exact 0s.
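To make the contrast concrete (a toy example of mine, with a made-up delta = 0.5): gradient descent on 0.5*(x - 0.3)^2 plus a Huber penalty stalls at a nonzero stationary point, whereas the same problem with an L1 penalty has its minimizer at exactly 0.

```python
# Gradient descent on 0.5*(x - 0.3)**2 + huber(x, delta=0.5).
# The Huber penalty is quadratic near 0, so its pull fades as x -> 0 and
# the iterate settles at a nonzero value (here x* = 0.1), never exactly 0.
def huber_grad(x, delta=0.5):
    # gradient of the Huber penalty: like |x| far from 0, like x**2/(2*delta) near 0
    return x / delta if abs(x) <= delta else (1.0 if x > 0 else -1.0)

x, lr = 1.0, 0.1
for _ in range(2000):
    x -= lr * ((x - 0.3) + huber_grad(x))
print(x)  # ~0.1: small but nonzero, i.e. no true sparsity
```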
zimonitrome t1_iwc14i5 wrote
Wow thanks for the explanation, it does make sense.
I had a preconception that all optimizers dealing with piecewise-linear functions (like the L1 norm) still produce values only close to 0.
I can see someone disregarding tiny values when using said sparsity (pruning, quantization) but didn't think that it would be exactly 0.
AnimaAnandkumar t1_iwe93vq wrote
Thank you for posting our paper. These slides sum up our work and how it removes degeneracy arising from identity initialization https://twitter.com/AnimaAnandkumar/status/1590963759954423810?s=20&t=8V3J8VOrbn1w-rZY_Lplqg
samloveshummus t1_iwhwh8y wrote
Sure, but maybe it's inescapable.
When we recruit for a job, we first select a candidate from CVs and interviews, and only once we've chosen a candidate do we begin training them.
Do you think it makes sense to strive for a recruitment process that will get perfect results from any candidate, so we can stop wasting time on interviews and just hire whoever? Or is it inevitable that we have to select among candidates before we begin the training? Why should it be different for computers?
jrkirby t1_ivx9xjl wrote
What happens when all the weights to a ReLU neuron are 0? The ReLU function's derivative is discontinuous at zero. I figure in most practical situations this doesn't matter, because the odds of many floating point numbers adding up to exactly 0.0 are negligible. But this paper raises the question of what that would do. Is the derivative of ReLU at 0.0 equal to NaN, 0, or 1?