master3243 t1_iw1h2h7 wrote on November 12, 2022 at 5:17 AM

> potentially removes a lot of random variance from the process of training

You don't need the results of this paper for that.

One of my teams had a pipeline where every single script would initialize the seed of all random number generators (numpy, torch, Python's random) to 42.

This removed essentially all stochasticity, beyond machine-precision effects, between different training runs with the same inputs.
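A minimal sketch of that kind of seeding helper, for illustration only. The name `seed_everything` is hypothetical and not from the pipeline described above, and numpy and torch are seeded only if they happen to be installed.

```python
import random


def seed_everything(seed: int = 42) -> None:
    """Seed the RNGs mentioned above: Python's random, numpy, and torch."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)  # seeds numpy's legacy global RNG
    except ImportError:
        pass  # numpy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)  # seeds torch's default generator
    except ImportError:
        pass  # torch not installed; nothing to seed


# Call once at the top of every script, before any data loading or model init.
seed_everything(42)
```

Seeding alone does not cover every source of nondeterminism (for example some GPU kernels), so this is a starting point rather than a guarantee.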

bluevase1029 t1_iw1khv8 wrote on November 12, 2022 at 5:55 AM

I believe it's still difficult to be absolutely certain you have the same initialisation across multiple machines, versions of pytorch etc. I could be wrong though.

master3243 t1_iw1mpgb wrote on November 12, 2022 at 6:22 AM

Definitely if each person has a completely different setup.

But that's why we containerize our setups and use a shared environment setup.

elcric_krej t1_iw7hss0 wrote on November 13, 2022 at 3:41 PM

I guess so, but that doesn't scale to more than one team (we did something similar), and arguably you want to test across multiple seeds, on the assumption that some init + model combinations just land in very odd minima.

This seems to yield higher uniformity without constraining us on the rng.

But see /u/DrXaos's comment for why that's not really the case.

DrXaos t1_iw7o3ef wrote on November 13, 2022 at 4:25 PM

In my typical use, I’ve found that changing random init seeds (and also random seeds for shuffling examples during training, don’t forget that one) in many cases induces a larger variance in performance than many algorithmic or hyperparameter changes. This is most prominent with imbalanced classification, which is often the reality of the valuable problem.

I guess it’s better to be lucky than smart.

Avoiding looking at the results of random init can make you think you’re smarter than you are, and you will end up telling yourself false stories.
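One way to keep yourself honest here is to report every configuration as a mean and spread over several seeds. The sketch below is only illustrative: `train_and_evaluate` is a hypothetical stand-in for a real training run, and its numbers are synthetic noise, not results.

```python
import random
import statistics


def train_and_evaluate(seed: int, learning_rate: float) -> float:
    """Hypothetical stand-in for a real training run; returns a synthetic score."""
    rng = random.Random(seed)
    # Synthetic: the seed-driven noise is deliberately larger than the
    # effect of the hyperparameter, mimicking the situation described above.
    return 0.80 + 0.5 * learning_rate + rng.gauss(0.0, 0.02)


def summarize(learning_rate: float, seeds) -> tuple:
    """Run one configuration over several seeds; return (mean, stdev) of the score."""
    scores = [train_and_evaluate(s, learning_rate) for s in seeds]
    return statistics.mean(scores), statistics.stdev(scores)


seeds = range(10)
for lr in (0.01, 0.02):
    mean, std = summarize(lr, seeds)
    print(f"lr={lr}: {mean:.3f} +/- {std:.3f} over {len(seeds)} seeds")
```

If the gap between two configurations is smaller than the spread across seeds, a single-seed comparison says very little.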

DrXaos t1_iw03k6k wrote on November 11, 2022 at 10:13 PM

I’m not entirely convinced it eliminates every random choice. There is usually a permutation symmetry on tabular inputs, and among hidden nodes.

If I’m reading it correctly, then for a single scalar output of a regressor or classifier fed by hidden units or directly by the inputs (logreg), the truncated identity would set the coefficient of the first node to 1 and all the others to 0.

But what’s so special about that first element? Nothing. The same applies to the Hadamard matrices: it’s making one choice from an arbitrary ordering.

In my opinion, there still could/should be a random permutation of columns on interior weights. I might also init the final linear layer of the classifier to equal but nonzero values like 1/sqrt(Nh), with random sign if the hidden activations are nonnegative (like relu or sigmoid) rather than symmetric (like tanh).

Maybe also random +1/-1 signs times random permutation times identity?

For that matter, any orthogonal rotation also preserves dynamical isometry, and so a random orthogonal before truncated identity should also work as init, and we’re back to an already existing suggested init method.
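A rough sketch of the two variations suggested in this comment, written with plain Python lists to stay dependency-free. This is not the ZerO paper's method; the function names are hypothetical and `Nh` is represented by `n_hidden`.

```python
import math
import random


def signed_permuted_truncated_identity(n_out: int, n_in: int, rng: random.Random):
    """Truncated identity (n_out x n_in) with columns randomly permuted and signs flipped."""
    cols = list(range(n_in))
    rng.shuffle(cols)  # random permutation of the input columns
    weight = [[0.0] * n_in for _ in range(n_out)]
    for row in range(min(n_out, n_in)):
        # exactly one nonzero entry per row, with a random +1/-1 sign
        weight[row][cols[row]] = rng.choice((-1.0, 1.0))
    return weight


def equal_magnitude_output_layer(n_hidden: int, rng: random.Random, random_sign: bool = True):
    """Final-layer weights of equal magnitude 1/sqrt(n_hidden), optionally with random sign."""
    scale = 1.0 / math.sqrt(n_hidden)
    return [scale * (rng.choice((-1.0, 1.0)) if random_sign else 1.0)
            for _ in range(n_hidden)]


rng = random.Random(0)
w_hidden = signed_permuted_truncated_identity(3, 5, rng)
w_out = equal_magnitude_output_layer(16, rng)  # random signs, e.g. after relu
```

The random-orthogonal variant would swap the signed permutation for a random orthogonal matrix, which is essentially the existing orthogonal init.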

Training for enhanced sparsity is interesting, though.

samloveshummus t1_iw1oyhf wrote on November 12, 2022 at 6:51 AM

>I would love if it were picked up as a standard, it seems like the kind of thing that might get rid of a lot of the worst seed hacking out there.

I don't want to be facetious, but what's wrong with "seed hacking"? Maybe that's a fundamental part of making a good model.

If we took someone other than Albert Einstein, and gave them the same education, the same career, the same influences and stresses, would that other person be equally likely to realise how to explain the photoelectric effect, Brownian motion, blackbody radiation, general relativity and E=mc^2? Or was there something special about Einstein's genes, meaning we need those initial conditions and that training schedule for it to work?

machinelearner77 t1_iw21k83 wrote on November 12, 2022 at 9:51 AM

I guess the problem with "seed hacking" is just that it reduces trust in the proposed method. People want to build on methods that aren't brittle, and if the reported model performance depends (too) much on the random seed, it lowers trust in the method and makes people less likely to want to build on it.

samloveshummus t1_iwhwh8y wrote on November 15, 2022 at 7:32 PM

Sure, but maybe it's inescapable.

When we recruit for a job, we first select a candidate from CVs and interviews, and only once we've chosen a candidate do we begin training them.

Do you think it makes sense to strive for a recruitment process that will get perfect results from any candidate, so we can stop wasting time on interviews and just hire whoever? Or is it inevitable that we have to select among candidates before we begin the training? Why should it be different for computers?

[R] ZerO Initialization: Initializing Neural Networks with only Zeros and Ones

elcric_krej t1_ivy6jf4 wrote on November 11, 2022 at 2:21 PM