elcric_krej t1_ivy6jf4 wrote

This is awesome in that it potentially removes a lot of random variance from the process of training. I think the rest of the benefits are comparatively small and can safely be ignored.

I would love if it were picked up as a standard, it seems like the kind of thing that might get rid of a lot of the worst seed hacking out there.

But I'm an idiot, so I'm curious what well-informed people think about it.

13

master3243 t1_iw1h2h7 wrote

> potentially removes a lot of random variance from the process of training

You don't need the results of this paper for that.

One of my teams had a pipeline where every single script would set the seed of all random number generators (numpy, torch, Python's random) to 42.

This essentially removed all stochasticity, apart from machine-precision effects, between different training runs with the same inputs.
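
A minimal sketch of what such a seeding helper might look like. The name `seed_everything` is hypothetical and this is not the team's actual script; NumPy and PyTorch are treated as optional and are only seeded if they are installed.

```python
import random


def seed_everything(seed: int = 42) -> None:
    """Seed every RNG the process might draw from (hypothetical helper)."""
    random.seed(seed)  # Python's built-in generator
    try:
        import numpy as np
        np.random.seed(seed)  # NumPy's legacy global generator
    except ImportError:
        pass  # NumPy not installed, nothing to seed
    try:
        import torch
        torch.manual_seed(seed)  # PyTorch's default generators
    except ImportError:
        pass  # PyTorch not installed, nothing to seed


# Re-seeding reproduces the same draws within one environment.
seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
second = [random.random() for _ in range(3)]
print(first == second)  # True
```

Seeding like this only pins down the generators; it does not by itself guarantee identical results across different machines or library versions.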

5

bluevase1029 t1_iw1khv8 wrote

I believe it's still difficult to be absolutely certain you have the same initialisation across multiple machines, versions of PyTorch, etc. I could be wrong though.

4

master3243 t1_iw1mpgb wrote

Definitely if each person has a completely different setup.

But that's why we containerize our setups and use a shared environment.

2

elcric_krej t1_iw7hss0 wrote

I guess so, but that doesn't scale to more than one team (we did something similar), and arguably you want to test across multiple seeds anyway, on the assumption that some init + model combinations just end up in very odd minima.

This seems to yield higher uniformity without constraining us on the rng.

But see /u/DrXaos's comment for why it doesn't really.

1

DrXaos t1_iw7o3ef wrote

In my typical use, I’ve found that changing random init seeds (and also random seeds for shuffling examples during training, don’t forget that one) in many cases induces a larger variance in performance than many algorithmic or hyperparameter changes. This is most prominent with imbalanced classification, which is often the reality of the valuable problems.

I guess it’s better to be lucky than smart.

Avoiding looking at the results of random init can make you think you’re smarter than you are, and you end up telling yourself false stories.
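
One way to keep yourself honest here is to report a mean and spread over several seeds for each setting. The sketch below uses only the standard library; `train_and_eval` is a hypothetical stand-in that returns a made-up score (a small hyperparameter effect plus larger seed noise), not a real training run, and the numbers in it are assumptions chosen for illustration.

```python
import random
import statistics
from typing import Iterable, Tuple


def train_and_eval(seed: int, learning_rate: float) -> float:
    """Hypothetical stand-in for a real training run; returns a fake score."""
    rng = random.Random(seed)
    # Toy assumption: a 0.01 gain from the hyperparameter, seed noise with std 0.03.
    return 0.80 + 0.01 * (learning_rate == 0.01) + rng.gauss(0.0, 0.03)


def sweep(learning_rate: float, seeds: Iterable[int]) -> Tuple[float, float]:
    """Run one setting over several seeds and return (mean, sample std)."""
    scores = [train_and_eval(s, learning_rate) for s in seeds]
    return statistics.mean(scores), statistics.stdev(scores)


for lr in (0.01, 0.1):
    mean, std = sweep(lr, range(10))
    print(f"lr={lr}: {mean:.3f} +/- {std:.3f} over 10 seeds")
```

If the gap between two settings is smaller than the spread across seeds, a single-seed comparison says very little about which one is better.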

1

DrXaos t1_iw03k6k wrote

I’m not entirely convinced it eliminates every random choice. There is usually a permutation symmetry on tabular inputs, and among hidden nodes.

If I’m reading it correctly, then for a single scalar output of a regressor or classifier coming from hiddens or inputs directly (logreg), it would set the coefficient of the first node to 1 and all others to 0, that being a truncated identity.

But what’s so special about that first element? Nothing. The same applies to the Hadamard matrices: it’s making one choice from an arbitrary ordering.

In my opinion, there still could/should be a random permutation of columns on interior weights and I might init the final linear layer of the classifier to equal but nonzero values like 1/sqrt(Nh), and with random sign if hidden activations are nonnegative like relu or sigmoid, instead of symmetric like tanh.

Maybe also random +1/-1 signs times random permutation times identity?

For that matter, any orthogonal rotation also preserves dynamical isometry, so a random orthogonal matrix applied before the truncated identity should also work as an init, and we’re back to an init method that has already been suggested.
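
A rough sketch of the two suggestions above, in plain Python with nested lists. This is one reading of the comment, not the paper's method, and the function names are hypothetical. A signed permutation is itself orthogonal, so when the matrix has at least as many columns as rows, each row keeps exactly one ±1 entry and the rows stay orthonormal.

```python
import math
import random


def truncated_identity(rows, cols):
    """rows x cols matrix with ones on the main diagonal, zeros elsewhere."""
    return [[1.0 if r == c else 0.0 for c in range(cols)] for r in range(rows)]


def signed_permuted_identity(rows, cols, rng):
    """Truncated identity with columns randomly permuted and sign-flipped."""
    base = truncated_identity(rows, cols)
    perm = list(range(cols))
    rng.shuffle(perm)  # random column order
    signs = [rng.choice((-1.0, 1.0)) for _ in range(cols)]  # random +1/-1 per column
    return [[signs[c] * row[perm[c]] for c in range(cols)] for row in base]


def final_layer_init(n_hidden, rng, random_sign=True):
    """Equal-magnitude weights 1/sqrt(n_hidden), optionally with random signs."""
    scale = 1.0 / math.sqrt(n_hidden)
    return [scale * (rng.choice((-1.0, 1.0)) if random_sign else 1.0)
            for _ in range(n_hidden)]


rng = random.Random(0)
W = signed_permuted_identity(3, 5, rng)  # hidden-layer weights
w_out = final_layer_init(5, rng)         # final linear layer, unit L2 norm
for row in W:
    print(row)
```

Note that this deliberately reintroduces a seed, which is the point being made: the choice of which unit gets the 1 is arbitrary either way.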

Training for enhanced sparsity is interesting, though.

3

samloveshummus t1_iw1oyhf wrote

>I would love if it were picked up as a standard, it seems like the kind of thing that might get rid of a lot of the worst seed hacking out there.

I don't want to be facetious, but what's wrong with "seed hacking"? Maybe that's a fundamental part of making a good model.

If we took someone other than Albert Einstein, and gave them the same education, the same career, the same influences and stresses, would that other person be equally likely to realise how to explain the photoelectric effect, Brownian motion, blackbody radiation, general relativity and E=mc²? Or was there something special about Einstein's genes, meaning we need those initial conditions and that training schedule for it to work?

0

machinelearner77 t1_iw21k83 wrote

I guess the problem with "seed hacking" is just that it reduces trust in the proposed method. People want to build on methods that aren't brittle, and if a presented model's performance depends (too) much on the random seed, it lowers trust in the method and makes people less likely to want to build on it.

3

samloveshummus t1_iwhwh8y wrote

Sure, but maybe it's inescapable.

When we recruit for a job, we first select a candidate from CVs and interviews, and only once we've chosen a candidate do we begin training them.

Do you think it makes sense to strive for a recruitment process that will get perfect results from any candidate, so we can stop wasting time on interviews and just hire whoever? Or is it inevitable that we have to select among candidates before we begin the training? Why should it be different for computers?

1