Submitted by optimized-adam t3_zay9gt in MachineLearning

In many papers, no confidence estimates are reported at all (one has to assume that the best results for the authors' own method are the ones reported). In other papers, the min/max or the standard deviation is reported alongside the mean. Even more seldom, the mean and the standard error of the mean are reported. Once in a blue moon, an actual statistical test is run.

Given that there plainly is no consensus in the field on how to handle this issue, what is the best way to do it in your opinion?

15

Comments


abio93 t1_iyqjfv5 wrote

In an ideal world, the code and the intermediate results (ALL of them, including the ones not used in the final paper) should be available.

8

SufficientStautistic t1_iyqag72 wrote

I am always delighted to see a median and the accompanying 5% and 95% quantiles at each validation step/end of each epoch. This is more helpful to me than some multiple of the s.d. A mean with SE goes a lot further than many papers, so even that I will take; just give us some measure of variance, for the love of god haha.
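
For concreteness, a minimal NumPy sketch of those summary statistics, assuming `scores` holds one test metric per independently seeded run (the numbers are made up):

```python
import numpy as np

scores = np.array([0.812, 0.805, 0.821, 0.798, 0.817])  # hypothetical per-seed results

median = np.median(scores)
q05, q95 = np.quantile(scores, [0.05, 0.95])       # 5% and 95% quantiles
mean = scores.mean()
sem = scores.std(ddof=1) / np.sqrt(len(scores))    # standard error of the mean

print(f"median {median:.3f} [{q05:.3f}, {q95:.3f}], mean {mean:.3f} +/- {sem:.3f} (SE)")
```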

The answer saying that random weight initialization is not ideal is a good one; it's a pain both for reproducibility and for other reasons (saw you ask about this in that thread - the variance of random initialisation has to be tuned based on depth so that the input-output condition number is about 1, otherwise learning is less likely to proceed as quickly, or at all). Several deterministic initialisation procedures have been proposed over the years. Here is one from last year that yielded promising results and had some theoretical rationale: https://arxiv.org/abs/2110.12661

Unfortunately their proposed approach isn't available out-of-the-box with TF or PyTorch, but it shouldn't be too tough to implement by hand if you have the time.
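
To be clear, the sketch below is not the scheme from the linked paper; it just shows the mechanics of plugging a custom, seed-free initializer into a PyTorch model via `Module.apply()`, which is where a hand implementation would slot in:

```python
import torch
import torch.nn as nn

def deterministic_init(module: nn.Module) -> None:
    """Toy deterministic init (identity-like weights, zero biases).
    A stand-in for illustration, not the paper's method."""
    if isinstance(module, nn.Linear):
        with torch.no_grad():
            nn.init.eye_(module.weight)          # fills a 2D identity, non-square is fine
            if module.bias is not None:
                nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
model.apply(deterministic_init)                  # no RNG involved anywhere
```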

2

optimized-adam OP t1_iyqdvbc wrote

Thank you for your answer! Isn't the SE just the sample standard deviation divided by the square root of n?
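
For reference, that is indeed the usual definition of the standard error of the mean over $n$ runs:

$$\mathrm{SE} = \frac{s}{\sqrt{n}}, \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2},$$

where $x_1, \dots, x_n$ are the scores from the $n$ independent runs and $s$ is their sample standard deviation.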

2

xtof54 t1_iyq2oxi wrote

good question, but it depends on whether this source of randomness differs between the two models being compared at test time, or in other words, what kind of generalization you want to support.

this contrasts with variability due to data sampling, where we all assume the data are iid, and so a confidence interval is usually computed.

one way is to fix a seed, compare the models trained with that same seed, report significance with respect to data sampling, then repeat with new seeds and globally report the proportion of seeds for which the difference is significant.
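
As a hedged sketch of that protocol (with `evaluate_a`/`evaluate_b` as hypothetical callables returning per-example test scores for the two models trained with a given seed):

```python
import numpy as np
from scipy.stats import wilcoxon

def significance_across_seeds(evaluate_a, evaluate_b, seeds, alpha=0.05):
    """For each shared seed, run a paired test on per-example scores of the two
    models, then report the proportion of seeds where the difference is significant."""
    n_significant = 0
    for seed in seeds:
        scores_a = np.asarray(evaluate_a(seed))    # per-example scores, same test split
        scores_b = np.asarray(evaluate_b(seed))
        _, p_value = wilcoxon(scores_a, scores_b)  # paired, non-parametric test
        n_significant += p_value < alpha
    return n_significant / len(seeds)
```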

but we shouldn't pay too much attention to statistical significance; too many people use it as a 'flag of truth', while all experiments are biased anyway, so it's better to always be suspicious and build confidence over time.

1

Phoneaccount25732 t1_iyq5oab wrote

I think it's fine the way it is now. ML models are very statistical physics-y. Variation from run to run is extremely low.

1

Superschlenz t1_iypuia6 wrote

In an optimal world there would be no random weight initialisation or other uses of pseudo-random number generators.

0

Oceanboi t1_iypwn9c wrote

Could you elaborate on why? Just curious. What is the alternative?

1

Superschlenz t1_iypxi3i wrote

>Could you elaborate on why?

Because random noise basically means "We do not understand the real causes," and a solution cannot be optimal if different random seeds lead to different performance results.

>What is the alternative?

I am not competent enough to answer that, but basically the random seed is a hyperparameter, and an optimal learning algorithm should have no hyperparameters at all, so that everything depends on the user's data and learning is not hampered by the developer's wrong hyperparameter choices. Maybe Bayesian optimization, with a yet-to-be-invented way of stacking it against the curse of high-dimensional data.

0

Oceanboi t1_iyq03g2 wrote

Why do you say an optimal learning algorithm should have zero hyperparameters? Are you saying an optimal neural network would learn things like batch size, learning rate, the optimal optimizer (lol), input size, etc., on its own? In that case, wouldn't a model with zero hyperparameters be conceptually the same as a model that has been tuned to the optimal hyperparameter combination?

Theoretically you could make these hyperparameters trainable if you had the coding chops, so why are we, as a community, still tweaking hyperparameters iteratively?

1

Superschlenz t1_iyq5oy5 wrote

>Why do you say an optimal learning algorithm should have zero hyperparameters?

Because hyperparameters are fixed by the developer, and so the developer must know the user's environment in order to tune them, but if it requires a developer then it is programming and not learning.

>Are you saying an optimal neural network would learn things like batch size, learning rate, optimal optimizer (lol), input size, etc, on its own?

An optimal learning algorithm wouldn't have those hyperparameters at all, not even static hardware.

>In this case wouldn't a model with zero hyperparameters be the same conceptually as a model that has been tuned to the optimal hyperparameter combination?

Users do not tune hyperparameters, and developers do not know the user's environment. The agent can be broadly pretrained at the developer's laboratory to speed up learning at the user's site, but ultimately it has to learn on its own at the user's site without a developer being around.

>Theoretically you could make these hyperparameters trainable if you had the coding chops, so why are we still as a community tweaking hyperparameters iteratively?

Because you as a community were forced to decide on a career when you were 14 years old, and you chose to become a machine learning engineer because you were more talented than others, and now you are performing the show of the useful engineer.

−1

Optimal-Asshole t1_iyqtp3l wrote

No, the reason for hyperparameter optimization isn't job security. It's that choosing better hyperparameters produces better results, which leads to more success in applications. There are people working on automatic hyperparameter optimization.
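
For example, here is a minimal sketch of automatic hyperparameter search with Optuna; the `train_and_evaluate` stub is a hypothetical placeholder for a real training run:

```python
import optuna

def train_and_evaluate(lr: float, batch_size: int) -> float:
    """Placeholder standing in for an actual training run; returns a validation score."""
    return 1.0 / (1.0 + abs(lr - 3e-4)) - 0.001 * (batch_size / 128)

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128])
    return train_and_evaluate(lr, batch_size)

study = optuna.create_study(direction="maximize")  # TPE sampler by default
study.optimize(objective, n_trials=50)
print(study.best_params)
```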

But let’s not act like it’s solely due to some community-caused phenomenon and engineers putting on a show. Honestly, your message comes off as a little bitter.

2