Submitted by DreamyPen t3_zsbivc in MachineLearning

I have collected experimental data for various conditions. To ensure repeatability, each test is replicated 5 times, which means the same input produces slightly different outputs due to experimental variability.

If you were to build a machine learning model, would you use all 5 data points for each test, hoping that the algorithm learns to converge towards the mean response? Or is it advisable to pre-compute the means and feed only those to the model (so that each input maps to exactly one output)?

I can see pros and cons to both approaches and would welcome feedback. Thank you.

50

Comments

gBoostedMachinations t1_j17a22i wrote

  1. Set aside a validation set
  2. Use the rest of the data to train two models: One using the duplicates and one using the pre-computed means.
  3. Compare performance on the validation set.

Don’t put very much weight on other people’s intuitions about these kinds of questions. Just test it. Your question is an empirical one, so do the experiment. I can’t tell you how many times I’ve had a colleague say that something I was trying wasn’t going to work, only to see that he was dead wrong when I tested it anyway. Oh man, do I love it when that happens.
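
A rough sketch of that comparison in Python, assuming a tabular dataset with a hypothetical condition_id column marking the 5 replicates of each test and a target column y; the model, metric, and file name are placeholders, not something from the thread:

```python
# Sketch of the suggested comparison; column names ("condition_id", "y"),
# the model, the metric, and the file name are placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_csv("experiments.csv")              # one row per replicate
features = [c for c in df.columns if c not in ("condition_id", "y")]

# Split by condition so all 5 replicates of a test land on the same side.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, val_idx = next(gss.split(df, groups=df["condition_id"]))
train, val = df.iloc[train_idx], df.iloc[val_idx]

# Model A: trained on every replicate.
model_a = RandomForestRegressor(random_state=0)
model_a.fit(train[features], train["y"])

# Model B: trained on per-condition means.
train_means = train.groupby("condition_id", as_index=False).mean()
model_b = RandomForestRegressor(random_state=0)
model_b.fit(train_means[features], train_means["y"])

# Evaluate both against the same validation targets (per-condition means here;
# see the EDIT below about how this choice itself can favor one model).
val_means = val.groupby("condition_id", as_index=False).mean()
for name, model in [("all replicates", model_a), ("pre-computed means", model_b)]:
    mae = mean_absolute_error(val_means["y"], model.predict(val_means[features]))
    print(f"{name}: validation MAE = {mae:.4f}")
```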

EDIT: it just occurred to me that validation will be somewhat tricky. Does OP allow (non-overlapping) duplicates to remain in the validation set, or does he compute the averages for the targets? He can’t evaluate the two models on differently prepared validation sets, yet whichever single preparation he picks will tend to favor one of the models.

I think the answer to the question depends on how data about future targets will be collected. Is OP going to perform repeated experiments in the future and take repeated measurements of the outcome, or is he only going to perform unique sets of experiments? Whatever the answer, the important thing is for OP to consider the future use-case and process his validation set in a way that most closely mimics that environment (e.g., repeated measurements vs. single measurements).

Sorry if this isn’t very clear; I only had a few minutes to type it out.

105

Mefaso t1_j1851kc wrote

>Set aside a validation set

Important: ensure that replicates of the same test are not split between the validation and training data.
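
A quick way to enforce that with scikit-learn is a group-aware splitter keyed on a hypothetical condition_id label, for example:

```python
# Sketch: keep all replicates of one condition in the same fold.
# The toy data and "condition_id" grouping are illustrative assumptions.
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                    # 10 conditions x 5 replicates
y = rng.normal(size=50)
condition_id = np.repeat(np.arange(10), 5)

gkf = GroupKFold(n_splits=5)
for train_idx, val_idx in gkf.split(X, y, groups=condition_id):
    # No condition ever appears on both sides of the split.
    assert not set(condition_id[train_idx]) & set(condition_id[val_idx])
```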

35

dimsycamore t1_j17bjkd wrote

This is honestly a better idea than any intuition I can give you.

Also, anecdotally, I have encountered situations where one batch of replicates was of much lower quality or somehow different from the rest. We ended up dropping those, after first identifying them with an empirical setup similar to the one described above.
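
For illustration, one simple (hypothetical) first pass for spotting suspect replicates is to look at how far each one sits from its condition’s mean response:

```python
# Sketch: flag replicates that deviate strongly from their condition's mean.
# Column names ("condition_id", "y") and the 3-sigma cutoff are assumptions.
import pandas as pd

df = pd.read_csv("experiments.csv")
group = df.groupby("condition_id")["y"]
z = (df["y"] - group.transform("mean")) / group.transform("std")
print(df[z.abs() > 3])   # candidates for a closer manual look
```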

12

Just_CurioussSss t1_j18oqkh wrote

I agree. Using all of them is redundant and would take too much space. It's better to compare performance on the validation set.

1

MrAcurite t1_j17amh6 wrote

Just make sure the duplicates don’t cause bleeding between the training and test sets.

24

dimsycamore t1_j178fqp wrote

I would recommend using all of the replicates. The model should learn the expectation sans any zero-mean noise that varies between them. I’m basing this on a hand-wavy interpretation of some results from the original noise2noise paper and more recent work on SSL. You can even treat each replicate as an "augmentation" of your ground-truth mean and use principles of SSL to enforce consistency between the replicates.
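
A rough sketch of that consistency idea (PyTorch; the model, loss weighting, and grouping variable are assumptions, not something taken from the noise2noise paper):

```python
# Sketch: supervised loss on every replicate plus a consistency term that pulls
# predictions for replicates of the same condition toward each other.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
lambda_consistency = 0.1   # assumed weighting, would need tuning

def train_step(x, y, condition_id):
    """x: (N, 3) inputs, y: (N, 1) noisy targets, condition_id: (N,) group labels."""
    pred = model(x)
    supervised = nn.functional.mse_loss(pred, y)

    # Consistency: each prediction should agree with its condition's mean prediction.
    consistency = 0.0
    for cid in condition_id.unique():
        group_pred = pred[condition_id == cid]
        consistency = consistency + ((group_pred - group_pred.mean()) ** 2).mean()
    consistency = consistency / condition_id.unique().numel()

    loss = supervised + lambda_consistency * consistency
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```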

11

Eresbonitaguey t1_j179g29 wrote

Agreed. It’s pretty common to augment your data so that you have n different inputs based on the same original input. As long as these augmented values aren’t present in your test set, you should be fine.

4

The_Bundaberg_Joey t1_j17uvsj wrote

Sounds like the type of problem a Gaussian process model would be well suited to, as it accounts for a level of noise within the training data in the first place.

Its usage, however, is very dependent on the amount and type of data you’re working with, so I think u/gBoostedMachinations has the best approach to this problem without knowing more about your data.
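
For what it’s worth, a minimal sketch of that idea with scikit-learn, fitting a GP on all replicates and letting a WhiteKernel estimate the replicate-to-replicate noise (kernel choice and toy data are assumptions):

```python
# Sketch: GP regression on every replicate; the WhiteKernel term models the
# observation noise between replicates of the same condition.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = np.repeat(np.linspace(0, 1, 10), 5).reshape(-1, 1)   # 10 conditions x 5 replicates
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.1, X.shape[0])

kernel = RBF(length_scale=0.2) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

mean, std = gp.predict(np.array([[0.25]]), return_std=True)
print(mean, std)   # predictive mean and uncertainty at an unseen condition
```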

5

JimmyTheCrossEyedDog t1_j17e3k4 wrote

Great responses so far. One other thing to consider is the purpose of this model. Will it be used to make inferences on out-of-sample data? If so, you should make sure that the form of data you're training on is representative of the form of data you'll have operationally.

In other words, will the out-of-sample data also have five replicates like your training data? If not, then you should train using all five replicates, not an average. Otherwise, your out-of-sample data will contain variance that your training process averaged away, and you won't be able to apply that same averaging to the new data.

If this model isn't for prediction, you have more flexibility.

4

new_name_who_dis_ t1_j190nqx wrote

Having noisy targets is a known augmentation technique, so I don't think it's a problem.

3

edunuke t1_j1sku2e wrote

Check out repeated-measures datasets and hierarchical datasets. Might be useful.
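
For reference, the classic repeated-measures treatment is a mixed-effects model with a random effect per test; a hypothetical statsmodels sketch (column names are made up):

```python
# Sketch: mixed-effects model with a random intercept per condition, the usual
# way repeated measures are handled in classical stats. Column names are assumptions.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("experiments.csv")      # one row per replicate
model = smf.mixedlm("y ~ temperature + pressure", df, groups=df["condition_id"])
result = model.fit()
print(result.summary())
```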

2

Seankala t1_j17r2we wrote

Also make sure to change your random seed for each run and calculate the mean and variance of each run’s performance on the test set. As a principle, you should always set aside a test set that you never touch except to measure final performance.
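
Something along these lines, where the model and toy data stand in for OP’s actual pipeline:

```python
# Sketch: report mean +/- standard deviation of test performance across seeds.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(100, 3)), rng.normal(size=100)   # placeholder data
X_test, y_test = rng.normal(size=(20, 3)), rng.normal(size=20)

scores = []
for seed in range(5):
    model = RandomForestRegressor(random_state=seed).fit(X_train, y_train)
    scores.append(mean_absolute_error(y_test, model.predict(X_test)))

print(f"test MAE: {np.mean(scores):.4f} +/- {np.std(scores):.4f}")
```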

1

bitemenow999 t1_j19lvu5 wrote

I would use all 5 data points and put them through a VAE-type architecture with a prediction head.
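
A rough sketch of what that could look like (PyTorch; dimensions, architecture, and loss weights are assumptions, not the commenter’s design):

```python
# Sketch: small VAE whose latent code also feeds a regression head.
# Total loss = reconstruction + KL + prediction error (weights are assumptions).
import torch
import torch.nn as nn

class VAEWithHead(nn.Module):
    def __init__(self, in_dim=3, latent_dim=4, hidden=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, in_dim))
        self.head = nn.Linear(latent_dim, 1)   # prediction head for the target

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
        return self.decoder(z), self.head(z), mu, logvar

def loss_fn(x, y, recon, pred, mu, logvar, beta=1.0, gamma=1.0):
    recon_loss = nn.functional.mse_loss(recon, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    pred_loss = nn.functional.mse_loss(pred, y)
    return recon_loss + beta * kl + gamma * pred_loss
```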

1

rlvsdlvsml t1_j17frtb wrote

Ugh, if you have 5-50 test cases you need stats, not ML, and you absolutely should not be using duplicates. You should probably use a classic stats model with groups, like a GLM or a mixed-effects model (GLME).

−1