Submitted by DreamyPen t3_zsbivc in MachineLearning
gBoostedMachinations t1_j17a22i wrote
- Set aside a validation set
- Use the rest of the data to train two models: One using the duplicates and one using the pre-computed means.
- Compare performance on the validation set.
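Something like this, if it helps make the comparison concrete (a rough sketch only; the file name, column names, and model are placeholders, not OP's actual setup):

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Placeholder data layout: feature columns plus a "target" column, where
# repeated experiments show up as duplicate feature rows, each with its
# own target measurement.
df = pd.read_csv("experiments.csv")  # hypothetical file
feature_cols = [c for c in df.columns if c != "target"]

# Hold out a validation set (see the group-split caveat further down).
train_df, val_df = train_test_split(df, test_size=0.2, random_state=0)

# Model A: keep the duplicates as-is.
model_a = GradientBoostingRegressor(random_state=0)
model_a.fit(train_df[feature_cols], train_df["target"])

# Model B: collapse each set of duplicates to its mean target first.
# (Grouping on raw feature values assumes exact duplicates, e.g. no float noise.)
train_means = train_df.groupby(feature_cols, as_index=False)["target"].mean()
model_b = GradientBoostingRegressor(random_state=0)
model_b.fit(train_means[feature_cols], train_means["target"])

# Same validation set for both.
for name, model in [("duplicates", model_a), ("means", model_b)]:
    preds = model.predict(val_df[feature_cols])
    print(name, mean_absolute_error(val_df["target"], preds))
```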
Don’t put much weight on other people’s intuitions about these kinds of questions. Just test it. Your question is an empirical one, so just do the experiment. I can’t tell you how many times a colleague has told me that something I was trying wasn’t going to work, only to turn out dead wrong when I tested it anyway. Oh man, do I love it when that happens.
EDIT: it just occurred to me that validation will be somewhat tricky. Does OP allow (non-overlapping) duplicates to remain in the validation set, or does he average the targets there too? He can’t process the validation set differently for each model, yet whichever single method he picks will tend to favor the model that was trained the same way.
I think the answer depends on how data about future targets will be collected. Is OP going to perform repeated experiments in the future and take repeated measurements of the outcome, or only unique sets of experiments? Whatever the answer, the important thing is for OP to consider the future use-case and process his validation set in a way that most closely mimics that environment (e.g., repeated measurements vs single measurements).
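To make those two options concrete (continuing the placeholder names from the sketch above, and again only as a sketch):

```python
from sklearn.metrics import mean_absolute_error

def score(model, val_df, feature_cols, collapse_to_means):
    """Score against raw repeated measurements, or against per-experiment means,
    depending on which future use-case the validation set should mimic."""
    if collapse_to_means:
        # Future data = one (averaged) outcome per unique experiment.
        val_df = val_df.groupby(feature_cols, as_index=False)["target"].mean()
    # Otherwise future data = repeated measurements, so score the raw rows.
    return mean_absolute_error(val_df["target"], model.predict(val_df[feature_cols]))
```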
Sorry if this isn’t very clear; I only had a few minutes to type it out.
Mefaso t1_j1851kc wrote
>Set aside a validation set
Important: Ensure the duplicates are not shared between validation and train data
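One way to do that with scikit-learn, reusing the placeholder `df` and `feature_cols` from the sketch above (a sketch that assumes replicates are exact duplicates of the feature values):

```python
from sklearn.model_selection import GroupShuffleSplit

# Give every replicate of the same experiment one group id (here keyed on the
# feature values themselves), then split by group so no experiment lands in
# both train and validation.
groups = df[feature_cols].astype(str).agg("|".join, axis=1)
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, val_idx = next(splitter.split(df, groups=groups))
train_df, val_df = df.iloc[train_idx], df.iloc[val_idx]
```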
dimsycamore t1_j17bjkd wrote
This is honestly a better idea than any intuition I can give you.
Also, anecdotally, I’ve encountered situations where one batch of replicates was either much lower quality or somehow different from the rest; we found those batches with an empirical setup similar to the one described above and ended up dropping them.
Just_CurioussSss t1_j18oqkh wrote
I agree. Using all of them is redundant and takes up too much space. It's better to compare performance on the validation set.
DyingGradient t1_j18jnbp wrote
lol what? wtf is this advice?
gBoostedMachinations t1_j18t1p3 wrote
What’s wrong with it?