Submitted by HFSeven t3_10a8a14 in MachineLearning

Hi! So I am looking into the literature on determining the usefulness of the samples/datasets used to train an ML model. Let's say a DNN was trained with datasets A, B and C. After training, is there a way to quantify which of the partial training datasets contributed most to the useful learning done by the model? A brute-force strategy would be to remove samples, retrain, and see how the model performs, but of course that is not viable!
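For reference, a minimal sketch of that brute-force leave-one-dataset-out baseline might look like the following; `train_model` and `evaluate` are placeholders for your own training and held-out evaluation code, not any specific library API:

```python
def leave_one_out_contributions(datasets, train_model, evaluate):
    """Estimate each partial dataset's marginal contribution by retraining without it.

    datasets: dict mapping a name (e.g. "A") to its training data
    train_model: callable taking a list of datasets and returning a trained model
    evaluate: callable taking a model and returning a scalar held-out score
    """
    # Score when training on everything
    baseline = evaluate(train_model(list(datasets.values())))

    contributions = {}
    for name in datasets:
        rest = [d for k, d in datasets.items() if k != name]
        # Drop in score when `name` is removed ~ its marginal contribution
        contributions[name] = baseline - evaluate(train_model(rest))
    return contributions
```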

5

Comments

HateRedditCantQuitit t1_j42ogtm wrote

Not exactly what you’re asking, but active learning has a lot to say on data point usefulness.
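For instance, the simplest active-learning heuristic is uncertainty sampling: rank unlabeled points by how unsure the current model is about them. A minimal sketch, assuming a scikit-learn-style classifier with `predict_proba`:

```python
import numpy as np

def rank_by_uncertainty(model, X_unlabeled):
    """Rank unlabeled points by predictive entropy (higher = more informative)."""
    probs = model.predict_proba(X_unlabeled)
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(-entropy)  # indices, most uncertain first
```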

3

Mysterious_Tekro t1_j42tidl wrote

If you search for "training dataset quality" and "data filtering", those are other terms for usefulness.

2

PassingTumbleweed t1_j432cw4 wrote

You need to clarify what you mean by "useful learning". Performance on some downstream task? You may be interested in meta-learning.

2

suflaj t1_j43eanz wrote

Well, it depends on what usefulness is.

If you can prove that all of your samples belong to the same distribution, then simply checking which samples have the greatest gradient norm gives a measure of how useful they are to the model. Another approach is to look at how much each sample's contribution improves performance on the other samples, but then your dataset becomes a dependent variable.

But obviously this is dependent on the current weights, the loss function and various other biases. The gradient norm is proportional to the error, so the samples for which the model predicts the most erroneous result will end up being the most useful, given a suitable learning rate.
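A rough PyTorch sketch of the gradient-norm idea, assuming `samples` yields (input, target) tensor pairs for a classification model; it is slow (one backward pass per sample) but simple:

```python
import torch

def per_sample_grad_norms(model, loss_fn, samples):
    """Per-sample gradient norm of the loss w.r.t. the model parameters.
    A larger norm means the sample currently induces a larger parameter update."""
    norms = []
    for x, y in samples:
        model.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))  # batch of one
        loss.backward()
        sq = sum((p.grad ** 2).sum() for p in model.parameters() if p.grad is not None)
        norms.append(float(torch.sqrt(sq)))
    return norms
```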

1

jonas__m t1_j45cym1 wrote

Data Shapley is one option but can be computationally expensive. If you’re looking for practical code to try running on real data, here are some tutorials to find the least useful data:

https://docs.cleanlab.ai/stable/tutorials/image.html

https://docs.cleanlab.ai/stable/tutorials/outliers.html

as well as the MOST useful data to label next (or collect an extra label for):

https://github.com/cleanlab/examples/blob/master/active_learning_multiannotator/active_learning.ipynb
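Roughly, the label-quality part of those tutorials looks something like the sketch below (check the linked docs for the exact current API); `labels` are your given labels and `pred_probs` are out-of-sample predicted class probabilities, e.g. from cross-validation, with toy stand-ins here:

```python
import numpy as np
from cleanlab.filter import find_label_issues
from cleanlab.rank import get_label_quality_scores

# Toy stand-ins: replace with your real labels and out-of-sample predicted probabilities
labels = np.random.randint(0, 3, size=500)
pred_probs = np.random.dirichlet(np.ones(3), size=500)

# Indices of likely label issues, most suspicious first
issue_indices = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)

# Per-sample label-quality scores; the lowest-scoring samples are the least useful
quality = get_label_quality_scores(labels=labels, pred_probs=pred_probs)
least_useful = np.argsort(quality)[:20]
```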

2

HFSeven OP t1_j45jft0 wrote

Interesting! Will have a look at it! Thanks

1

NiconiusX t1_j46a9v0 wrote

There is the idea of core sets in continual learning. Research on how to construct such core sets could be of interest to you.
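One common construction (not tied to any particular continual-learning paper) is greedy k-center selection on feature embeddings, i.e. repeatedly adding the point farthest from the current core set. A minimal numpy sketch:

```python
import numpy as np

def k_center_greedy(embeddings, k):
    """Greedy k-center core-set selection over an (N, D) array of features."""
    n = embeddings.shape[0]
    selected = [np.random.randint(n)]  # arbitrary starting point
    dists = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(dists))  # farthest point from the current core set
        selected.append(idx)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[idx], axis=1))
    return selected
```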

2