DarwinianThunk t1_ivu6166 wrote

You could try weighted sampling or weighted loss function based on the "reliability" of the data.

6

DreamyPen OP t1_ivve7nt wrote

Could this be done by adding a feature column "weight" with a value ranging from 0 to 1, where values closer to 1 mean more reliable?

2

ObjectManagerManager t1_ivwdd0k wrote

No. Your model can do whatever it wants with input features. It's not going to just "choose" to treat this new column as a loss weight. Loss weighting requires a specific computation.

If you're training a neural network or something similar, you'd normally average the loss across every example in a batch, and then you'd backpropagate that averaged loss. With loss weighting, you compute a weighted average loss across the batch. In this case, you'd assign larger weights to the more "reliable" data points.
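To make the weighted average concrete, here is a minimal sketch in plain Python. The numbers, the `reliability` list, and the helper name `weighted_mean_loss` are made up for illustration; a real framework would do this on tensors, but the arithmetic is the same.

```python
def weighted_mean_loss(per_example_losses, weights):
    """Weighted average of per-example losses; weights need not sum to 1."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must have a positive sum")
    weighted_sum = sum(w * l for w, l in zip(weights, per_example_losses))
    return weighted_sum / total_weight


# Squared errors for a batch of three examples (made-up numbers).
predictions = [2.0, 0.0, 5.0]
targets = [1.0, 0.0, 2.0]
losses = [(p - t) ** 2 for p, t in zip(predictions, targets)]

# Hypothetical reliability scores: the third example is the least trusted.
reliability = [1.0, 1.0, 0.1]

plain = sum(losses) / len(losses)                   # ordinary batch mean
weighted = weighted_mean_loss(losses, reliability)  # unreliable example counts less
```

Here the third example has by far the largest error, so down-weighting it pulls the batch loss well below the plain mean. The value you'd backpropagate is `weighted` instead of `plain`.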

Sample weighting is different, and it can be done with virtually any ML model. It involves weighting the likelihood of sampling each data point. For "full-batch models", you can generate bootstrap samples with the weighted sampling. For "batched" models (e.g., neural networks trained via batched gradient descent), you can use weighted sampling for each batch.
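A rough sketch of weighted sampling for a batch, using only the standard library. The dataset, the `reliability` scores, and the `sample_batch` helper are hypothetical; the only real API used is `random.Random.choices`, which draws with replacement with probability proportional to the given weights.

```python
import random
from collections import Counter


def sample_batch(dataset, weights, batch_size, rng):
    """Draw a batch with replacement; selection probability is proportional to weight."""
    return rng.choices(dataset, weights=weights, k=batch_size)


dataset = ["a", "b", "c", "d"]        # stand-ins for training examples
reliability = [1.0, 1.0, 0.5, 0.1]    # hypothetical reliability scores
rng = random.Random(0)                # seeded so runs are repeatable

# One training batch: reliable examples show up more often on average.
batch = sample_batch(dataset, reliability, batch_size=8, rng=rng)

# Over many draws, the frequencies track the weights.
counts = Counter(sample_batch(dataset, reliability, 10_000, rng))
```

The same helper covers the bootstrap case: draw a sample as large as the dataset and fit the full-batch model on that.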

Most modern ML packages have built-in interfaces for both of these, so there's no need to reinvent the wheel here.

2

Erosis t1_ivwar5d wrote

Yes, this is 'instance' or 'sample' weighting. You can choose to apply this weight to the loss or the gradients before your parameter update.
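For a plain gradient step the two options give the same update, because the gradient of a weighted sum is the weighted sum of the gradients. A small check on a one-parameter model `y = theta * x` with squared error (all data, weights, and function names here are made up for the illustration):

```python
def loss_i(theta, x, y):
    """Squared error of the one-parameter model on a single example."""
    return (theta * x - y) ** 2


def grad_i(theta, x, y):
    """Analytic derivative of loss_i with respect to theta."""
    return 2.0 * (theta * x - y) * x


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.5, 9.0]
ws = [1.0, 0.5, 0.1]   # hypothetical per-example weights
theta = 1.5


def weighted_loss(t):
    """Weighted average loss over the batch at parameter value t."""
    return sum(w * loss_i(t, x, y) for w, x, y in zip(ws, xs, ys)) / sum(ws)


# Option A: weight the loss, then differentiate (central finite difference here).
eps = 1e-6
grad_of_weighted_loss = (weighted_loss(theta + eps) - weighted_loss(theta - eps)) / (2 * eps)

# Option B: compute per-example gradients, then weight and combine them.
weighted_grads = sum(w * grad_i(theta, x, y) for w, x, y in zip(ws, xs, ys)) / sum(ws)
```

The two values agree up to floating-point error. They can diverge if something nonlinear is applied per example before combining, such as per-example gradient clipping.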

1

Brudaks t1_ivy46bv wrote

Sure, as long as you make a direct connection from that feature column to your loss calculation (i.e. multiply the calculated error metric by it) instead of solely using it as one of the inputs.

1