Submitted by DreamyPen t3_yrjjql in MachineLearning

Hi all,

I will be careful not to use the term "confidence" here, to keep the goal clear and avoid confusion with confidence intervals or prediction confidence.

I have two sources of data. One is very reliable (experimental); the other is less so, but still carries useful information.

Is it possible to feed the entirety of the data to an algorithm while specifying a certain "trust" or "reliability" for each data source? The goal is to put more weight on the reliable source while still picking up some hidden patterns from the second source.

12

Comments

ResponsibilityNo7189 t1_ivtxak2 wrote

You can maybe use the less reliable data in pretraining, then only use your trusted data for finetuning.
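
For instance, a minimal PyTorch sketch of that two-stage recipe; the toy tensors, layer sizes, and learning rates are placeholders, not anything from this thread:

```python
import torch

# Toy stand-ins for the two sources (replace with your real data).
X_noisy, y_noisy = torch.randn(500, 16), torch.randn(500)
X_clean, y_clean = torch.randn(50, 16), torch.randn(50)

model = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
loss_fn = torch.nn.MSELoss()

def train(X, y, lr, epochs):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X).squeeze(-1), y)
        loss.backward()
        opt.step()

train(X_noisy, y_noisy, lr=1e-3, epochs=200)   # stage 1: pretrain on the less reliable source
train(X_clean, y_clean, lr=1e-4, epochs=50)    # stage 2: fine-tune on the trusted data
```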

19

emad_eldeen t1_ivw8m77 wrote

Also, maybe use both sources for pretraining with a self-supervised method and fine-tune only with the reliable one. Check time-series SSL methods such as TS-TCC or TS2VEC for pretraining.

3

Erosis t1_ivtxuwn wrote

Are the outputs of your model binary? You could set the targets of your uncertain data points somewhere closer to the middle rather than exactly at 0 and 1.

If you are training in batches, you could reduce the size of the gradient updates coming from the unreliable data.
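
For the binary case, a small sketch of what softened targets could look like; the 0.8/0.2 values are an arbitrary illustration:

```python
import torch

# Hard binary targets for the reliable source; softened targets for the unreliable one.
y_reliable   = torch.tensor([1.0, 0.0, 1.0])
y_unreliable = torch.tensor([1.0, 1.0, 0.0]) * 0.6 + 0.2   # maps 1 -> 0.8, 0 -> 0.2 (arbitrary softening)

loss_fn = torch.nn.BCELoss()
preds = torch.tensor([0.7, 0.4, 0.6])
loss_soft = loss_fn(preds, y_unreliable)   # soft targets pull the model less strongly toward 0/1
```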

7

DreamyPen OP t1_ivty1ow wrote

Unfortunately not; I'm predicting material properties on a continuous scale.

1

Erosis t1_ivtyokc wrote

You could use a custom training loop where you down-weight the gradients of the unreliable samples before you do parameter updates.
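
A rough sketch of such a loop in PyTorch, assuming a per-sample weight of 1.0 for reliable rows and an arbitrary 0.3 for unreliable ones:

```python
import torch

# Toy data: 150 reliable rows followed by 50 unreliable rows (replace with your own).
X, y = torch.randn(200, 16), torch.randn(200)
w = torch.where(torch.arange(200) < 150, torch.tensor(1.0), torch.tensor(0.3))

model = torch.nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = torch.nn.MSELoss(reduction="none")        # keep per-sample losses

for _ in range(100):
    opt.zero_grad()
    per_sample = loss_fn(model(X).squeeze(-1), y)
    loss = (w * per_sample).mean()                   # down-weighted samples contribute smaller gradients
    loss.backward()
    opt.step()
```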

5

DreamyPen OP t1_ivu0v9o wrote

Thank you for your comment. I am not sure what that custom loop would look like for an ensemble method (trees / gradient boosting), or how to proceed with the down-weighting. Is it a documented technique I can read more about, or more of a workaround you are thinking of?

1

Erosis t1_ivu2gnv wrote

Trees complicate it a bit more. I've never done it for something like that, but check the instance-weight input to xgboost as an example: in the xgboost fit function, there is a sample_weight argument.
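
For example, with xgboost's scikit-learn interface (toy data; the 0.3 weight for the unreliable rows is an arbitrary choice):

```python
import numpy as np
import xgboost as xgb

# Toy stand-ins: 100 reliable rows followed by 400 unreliable rows (replace with your data).
X = np.random.randn(500, 8)
y = np.random.randn(500)
weights = np.concatenate([np.full(100, 1.0), np.full(400, 0.3)])

model = xgb.XGBRegressor(n_estimators=300, max_depth=4)
model.fit(X, y, sample_weight=weights)
```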

I know that tensorflow has a new-ish library for trees. You could manually write a gradient descent loop with modified minibatch gradients there, potentially.

1

RSchaeffer t1_ivuizy4 wrote

The topic you're looking for is "weak supervision."

7

DarwinianThunk t1_ivu6166 wrote

You could try weighted sampling or weighted loss function based on the "reliability" of the data.

6

DreamyPen OP t1_ivve7nt wrote

Could this be done by adding a feature column "weight" with values ranging from 0 to 1, where values closer to 1 mean more reliable?

2

ObjectManagerManager t1_ivwdd0k wrote

No. Your model can do whatever it wants with input features. It's not going to just "choose" to treat this new column as a loss weight. Loss weighting requires a specific computation.

If you're training a neural network or something similar, you'd normally average the loss across every example in a batch, and then you'd backpropagate that averaged loss. With loss weighting, you compute a weighted average loss across the batch. In this case, you'd assign larger weights to the more "reliable" data points.

Sample weighting is different, and it can be done with virtually any ML model. It involves weighting the likelihood of sampling each data point. For "full-batch models", you can generate bootstrap samples with the weighted sampling. For "batched" models (e.g., neural networks trained via batched gradient descent), you can use weighted sampling for each batch.
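
A small NumPy sketch of the sampling-based variant; the 3:1 preference for reliable rows is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = np.random.randn(500, 8), np.random.randn(500)          # toy data: 100 reliable + 400 unreliable rows
p = np.concatenate([np.full(100, 3.0), np.full(400, 1.0)])    # reliable rows 3x as likely to be drawn
p = p / p.sum()

# "Full-batch" models: fit on a weighted bootstrap resample.
idx = rng.choice(len(X), size=len(X), replace=True, p=p)
X_boot, y_boot = X[idx], y[idx]

# "Batched" models: draw each minibatch with the same weighted probabilities.
batch = rng.choice(len(X), size=64, replace=True, p=p)
```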

Most modern ML packages have built-in interfaces for both of these, so there's no need to reinvent the wheel here.

2

Erosis t1_ivwar5d wrote

Yes, this is 'instance' or 'sample' weighting. You can choose to apply this weight to the loss or the gradients before your parameter update.

1

Brudaks t1_ivy46bv wrote

Sure, as long as you make a direct connection from that feature column to your loss calculation (i.e. multiply the calculated error metric by it) instead of solely using it as one of the inputs.

1

LurkAroundLurkAround t1_ivup7pc wrote

By far the easiest thing to do is to feed in the data source as a feature. This should allow the model to generalize across datasets as much as possible, while accounting for different inherent properties of the data
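
Concretely, that can be as simple as appending a 0/1 source-indicator column before training; a sketch with toy arrays standing in for the two sources:

```python
import numpy as np

# Toy stand-ins for the two sources (replace with your arrays).
X_exp, y_exp = np.random.randn(100, 8), np.random.randn(100)
X_sim, y_sim = np.random.randn(400, 8), np.random.randn(400)

src = np.concatenate([np.ones(len(X_exp)), np.zeros(len(X_sim))])   # 1 = experimental, 0 = physics model
X_all = np.column_stack([np.vstack([X_exp, X_sim]), src])
y_all = np.concatenate([y_exp, y_sim])
# Any regressor trained on X_all can now condition on the source indicator.
```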

5

ObjectManagerManager t1_ivwf1f0 wrote

Alternatively, feed the data source as an output. i.e., have your model output two values. For data sourced from dataset A, minimize loss against the first output. For data sourced from dataset B, minimize loss against the second output.

I don't remember who, but someone wrote a thesis on how it often works better in practice to incorporate additional / auxiliary information in the form of outputs rather than inputs. It's also a very clean solution, since you can usually just remove the unnecessary output heads after training, which might decrease your model size for inference (albeit by a small amount, unless you have a lot of auxiliary information).
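
A rough PyTorch sketch of the two-head idea, with the loss masked by each sample's origin (names and sizes are placeholders):

```python
import torch

class TwoHeadNet(torch.nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.backbone = torch.nn.Sequential(torch.nn.Linear(n_features, 64), torch.nn.ReLU())
        self.head_exp = torch.nn.Linear(64, 1)   # head supervised by experimental data
        self.head_sim = torch.nn.Linear(64, 1)   # head supervised by physics-model data

    def forward(self, x):
        h = self.backbone(x)
        return self.head_exp(h).squeeze(-1), self.head_sim(h).squeeze(-1)

def loss_fn(out_exp, out_sim, y, is_exp):
    # Each sample only contributes loss through the head that matches its source.
    return torch.where(is_exp, (out_exp - y) ** 2, (out_sim - y) ** 2).mean()

# Usage sketch with toy data:
net = TwoHeadNet(n_features=16)
x, y = torch.randn(8, 16), torch.randn(8)
is_exp = torch.tensor([True, True, False, False, False, False, False, False])
out_exp, out_sim = net(x)
loss = loss_fn(out_exp, out_sim, y, is_exp)
```

At inference time you would keep only the head trained on the reliable source, which is the "remove the unnecessary output heads" step mentioned above.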

1

_Arsenie_Boca_ t1_ivufzxw wrote

As others have pointed out, sample weights could be used. Another option would be to smooth the labels of the unreliable source.

3

DreamyPen OP t1_ivveamf wrote

Can I ask what you mean by "smoothing the labels"?

1

_Arsenie_Boca_ t1_ivvfzfs wrote

In classification you usually have a single correct class, a hard label. However, you might also have soft labels, where multiple classes have non-zero target probabilities. Label smoothing is a technique that artificially introduces those soft labels from hard labels, i.e. if your hard label was [0 0 1 0] it might now be [0.05 0.05 0.85 0.05]. You could use the strength of smoothing to represent uncertainty.
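
For reference, the smoothing itself is a one-liner; eps = 0.2 here is an arbitrary strength:

```python
import numpy as np

def smooth(one_hot, eps=0.2):
    # Spread eps of the probability mass uniformly across all classes.
    k = one_hot.shape[-1]
    return one_hot * (1 - eps) + eps / k

smooth(np.array([0., 0., 1., 0.]))   # -> [0.05, 0.05, 0.85, 0.05]
```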

1

DreamyPen OP t1_ivvgkqs wrote

Thank you for the clarification. I'm dealing with a regression problem, however, so I'm not sure it's applicable in my case.

1

malenkydroog t1_ivvrhfs wrote

Look into the area called sensor fusion for relevant algorithms and tools.

There are multiple ways people do this. For example, you could use something like factor analysis, where the factor loadings onto the latent "true" signal/factor are fixed based on what you (presumably) know about the empirical reliability/error variance and (potentially) bias in each observed signal. Then you do your modeling with the inferred latent "true" signal. See the second example here for that sort of approach in the context of a Gaussian process model.
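
The simplest member of that family is precision-weighted (inverse-variance) fusion of two noisy estimates of the same quantity; a toy sketch with made-up variances, not the full factor-analysis/GP setup described above:

```python
# Two estimates of the same property and their assumed error variances.
y_exp, var_exp = 10.2, 0.1    # experimental measurement: low variance
y_sim, var_sim = 11.0, 1.0    # physics-model prediction: high variance

w_exp, w_sim = 1 / var_exp, 1 / var_sim            # precisions as weights
fused = (w_exp * y_exp + w_sim * y_sim) / (w_exp + w_sim)
fused_var = 1 / (w_exp + w_sim)                    # fused estimate is more precise than either alone
```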

2

Far-Butterscotch-436 t1_ivuhdm4 wrote

Easy: use all the training data and assign smaller label weights to the uncertain data. But keep in mind, if the data is uncertain, how much can you trust it? If you say a label is uncertain, is there a probability that the label is incorrect? How will you measure performance on your uncertain data versus your certain data? Boosting algorithms will certainly overfit; it will be difficult.

1

Ulfgardleo t1_ivvfwzf wrote

It is not so easy. If we are talking about noise in the input patterns and not in the labels, then the noisy inputs can be catastrophic to model performance. In that case the model needs to know which data source the input came from.

1

todeedee t1_ivuohkf wrote

How do you know which data source is more reliable? Can you "calibrate" your model with experiments that have determined this?

1

Ulfgardleo t1_ivvfosf wrote

Okay, there are a few questions:

  1. What is unreliable: the inputs or the labels? Is your problem even supervised?
  2. What do you want to learn?
  3. Is it possible to quantify the reliability of each source? Is it just higher variance, or also bias?
  4. Do there exist cases for which you have both reliable and unreliable data?
  5. What is the data you finally predict on: the reliable or the unreliable data?
1

DreamyPen OP t1_ivvhmu0 wrote

  1. There are two sources of data. One consists of experimental measurements with a small amount of scatter, so it is considered highly reliable. The second source is data predicted using physics-based models; these are sometimes quite accurate, sometimes a bit off. So it is indeed a supervised problem, with unreliable outputs, not labels.
  2. I'm learning material properties. Ideally the model would learn from the experimental data (ground truth) while capturing the trends from the synthetic, model-based data.
  3. The experimental data is always considered highly reliable. The model-based data can be accurate or not, so a fixed reliability score should be suitable, since we don't know with certainty whether the model's prediction is reliable for a given input.
  4. Answered previously.
  5. We are mainly interested in predicting material properties that are close to the experimental (reliable) data, while still picking up some useful signal from the less accurate physics-based data.

I hope this helps clarify my objectives. Thank you.

1

Ulfgardleo t1_ivvjvsd wrote

  1. You said "unreliable outputs". Did you mean inputs? If you truly meant outputs (i.e., the material properties that you want to predict from some so-far-undefined inputs), then this is what in ML is called the "label".
  2. Okay, I have the same issue here. Typically, ground truth would be what we call the label, but I can see that you would distinguish between simulated and measured ground-truth data.
  3. "Model" here is the physics-based model, not the ML model, right?
  4. I don't see it answered. Let me ask it explicitly: is there any experimental measurement for which you also have the physics-model output?
  5. You lost me here.
1

DreamyPen OP t1_ivvm1k0 wrote

  1. Yes, I did mean outputs/targets. The features are always known; they correspond to testing conditions (a certain temperature, a certain processing speed, etc.). Given these testing conditions (inputs / labels), can we predict the material properties (outputs/targets)? The experimental measurements are very reliable.

  2. The physics-based model can always output a prediction for any given labels (testing conditions), but it is not always reliable. We would still like to include these predictions because they let us augment the small experimental data set, and, oftentimes, they are quite a good approximation of the ground truth. This also answers 4: since the physics-based model can always make predictions, in some instances we will have both reliable and unreliable data for the same conditions.

  3. Correct! :)

  4. We do indeed.

  5. Hopefully my response to 1. clarified it.

Let me know if the goal is clearer, and thank you for your help.

1

Ulfgardleo t1_ivxodj0 wrote

  1. Okay, you completely confuse everyone in the ML community when you call inputs "labels". Let's stick with inputs/outputs.

  2. This is good, because it allows you to estimate some crude measure for the quality of the physics model.

So, label noise is a broad field. I am mostly knowledgeable in the classification setting, where label noise has different effects. Moreover, you are not in the standard noisy label setting, because the noise is not independent of the label, so just using weights will be difficult. Similarly, if you have more than one output to predict, a single weight is difficult to compute.

The standard way to derive all of these methods is by noting that the MSE can be derived as the negative log-likelihood of a normal distribution p(y|f), where y is the ground truth, f is the mean, and the variance is some fixed value. For the MSE, the value of the variance does not matter as long as it remains fixed, but with fairly little effort you can show that as soon as you give samples individual variances, this amounts to weighting the MSE.
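
Spelled out, the per-sample Gaussian negative log-likelihood is:

```latex
-\log p(y_i \mid f_i, \sigma_i^2) = \frac{(y_i - f_i)^2}{2\sigma_i^2} + \frac{1}{2}\log \sigma_i^2 + \text{const.}
```

With one shared, fixed variance this reduces to the MSE up to constants; with per-sample variances the squared errors are weighted by w_i = 1/sigma_i^2, and the log sigma_i^2 term keeps learnable variances from growing without bound.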

So, the cheapest approach would be to give outcomes from the different sources different variances, and if you have more than one output, you will also have more than one variance. How do you guess these parameters? Well, make them learnable parameters and train them together with your model parameters.

Of course, you can make it arbitrarily complicated. Since your cheap labels come from a physics simulation, errors are likely correlated, so you could learn a full covariance matrix. From there you can make it as complex as you like by making the error distribution more complex, but you will likely not have enough data to do so.
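
A minimal PyTorch sketch of the cheapest version, with one learnable log-variance per data source; the toy data, sizes, and learning rate are placeholders:

```python
import torch

# Toy data: 100 experimental rows (src=0) followed by 400 physics-model rows (src=1).
X, y = torch.randn(500, 16), torch.randn(500)
src = torch.cat([torch.zeros(100, dtype=torch.long), torch.ones(400, dtype=torch.long)])

model = torch.nn.Linear(16, 1)
log_var = torch.nn.Parameter(torch.zeros(2))      # one learnable log-variance per source
opt = torch.optim.Adam(list(model.parameters()) + [log_var], lr=1e-2)

for _ in range(200):
    opt.zero_grad()
    resid_sq = (model(X).squeeze(-1) - y) ** 2
    lv = log_var[src]                             # pick each sample's log-variance by its source
    # Gaussian NLL: a weighted MSE (weights exp(-lv) = 1/sigma^2) plus a log-variance penalty.
    loss = (0.5 * (resid_sq * torch.exp(-lv) + lv)).mean()
    loss.backward()
    opt.step()
```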

1