Comments

a6nkc7 t1_iyy22ml wrote

Generally, you do it when you want to get some idea of the covariance between the outputs conditional on the inputs.
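
(If it helps to see that concretely, here's a rough PyTorch sketch with made-up layer sizes and shapes: a single network that predicts both the conditional mean of the outputs and a full conditional covariance via a Cholesky factor, which three separate single-output regressors can't give you.)

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Made-up sizes for the sketch: 16 input features, 3 jointly modeled outputs.
    n_in, n_hidden, n_out = 16, 64, 3

    # One network predicts both the conditional mean of the outputs and a
    # Cholesky factor of their conditional covariance.
    trunk = nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU())
    mean_head = nn.Linear(n_hidden, n_out)
    diag_head = nn.Linear(n_hidden, n_out)                      # diagonal of the Cholesky factor
    lower_head = nn.Linear(n_hidden, n_out * (n_out - 1) // 2)  # strictly lower-triangular entries

    def conditional_dist(x):
        h = trunk(x)
        mean = mean_head(h)
        diag = F.softplus(diag_head(h))  # diagonal must be positive
        lower = torch.zeros(x.shape[0], n_out, n_out)
        rows, cols = torch.tril_indices(n_out, n_out, offset=-1)
        lower[:, rows, cols] = lower_head(h)
        scale_tril = lower + torch.diag_embed(diag)
        return torch.distributions.MultivariateNormal(mean, scale_tril=scale_tril)

    x, y = torch.randn(8, n_in), torch.randn(8, n_out)
    dist = conditional_dist(x)
    nll = -dist.log_prob(y).mean()  # train by minimizing the negative log-likelihood
    cov = dist.covariance_matrix    # input-dependent covariance between the outputs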

12

michaelaalcorn t1_iyyhfhw wrote

Training a single model on three target variables is equivalent to training three separate models that share all parameters except the final layer (assuming a mean squared error loss in both cases), so training a single model effectively regularizes the three models. Whether or not this is a good thing depends on the dataset, but in the limit of infinite data, three separate models will give you better overall performance than a single model, since they won't be regularized.
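
(A rough PyTorch sketch of the two setups, with made-up sizes: under MSE, the single model's loss is just the sum of the per-target losses, so the only difference from three fully separate models is that the trunk is shared.)

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Made-up sizes: 16 input features, 3 regression targets.
    n_in, n_hidden, n_targets = 16, 64, 3

    # Single multi-output model: the hidden layer is shared by all targets.
    shared = nn.Sequential(
        nn.Linear(n_in, n_hidden), nn.ReLU(), nn.Linear(n_hidden, n_targets)
    )

    # Three fully separate models: same architecture, no tied parameters.
    separate = [
        nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU(), nn.Linear(n_hidden, 1))
        for _ in range(n_targets)
    ]

    x, y = torch.randn(8, n_in), torch.randn(8, n_targets)

    # Under MSE the multi-output loss is just the sum of per-target losses,
    # so the only difference between the two setups is the shared trunk.
    loss_shared = F.mse_loss(shared(x), y)
    loss_separate = sum(F.mse_loss(m(x), y[:, i:i + 1]) for i, m in enumerate(separate))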

38

purplebrown_updown t1_iyz8ykc wrote

Efficiency, mostly. But it can also be a matter of accuracy. You should also be hyperparameter tuning each model separately, which becomes cumbersome, especially if you have thousands of outputs.

1

smsorin t1_iyza1zg wrote

If you are inference-constrained, a single model might be better. Since a good chunk of the model is shared, you need less compute and perhaps even less time if you can't parallelize sufficiently. The other comments here make other good arguments.

1

anjmon t1_iyzmeog wrote

On a related note, I am curious about what kind of data you are working on. I am a beginner looking to try out regression on real and novel datasets.

1

PredictorX1 t1_iyzsby0 wrote

For modeling solutions featuring intermediate calculations (such as the hidden layers of multilayer perceptrons), the hope is that what is learned about each target variable might be "shared" with the others. Whether this effect yields a net gain depends on the nature of the data. The outputs of a multiple-output model that is trained iteratively tend to reach their optimum performance after differing numbers of iterations. There is also the logistical benefit of only having to train one larger model versus several.
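
(To make the point about iterations concrete, here's a tiny sketch with made-up numbers: log a per-output validation error each epoch and take the per-column argmin; each output can hit its minimum at a different epoch.)

    import numpy as np

    # Made-up per-epoch validation MSE for a 3-output model;
    # rows are epochs, columns are outputs.
    val_mse = np.array([
        [0.90, 1.20, 0.80],
        [0.60, 0.95, 0.55],
        [0.50, 0.80, 0.52],
        [0.47, 0.72, 0.58],   # output 3 is already past its best epoch
        [0.46, 0.75, 0.66],
    ])

    best_epoch = val_mse.argmin(axis=0)
    print(best_epoch)  # [4 3 2] -- each output peaks at a different iteration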

1

pyepyepie t1_iyzvql1 wrote

Great answer, but I am a little unsure about the last line. If you are using an ANN you can get stuck in a local minimum of the loss function, and I am not sure that learning multiple tasks in parallel won't be beneficial for the model. I am not saying you are incorrect, just trying to learn something new :).

edit: my TL;DR question is whether sharing weights can prevent getting stuck in a local minimum in the case of an ANN, i.e., improve performance.

0

trnka t1_iz06hbj wrote

Multi-task learning has a long history with mixed results - sometimes very beneficial, and sometimes it just flops. At my previous job, we had one situation in which it was helpful and another situation in which it was harmful.

In the harmful situation, adding outputs and keeping the other layers the same led to slight reductions in quality on both tasks. I assume it could've been salvaged if we'd increased the number of parameters -- I think the different outputs were effectively "competing" for hidden params.

Another way to look at this is that multi-task learning is effective regularization, so you can increase the number of parameters without as much risk of horrible overfitting. If I remember correctly, there's research showing that overparameterized networks tend to get stuck in local minima less often.

One last story from the field -- in one of our multi-task learning situations, we found that it was easier to observe local minima by just checking per-output metrics. Two training runs might have the same aggregate metric, but one might be far better at output A and the other far better at output B.
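
(A minimal sketch of that check, with synthetic numbers standing in for two runs' predictions: the aggregate MSE looks about the same, but the per-output breakdown is very different.)

    import numpy as np
    from sklearn.metrics import mean_squared_error

    # Synthetic stand-ins for validation predictions from two training runs
    # of the same 2-output model (all numbers are made up).
    rng = np.random.default_rng(0)
    y_true = rng.normal(size=(1000, 2))
    pred_run_a = y_true + rng.normal(size=(1000, 2)) * [0.5, 1.5]
    pred_run_b = y_true + rng.normal(size=(1000, 2)) * [1.5, 0.5]

    for name, pred in [("run A", pred_run_a), ("run B", pred_run_b)]:
        aggregate = mean_squared_error(y_true, pred)
        per_output = mean_squared_error(y_true, pred, multioutput="raw_values")
        print(name, "aggregate:", round(aggregate, 2), "per-output:", per_output.round(2))
    # Similar aggregate MSE, but each run is far better on a different output.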

2

pyepyepie t1_iz0wj9m wrote

Super interesting. I like the story about the metrics; it's very useful for people who are new to data science. Even when the imbalance isn't solvable (I assume in your case it was, but in MARL, for example, if you just aim for Pareto optimality you can get a weird division of "goods"), most of the time you would rather have two models with x-5% accuracy than one model at x+15 and another at x-15. We get paid to understand the systems we build :)

BTW, what you describe seems related to deep double descent: https://openai.com/blog/deep-double-descent/. That phenomenon is clearly magic :D I have heard some explanations involving weight initialization at a conference, but to be honest I really don't have anything intelligent to say about it. It would be interesting to see whether this is still the standard type of network in 20 years.

1

trnka t1_iz1nk60 wrote

Oh, interesting paper - I hadn't seen it before.

For what it's worth, I haven't observed double descent personally, though I suppose I'd only have noticed it for sure along the training-time axis. We almost always had typical learning curves over epochs - training loss decreases smoothly as expected, and testing loss hits a bottom and then starts climbing unless there's a TON of regularization.

We probably would've seen it along the number-of-parameters axis, because we did random searches over those periodically and graphed the correlations. I only remember seeing one peak in those plots, though we generally didn't evaluate beyond 2x the number of params of our most recent best.

I probably wouldn't have observed the effect along the data axis, because our distribution shifted over the years. For instance, in 2020 we got a lot more respiratory infections coming in due to COVID, which temporarily decreased our numbers and then increased them, since respiratory infections are easier to guess than other conditions.

2