
trnka t1_iz06hbj wrote

Multi-task learning has a long history of mixed results - sometimes it's very beneficial, and sometimes it just flops. At my previous job, we had one situation in which it was helpful and another in which it was harmful.

In the harmful situation, adding outputs while keeping the other layers the same led to slight reductions in quality on both tasks. I suspect it could've been salvaged if we'd increased the number of parameters -- I think the different outputs were effectively "competing" for the shared hidden parameters.
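To make "adding outputs while keeping the other layers the same" concrete, here's a rough sketch of that kind of hard parameter sharing (Keras; the layer sizes and task names are made up, not our actual setup):

```python
import tensorflow as tf

# Made-up sizes/names, just to illustrate hard parameter sharing.
n_features, hidden = 40, 128

inputs = tf.keras.Input(shape=(n_features,))
# Shared hidden layers -- the capacity the two heads "compete" for.
shared = tf.keras.layers.Dense(hidden, activation="relu")(inputs)
shared = tf.keras.layers.Dense(hidden, activation="relu")(shared)
# Task-specific output heads added on top of the unchanged trunk.
out_a = tf.keras.layers.Dense(3, activation="softmax", name="task_a")(shared)
out_b = tf.keras.layers.Dense(5, activation="softmax", name="task_b")(shared)

model = tf.keras.Model(inputs, [out_a, out_b])
model.compile(
    optimizer="adam",
    loss={"task_a": "sparse_categorical_crossentropy",
          "task_b": "sparse_categorical_crossentropy"},
)
# Widening `hidden` is the "more parameters" fix I mean above.
```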

Another way to look at it is that multi-task learning is effective regularization, so you can increase the number of parameters without as much risk of horrible overfitting. If I remember correctly, there's research showing that overparameterized networks tend to get stuck in local minima less often.

One last story from the field -- in one of our multi-task learning setups, we found it was easier to spot bad local minima just by checking per-output metrics. Two training runs might have the same aggregate metric, but one might be far better at output A and the other far better at output B.
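Concretely, the check is just reporting each output's metric alongside the aggregate. A toy sketch with made-up labels, just to show the bookkeeping:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels/predictions for two outputs, just to show the bookkeeping.
y_a, preds_a = np.array([0, 1, 1, 0, 1]), np.array([0, 1, 1, 1, 1])
y_b, preds_b = np.array([1, 0, 0, 1, 0]), np.array([1, 1, 0, 1, 1])

acc_a = accuracy_score(y_a, preds_a)   # per-output metric for output A
acc_b = accuracy_score(y_b, preds_b)   # per-output metric for output B
aggregate = (acc_a + acc_b) / 2        # what an averaged metric hides

# Two runs can tie on `aggregate` while acc_a and acc_b differ a lot.
print(f"output A: {acc_a:.2f}  output B: {acc_b:.2f}  aggregate: {aggregate:.2f}")
```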


pyepyepie t1_iz0wj9m wrote

Super interesting. I like the story about the metrics -- very useful for people who are new to data science. Even when the imbalance isn't solvable (I assume in your case it was, but in MARL, for example, just aiming for Pareto optimality can sometimes give you a weird division of "goods"), most of the time you'd rather have two models at x-5% accuracy than one at x+15% and another at x-15%. We get paid to know the systems we build :)

BTW, what you're describing seems related to deep double descent: https://openai.com/blog/deep-double-descent/. That phenomenon is clearly magic :D I've heard some explanations involving weight initialization at a conference, but to be honest I don't have anything intelligent to say about it. It would be interesting to see if it's still the standard type of network in 20 years.


trnka t1_iz1nk60 wrote

Oh, interesting - I hadn't seen that paper before.

For what it's worth, I haven't observed double descent personally, though I suppose I'd only notice it for sure along the training-time axis. We almost always had typical learning curves over epochs - training loss decreases smoothly as expected, and test loss hits a bottom then starts climbing unless there's a TON of regularization.
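By "typical learning curves" I mean the usual setup of tracking training and validation loss per epoch and stopping once validation loss starts climbing. A rough Keras sketch with placeholder data (the model and data here are stand-ins, not what we actually ran):

```python
import numpy as np
import tensorflow as tf

# Placeholder data so the sketch runs; swap in real features/labels.
X = np.random.rand(1000, 20).astype("float32")
y = (X[:, 0] > 0.5).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",          # the curve that bottoms out then climbs
    patience=5,                  # tolerate a few bad epochs before stopping
    restore_best_weights=True,
)

history = model.fit(X, y, validation_split=0.2, epochs=50,
                    callbacks=[early_stop], verbose=0)
# history.history["loss"] and history.history["val_loss"] are the two curves.
```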

We probably would've seen it along the model-size axis, because we periodically did random searches over the number of parameters and graphed the correlations. I only remember seeing one peak on those graphs, though we generally didn't evaluate beyond 2x the parameter count of our most recent best model.
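The search itself was nothing fancy -- roughly this shape, where `train_and_eval` is a placeholder for whatever training/eval pipeline you already have (it returns a random number here just so the sketch runs):

```python
import random

def train_and_eval(hidden_size):
    """Placeholder: train a model of this width and return a validation metric."""
    return random.random()

# Randomly sample model sizes, then graph metric vs. parameter count
# and look for a single peak vs. a double-descent shape.
results = []
for _ in range(20):
    hidden_size = random.choice([32, 64, 128, 256, 512, 1024])
    results.append((hidden_size, train_and_eval(hidden_size)))

for hidden_size, val_metric in sorted(results):
    print(hidden_size, round(val_metric, 3))
```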

I probably wouldn't have observed the effect with more data, because our distribution shifted over the years. For instance, in 2020 we got a lot more respiratory infections coming in due to COVID, which temporarily decreased our numbers and then increased them, because respiratory infections are easier to guess than other conditions.
