Submitted by Fine-Topic-6127 t3_119ydqv in MachineLearning

A simple question really, but one that's pretty difficult to find an answer to:


Has anyone done much research into model performance vs. model size as a function of the output space (and if so, where can I find it)? Basically, it's quite clear that for most applications, generalisability can be achieved either by improving the dataset or by increasing the size of the model (if your dataset is already good). But because of the way performance is measured on SOTA benchmarks, it's not necessarily obvious (to me at least) that these larger models are appropriate for simpler problems.


Say I have a simple audio classification problem with only one class of interest. If I wanted to implement the latest SOTA models in sound classification, I'd likely end up with some pretty large and complicated model architectures. What I would like to know is: how does one use SOTA benchmarks to inform architecture decisions when the task at hand is significantly simpler than the tasks used to evaluate models on those benchmarks?


It feels like the obvious answer is to just start simple and scale up as required, but that does feel somewhat like trial and error, so it would be great to hear how other people approach this sort of problem...
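For concreteness, the kind of starting point I have in mind is something like this minimal sketch, a tiny CNN over log-mel spectrograms for the one-class case (all layer sizes and names are placeholders I've made up, not taken from any particular SOTA model):

```python
# Minimal sketch: a tiny baseline for a single-class ("target vs. everything else")
# audio classifier operating on log-mel spectrograms. Shapes and sizes are illustrative.
import torch
import torch.nn as nn

class TinyAudioClassifier(nn.Module):
    def __init__(self, n_mels: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 1)  # single logit: class of interest vs. not

    def forward(self, x):  # x: (batch, 1, n_mels, time)
        z = self.features(x).flatten(1)
        return self.head(z)

model = TinyAudioClassifier()
dummy = torch.randn(8, 1, 64, 128)  # batch of fake spectrograms
loss = nn.BCEWithLogitsLoss()(model(dummy).squeeze(1), torch.ones(8))
print(loss.item())
```

The question is really about when (and how to tell) it's worth swapping something like that out for a much bigger architecture.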

4

Comments


floppy_llama t1_j9opzwx wrote

Unfortunately a lot of ML is just trial and error

7

thecuteturtle t1_j9qawia wrote

Ain't that the truth. On another note, OP can try optimizing via grid search, but there's no avoiding trial and error on this.
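Something along these lines, as a rough sketch; the estimator and grid here are just examples, swap in whatever you're actually training:

```python
# Sketch of a grid search: sweep model capacity and regularisation on a small
# validation split. The estimator, data, and grid are stand-ins for illustration.
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=40, random_state=0)  # stand-in features

param_grid = {
    "hidden_layer_sizes": [(32,), (128,), (256, 128)],  # "model size" axis
    "alpha": [1e-4, 1e-2],                              # regularisation
}
search = GridSearchCV(MLPClassifier(max_iter=500, random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```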

2

MadScientist-1214 t1_j9q1bll wrote

The only shortcut I can give you is to look on Kaggle and see what the competitors have used. Most papers are not suitable for real-world applications. It's not really about the complexity or scale of the task, but rather that the authors leave out some important information. For example, in object detection there is DETR, but if you look on Kaggle, nobody uses it. The reason is that the original DETR converges too slowly and was only trained on 640-pixel images. Instead, many people still use YOLO. But you don't realize that until you try it yourself or someone tells you.

2

koolaidman123 t1_j9qhp9n wrote

I have rarely encountered situations where scaling up models (e.g. resnet34 -> resnet50, deberta base -> deberta large/xl) doesn't help. Whether it's practical to do so may be a different story.
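To be concrete, this is all I mean by scaling up; a rough sketch where only the backbone constructor changes (the class count here and whether you load pretrained weights are up to you):

```python
# Sketch: "scaling up" often just means swapping the backbone constructor while the
# rest of the training code stays identical. num_classes=2 is an example value.
import torch.nn as nn
from torchvision import models

def build_classifier(size: str = "small", num_classes: int = 2) -> nn.Module:
    backbone = models.resnet34() if size == "small" else models.resnet50()
    backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)  # replace the head
    return backbone

small = build_classifier("small")
large = build_classifier("large")
print(sum(p.numel() for p in small.parameters()),
      sum(p.numel() for p in large.parameters()))
```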

1

skelly0311 t1_j9scr5c wrote

First thing to note: the best way to improve generalisability and accuracy is to have data that is as accurate as possible. If your data is trash, it doesn't matter how many parameters your classifier has; it will not produce good results.

Now, in my experience with transformer neural networks, if the task is simple binary classification, or multi-label with fewer than 8 or so labels (maybe more), the small models (14 million parameters) perform similarly to the base models (110 million parameters). Once the objective function becomes more complicated, such as training a zero-shot learner, more parameters means achieving a much lower loss. In that case, the large models (335 million parameters) gave a significant improvement over the base models (110 million parameters); the sketch at the end of this comment gives a sense of those size gaps.

It's hard to define and quantify how complicated an objective function is. But just know that more parameters doesn't always mean better results if the objective function is simple enough.
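For reference, you can eyeball the size gap yourself. A quick sketch; the checkpoint names are common public ones used as examples, not necessarily the exact small/base/large models I used:

```python
# Compare checkpoint sizes by counting parameters. The names below are common public
# checkpoints used as examples (downloading them requires network access).
from transformers import AutoModel

for name in ["bert-base-uncased", "bert-large-uncased"]:
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: ~{n_params / 1e6:.0f}M parameters")
```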

1

martianunlimited t1_j9sh43x wrote

Not exactly what you are asking, but there is this paper on scaling laws which shows, at least for large language models (and assuming the training data is representative of the distribution), how the performance of transformers scales with model size and the amount of data, and compares it to other network architectures: https://arxiv.org/pdf/2001.08361.pdf. We don't have anything similar for other types of data.
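The headline result there is a power-law fit of test loss against (non-embedding) parameter count and dataset size. A rough sketch of the functional form; the constants are approximately the values reported in that paper for their language-modelling setup, so treat them as illustrative only:

```python
# Rough sketch of the power-law form from the scaling-laws paper (Kaplan et al., 2020).
# Constants are approximately the values reported there; they will not transfer directly
# to other domains such as audio.
def loss_vs_params(n_params: float, n_c: float = 8.8e13, alpha_n: float = 0.076) -> float:
    """Test loss vs. non-embedding parameter count, when data is not the bottleneck."""
    return (n_c / n_params) ** alpha_n

def loss_vs_data(n_tokens: float, d_c: float = 5.4e13, alpha_d: float = 0.095) -> float:
    """Test loss vs. dataset size (tokens), when model size is not the bottleneck."""
    return (d_c / n_tokens) ** alpha_d

print(loss_vs_params(1e8), loss_vs_params(1e9))  # bigger model -> lower predicted loss
```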

1