Comments

aaaasd12 t1_j3i3ea8 wrote

Is it like transfer learning?

At the company where I work, we only use standard approaches like classification tasks and clustering-based segmentation.

Maybe the one use case I see is in NLP, doing topic modeling with BERTopic and tuning the hyperparameters.

But in general, simple models are perfect for the tasks we have.

0

suflaj t1_j3igfzr wrote

Yes, it's the only way to get high-throughput, high-performance models ATM.

With KD and TensorRT you can get close to 100x throughput (compared to eager TF/PyTorch on the full model) with a 1% performance hit on some models and tasks.

6

NichtMarlon t1_j3ikrkw wrote

Yes, it's very useful for text classification tasks. Big transformers get the highest accuracy, but we can't deploy them because they are too slow. So we distil knowledge from bigger transformers into smaller transformers or CNNs. If you have a decent amount of unlabeled data to pseudo-label with the teacher, there is barely any loss in accuracy for the student model.
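For anyone curious what the distillation objective looks like: here's a minimal NumPy sketch of the classic soft-target loss (Hinton-style KL divergence on temperature-softened distributions; the specific formulation is my assumption of what's being used here, and the logits are made up for illustration):

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax; higher T softens the distribution
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) on softened distributions, scaled by T^2
    # so gradients stay comparable across temperatures
    p = softmax(teacher_logits, T)   # soft targets from the teacher
    q = softmax(student_logits, T)   # student's softened predictions
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * T * T)

# If the student exactly matches the teacher, the loss is zero
print(distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0]))  # → 0.0
```

In practice this term is usually mixed with the ordinary cross-entropy on hard labels (when you have them), with a weight balancing the two.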

4

nmfisher t1_j3l4ipq wrote

Echoing this, KD is also very useful for taking a heavyweight GPU model and training a student model that's light enough to run on mobile. Small sacrifice in quality for huge performance gains.

3

gamerx88 t1_j3m0drc wrote

Yes, we used DistilBERT (and even logistic regression) heavily in my previous startup where data volume was web scale.

Depending on the exact problem, large transformer models can be overkill. For some straightforward text classification, even logistic regression with some feature engineering can get within 3 percentage points of a transformer, at a negligible fraction of the cost.
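For reference, a baseline like that is a few lines with scikit-learn (the toy texts and labels below are made up for illustration, not from any real dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative dataset: 1 = positive, 0 = negative
texts = [
    "great product, works well",
    "terrible, broke in a day",
    "love it, highly recommend",
    "waste of money, very poor",
]
labels = [1, 0, 1, 0]

# TF-IDF unigram/bigram features feeding a logistic regression classifier
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)
clf.fit(texts, labels)
print(clf.score(texts, labels))
```

On real data you'd add proper train/test splits and feature engineering, but this is roughly the whole deployment footprint, versus gigabytes of transformer weights.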

3

xenotecc t1_j3v3gr0 wrote

How small do you make the student when the teacher is, say, a ResNet101? How do you find a good student/teacher size ratio?

Are there any tricks to knowledge distillation? Or just standard vanilla procedure?

2

suflaj t1_j3vg5tm wrote

I think it's a matter of trial and error. The best ratios I've seen were 1:25, but those concerned transformer networks, which are much more sparse than ResNets.

There are some tricks, but they depend on the model. E.g., for transformers, it's not enough to imitate just the last layer. I suspect it's the same for ResNets, given they're deep residual networks just like transformers.
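One common way to go beyond last-layer imitation is matching intermediate hidden states, TinyBERT-style: project the narrower student states up to the teacher's width and penalize the MSE. A rough NumPy sketch (the dimensions 768/256 and the random projection are illustrative assumptions; in practice the projection is learned jointly with the student):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes (assumptions, not from the thread):
# teacher hidden size 768, student hidden size 256, sequence length 8
teacher_hidden = rng.normal(size=(8, 768))  # (seq_len, teacher_dim)
student_hidden = rng.normal(size=(8, 256))  # (seq_len, student_dim)
proj = rng.normal(size=(256, 768)) * 0.05   # learned in practice

def hidden_state_loss(student_h, teacher_h, W):
    # MSE between projected student states and teacher states
    return float(((student_h @ W - teacher_h) ** 2).mean())

loss = hidden_state_loss(student_hidden, teacher_hidden, proj)
print(loss)
```

This kind of per-layer (and sometimes per-attention-map) matching is typically summed over several chosen layer pairs and added to the soft-target loss.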

1