Submitted by MazenAmria t3_zhvwvl in deeplearning

Hello everyone, I'm working on my Bachelor's graduation project, which is a study of Microsoft's SWIN Transformer architecture and how it performs when compressed using knowledge distillation.

However, I'm having difficulty training the model because it has to be trained on ImageNet for around 300 epochs, and I want to make several modifications and evaluate each of them. I have a GTX 1070, which is a decent GPU for deep learning tasks; in this case, however, it's not enough to run even a single experiment within the given time.

As an alternative approach, I thought of applying the same experiments to the MNIST dataset and comparing the results against training the same student model without any distillation; this way I can isolate the effect of the distillation. But I have some concerns about MNIST itself: since much simpler models already perform well on MNIST, the results of using SWIN Transformer on it might be useless or impractical.
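For concreteness, this is roughly the comparison I have in mind: train the student once with a standard soft-target (Hinton-style) KD loss and once with plain cross-entropy, then compare. A minimal sketch (the `T`/`alpha` values are placeholders, not tuned):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft targets: KL divergence between softened distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)  # ordinary supervised loss
    return alpha * soft + (1.0 - alpha) * hard
```

The no-distillation baseline would be the same training loop with `alpha = 0`, i.e. pure cross-entropy.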

I would be happy to hear some advice and opinions.

UPDATE: I've also considered using ImageNet-Mini (a subset of ImageNet that is just 4 GB), but the accuracy improves very slowly.

5

Comments

suflaj t1_izoh23q wrote

As someone who tried fine-tuning SWIN as part of my graduate thesis, I'll warn you that you shouldn't expect good results from the Tiny version. No matter what detector I used, it performed worse than the ancient RetinaNet for some reason... Regression was near perfect, albeit with many duplicate detections, but classification was complete garbage, getting me up to 0.45 mAP (whereas Retina can get like 0.8 no problem).

So, take at least the small version.

5

MazenAmria OP t1_izonquh wrote

That's sad; I'm starting to believe that this research idea is impractical or, maybe more accurately, overly ambitious.

2

suflaj t1_izorabe wrote

I don't think it's SWIN per se. I think it's a combination of the detectors (which take 5 feature maps at different levels of detail) being incompatible with the 4 transformer stages, which lack the spatial bias that convolutional networks provide, and the Tiny model simply being too small.

Other than that, pretraining (near-)SOTA models has been impractical for anyone other than big corporations for quite some time now. But you could always try asking your mentor for your university's compute - my faculty offered GPUs ranging from 1080 Tis to A100s.

Although I don't understand why you insist on pretraining SWIN - many SWIN models pretrained on ImageNet are already available, so you just have to do the distillation part on some portion of the pretraining input distribution. They're offered not only as part of MMCV, but on Huggingface as well.
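For example, pulling a pretrained Swin-Tiny teacher from Huggingface is a couple of lines (a sketch; the checkpoint name here is the ImageNet-1k Swin-Tiny one, swap in whichever size you end up using):

```python
from transformers import AutoImageProcessor, SwinForImageClassification

name = "microsoft/swin-tiny-patch4-window7-224"  # ImageNet-1k pretrained
processor = AutoImageProcessor.from_pretrained(name)
teacher = SwinForImageClassification.from_pretrained(name)
teacher.eval()  # frozen teacher; no pretraining on your end
```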

3

MazenAmria OP t1_izrgk9j wrote

I'm already using a pretrained model as the teacher. But the distillation part itself costs nearly as much as training a model from scratch. I'm not insisting; I just feel like I'm doing something wrong and needed some advice (note that I've only had theoretical experience in these areas of research; this is the first time I'm doing it in practice).

Thanks for your comments.

1

suflaj t1_izruvvi wrote

That makes no sense. Are you sure you're not doing backprop on the teacher model? Distillation should be a lot less resource-intensive.

Furthermore, check how you're distilling the model, i.e. which layers and with what weights. Generally, for transformer architectures, you distill the first (embedding) layer, the attention and hidden layers, and the final (prediction) layer. Distilling only the prediction layer works poorly.
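Concretely, something in this direction (just a sketch, assuming Huggingface-style outputs with `output_hidden_states=True` and `output_attentions=True`, and a student whose layer count and dimensions already line up with the teacher's - in practice you'd need a layer mapping and/or projection):

```python
import torch.nn.functional as F

def layerwise_distill_loss(student_out, teacher_out,
                           w_hidden=1.0, w_attn=1.0, w_pred=1.0, T=2.0):
    loss = 0.0
    # Embedding + hidden layers: MSE between hidden states.
    for s_h, t_h in zip(student_out.hidden_states, teacher_out.hidden_states):
        loss = loss + w_hidden * F.mse_loss(s_h, t_h)
    # Attention maps: MSE between attention distributions.
    for s_a, t_a in zip(student_out.attentions, teacher_out.attentions):
        loss = loss + w_attn * F.mse_loss(s_a, t_a)
    # Prediction layer: KL divergence between softened logits.
    loss = loss + w_pred * F.kl_div(
        F.log_softmax(student_out.logits / T, dim=-1),
        F.softmax(teacher_out.logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return loss
```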

2

MazenAmria OP t1_izt68w9 wrote

I'm using `with torch.no_grad():` when calculating the output of the teacher model.
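i.e. roughly this pattern (the loader/optimizer names are placeholders for my actual setup):

```python
teacher.eval()
for images, labels in train_loader:
    with torch.no_grad():               # no autograd graph for the teacher
        teacher_logits = teacher(images)
    student_logits = student(images)    # only the student gets gradients
    loss = distillation_loss(student_logits, teacher_logits, labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```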

1

suflaj t1_iztjolh wrote

Then it's strange. Unless your student model is similarly sized, there's no reason why a no_grad teacher plus a student should be as resource-intensive as training the teacher itself with backprop.

As a rule of thumb, you should be using several times less memory. How much less are you using for the same batch size in your case?
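You can measure it with PyTorch's built-in counters, e.g.:

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run one forward/backward step of the distillation loop here ...
peak = torch.cuda.max_memory_allocated() / 2**30
print(f"peak GPU memory: {peak:.2f} GiB")
```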

1

sqweeeeeeeeeeeeeeeps t1_izob3yb wrote

MNIST and ImageNet span a huge range. Try something in between, preferably multiple datasets, for example CIFAR-10 and CIFAR-100. I would expect it to perform more similarly to the full SWIN model on CIFAR-10 because of the lower data complexity.

1

MazenAmria OP t1_izon556 wrote

> I would expect it to perform more similarly to the full SWIN model on CIFAR-10 because of the lower data complexity.

And that's the problem. If I got, say, 98% accuracy on CIFAR-10 using SWIN-Tiny and then got the same 98% with a smaller model, I wouldn't be proving anything; there are many simple models that can get 98% on CIFAR-10, so what improvement would I have introduced to SWIN-Tiny? Doing the same thing with ImageNet would be different.

1

sqweeeeeeeeeeeeeeeps t1_izphlmd wrote

? You would be proving that your SWIN model is overparameterized for CIFAR. Make an EVEN simpler model than those; you probably won't be able to with off-the-shelf distillation. Doing this just for ImageNet literally doesn't change anything; it's just a different, more complex dataset.

What's your end goal? To come up with a distillation technique that makes NNs more efficient and smaller?

1

MazenAmria OP t1_izpii1s wrote

To examine whether SWIN itself is overparameterized or not.

1

sqweeeeeeeeeeeeeeeps t1_izspv5o wrote

Showing you can create a smaller model with the same performance means SWIN is overparameterized for that given task. Test it on datasets of varying complexity, not just a single one.

2

pr0d_ t1_izqjmmk wrote

Yeah, as per my comment, the DeiT papers explored knowledge distillation for Vision Transformers. What you want to do here is probably similar, and the resources needed to prove it are huge, to say the least. Any chance you've discussed this with your advisor?

1

MazenAmria OP t1_izrgnco wrote

I remember reading it; I'll read it again and discuss it. Thanks.

1

pr0d_ t1_izqj9d8 wrote

Any chance you've read the DeiT papers?

0