Hello Everyone 👋,

I just implemented the paper "An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale", popularly known as the Vision Transformer (ViT) paper. The paper applies a Transformer encoder to image recognition. It reaches state-of-the-art performance without convolutional layers, provided a huge dataset and enough computational resources are available.
Below I am sharing my implementation of this paper; please have a look and give it a 🌟 if you like it. The implementation aims for easy-to-read code that shows how the model works internally.
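
For readers new to the paper, the "16x16 words" in the title refers to cutting the image into fixed-size patches that the Transformer encoder treats as tokens. Here is a minimal plain-Python sketch of that patch step. It is not taken from the linked repository: the function names `num_patches` and `extract_patches`, the single-channel image, and the 224x224 input size are all assumptions made for illustration.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return (image_size // patch_size) ** 2


def extract_patches(image, patch_size):
    """Split a square single-channel image (list of rows) into flattened patches.

    Patches are returned in row-major order; each one is a flat list of
    patch_size * patch_size pixel values, i.e. one "word" for the encoder.
    """
    size = len(image)
    num_patches(size, patch_size)  # reuse the divisibility check
    patches = []
    for top in range(0, size, patch_size):
        for left in range(0, size, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Hypothetical 4x4 "image" whose pixel values are just their row-major index.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = extract_patches(image, patch_size=2)

print(len(patches))          # 4 patches of 4 pixels each
print(patches[0])            # top-left patch: [0, 1, 4, 5]
print(num_patches(224, 16))  # 196 tokens for an assumed 224x224 input
```

In the actual model each flattened patch is then linearly projected, combined with a position embedding, and fed to the encoder together with a class token; none of that is shown here.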

My implementation: GitHub Link

Thanks for your attention. 😀

Comments

MOSFETBJT t1_j078ma9 wrote on December 14, 2022 at 4:04 PM

Thanks dude. Tensorflow gets a lot of hate on this sub. But I think part of it is people memeing

skadoodlee t1_j07z95s wrote on December 14, 2022 at 6:55 PM

Why does it get hate?

M4xM9450 t1_j08eql4 wrote on December 14, 2022 at 8:31 PM

It started out being not as “pythonic” as PyTorch, so people flocked over to PyTorch. Many new papers and models are implemented in PyTorch, and very few see the point in converting them to TensorFlow since many of these models just run on desktops or servers. That said, both frameworks have their ups and downs. I myself started with Keras when it first got integrated into TensorFlow and haven’t really wanted to use PyTorch because it’s limited when it comes to deploying to web/mobile apps.

CatalyzeX_code_bot t1_j06c8zw wrote on December 14, 2022 at 11:40 AM

Found relevant code at https://github.com/google-research/vision_transformer + all code implementations here

To opt out from receiving code links, DM me

Keepclamand- t1_j06k9ky wrote on December 14, 2022 at 1:05 PM

Interesting. Can you share some results and learnings?

TensorDudee OP t1_j06kh4d wrote on December 14, 2022 at 1:07 PM

Without pretraining, the model overfits on CIFAR-10.

Internal-Diet-514 t1_j073xp9 wrote on December 14, 2022 at 3:33 PM

Stuff like that always makes me wonder. I mean if they had to train it on several other datasets before training it on CIFAR-10, isn’t it a worse architecture (for the specific problem) than one that performs well trained from scratch on CIFAR-10? And if that model followed the same training procedure as the VIT I wonder if it would beat it.

murrdpirate t1_j07k4v2 wrote on December 14, 2022 at 5:19 PM

I don't think "worse" is a clear description. The issue is just that it's too complex for CIFAR-10 alone. Any model can be increased in complexity until it overfits, and thus performs worse.

A model that doesn't overfit on CIFAR-10 is unlikely to benefit from pretraining on other datasets. Unless somehow the other datasets are more closely aligned to CIFAR-10 Test than CIFAR-10 Train is.

Internal-Diet-514 t1_j07s3t2 wrote on December 14, 2022 at 6:09 PM

I think that’s why we have to be careful how we add complexity. The same model with more parameters will overfit quicker because it can start to memorize the training set, but if we add complexity in its ability to model more meaningful relationships in the data tied to the response, then I think overfitting would still happen, but we’d still get better validation performance. So maybe VIT for cifar-10 didn’t add any additional capabilities that were worth it for the problem, just additional complexity.

murrdpirate t1_j087lji wrote on December 14, 2022 at 7:47 PM

>I think overfitting would still happen, but we’d still get better validation performance.

I think by definition, overfitting means your validation performance decreases (or at least does not increase).

>So maybe VIT for cifar-10 didn’t add any additional capabilities that were worth it for the problem, just additional complexity

Depends on what you mean by "the problem." The problem could be:

1. Get the best possible performance on CIFAR-10 Test
2. Get the best possible performance on CIFAR-10 Test, but only train on CIFAR-10 Train

Even if it was the second one, you could likely just reduce the complexity of the VIT model and have it outperform other models. Or keep it the same, but use heavy regularization during training.

nucLeaRStarcraft t1_j07bufu wrote on December 14, 2022 at 4:25 PM

We're generally trying to maximize the available labeled data. If the Transformer can ingest more data and in the end performs better than any non-attention-based model given the same amount of data, then it's a better architecture.

You are asking a proper question, but I think the body of recent work shows that the Transformer indeed generalizes better. Otherwise, we'd see similar results with non-transformer-based architectures, since the data and compute are already there for the groups who do this kind of research.

pyepyepie t1_j07gugl wrote on December 14, 2022 at 4:58 PM

I think it's kind of important to state what our models do better, I really dislike this SOTA thing on some dataset, Internal-Diet has a point here.

Internal-Diet-514 t1_j07pfk6 wrote on December 14, 2022 at 5:52 PM

On your first paragraph: when you say “given the same amount of data”, isn’t it shown here that the ViT was given more data, since it was trained on other datasets before being fine-tuned on CIFAR-10, and then compared to other models that were most likely trained on CIFAR-10 alone? I guess my worry is that if we’re going to do a proper comparison between models, they should all follow the same training procedure. You can reach SOTA performance on a dataset using techniques other than architecture alone.

nucLeaRStarcraft t1_j08cjvc wrote on December 14, 2022 at 8:18 PM

I agree with you, if we want to test the architecture, we should use the same training procedure, including pre-training.

My theory is, that given the current results of GPT-like models, which use transformers under the hood, and given the fact that these groups have the compute power and data to train non-attention based recurrent models, it's quite unlikely that the architecture isn't a main contributor.

pyepyepie t1_j07bgek wrote on December 14, 2022 at 4:23 PM

Just my 2 cents, ignoring the specific model details (as I don't do vision): Well, you would assume every model works differently on different data. For example, try to train a large NN on 10 examples that are y = mx + b, and then try to do the same but with a linear model. The same applies also in less clear situations, i.e. larger models that require more data vs larger models that are more sample efficient but introduce more bias.
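
A toy sketch of this thought experiment in plain Python. Note the substitutions, which are assumptions made for illustration and not part of the comment: a degree-9 interpolating polynomial stands in for the "large NN" (it has enough capacity to memorize all 10 points), the true line is y = 2x + 1, and the noise is a fixed alternating ±0.5 so the run is reproducible.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = m*x + b; returns a predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return lambda x: m * x + b


def fit_interpolant(xs, ys):
    """Lagrange polynomial through every training point (high-capacity stand-in)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict


def mse(model, xs, ys):
    """Mean squared error of a predictor on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def truth(x):
    """Assumed ground-truth relationship, y = m*x + b with m=2, b=1."""
    return 2.0 * x + 1.0


# 10 noisy training examples; noise alternates +0.5 / -0.5 (deterministic).
train_x = [float(i) for i in range(10)]
train_y = [truth(x) + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(train_x)]

# Noise-free test points halfway between the training points.
test_x = [x + 0.5 for x in train_x[:-1]]
test_y = [truth(x) for x in test_x]

line = fit_line(train_x, train_y)
poly = fit_interpolant(train_x, train_y)

print("train MSE  line:", mse(line, train_x, train_y), " interpolant:", mse(poly, train_x, train_y))
print("test  MSE  line:", mse(line, test_x, test_y), " interpolant:", mse(poly, test_x, test_y))
```

The interpolant fits the training set perfectly but does worse than the line on the held-out points, which is the point being made about matching model capacity to the data available.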

Internal-Diet-514 t1_j07qmb0 wrote on December 14, 2022 at 6:00 PM

I agree with you. It’s just that nowadays, when people say they have created an architecture that outperforms some baseline, they really mean it outperforms some baseline on ImageNet or CIFAR or some other established dataset. All data is different, and I really think the focus should be on what added ability this architecture has to model relationships in the input data that a baseline doesn’t, and how that helps with this specific problem. That is why the transformer was such a great architecture to begin with for NLP problems: it demonstrated the ability to model longer-range dependencies than an LSTM-like architecture. I’m just not sure it translated well to vision when we begin to say it’s better than a pure CNN-based architecture.

pyepyepie t1_j08pa80 wrote on December 14, 2022 at 9:37 PM

Ideas > performance, for sure :)

assimil8or t1_j07u7jt wrote on December 14, 2022 at 6:23 PM

Who still cares about CIFAR-10 though? I know it’s a standard dataset but it just seems like it’s completely solved in so many different ways … better to look at harder problems.

Osamabinbush t1_j09pb9w wrote on December 15, 2022 at 1:56 AM

I think you probably didn't train it long enough to reach the interpolating regime if you are over-fitting ViT.

[deleted] t1_j07awk0 wrote on December 14, 2022 at 4:19 PM

[deleted]

TensorDudee OP t1_j07oard wrote on December 14, 2022 at 5:45 PM

Guys if you like it please show some ♥️ by starring the repository.

anymorenevermore t1_j07q7qw wrote on December 14, 2022 at 5:57 PM

is this the "subscribe to my OF" of nerds?

[deleted] t1_j09t0n8 wrote on December 15, 2022 at 2:23 AM

What does OF stand for?

trashacount12345 t1_j09v4tc wrote on December 15, 2022 at 2:38 AM

Onlyfans

anymorenevermore t1_j09tm5m wrote on December 15, 2022 at 2:27 AM

If you have to ask, you don't want or need to know.

Valdaora t1_j06u0vk wrote on December 14, 2022 at 2:24 PM

LOL Just learn pytorch

Deep-Station-1746 t1_j06kayz wrote on December 14, 2022 at 1:05 PM

Solid work. This reminds me of that internet explorer meme.

TensorDudee OP t1_j06keu3 wrote on December 14, 2022 at 1:06 PM

How so?

Deep-Station-1746 t1_j06ku87 wrote on December 14, 2022 at 1:10 PM

Everything is slow and hard to implement in TensorFlow, and without many redeeming excuses either (compared to JAX, for example).

S8nSins t1_j06plvw wrote on December 14, 2022 at 1:50 PM

This guy must be still using Tensorflow 1.x

TensorDudee OP t1_j06stvm wrote on December 14, 2022 at 2:15 PM

I do not know about others but TensorFlow 1.x (tf.compat.v1) is still my favorite. But the learning curve is steep.

therealjtgill t1_j070otn wrote on December 14, 2022 at 3:11 PM

Same - 1.15 is my favorite

Erosis t1_j07aho6 wrote on December 14, 2022 at 4:16 PM

Yet people here praise Torch when Tensorflow equivalents are often faster in production. Tensorflow still has relevance and gets a bit too much hate here (and I personally prefer pytorch).