
stecas OP t1_j28o6xw wrote

I just checked the papers. There are 16 x 16 words in total. Sentence length is standardized, i.e., every image has the same representation length when it is given to the transformer. It’s not that each word corresponds to 16x16x3 pixels.

But you understand my point, right? I’m asking why the images are cut up into words spatially instead of channel-wise.
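
To make the two options concrete, here is a rough sketch of what I mean. The 64x64 RGB image size, the variable names, and the way each word is flattened are my own assumptions for illustration, not something taken from the papers. It only shows the difference in how the pixels get grouped into words.

```python
import numpy as np

# Assumed sizes for illustration only: a 64x64 RGB image and a 16 x 16 grid of words.
H = W = 64
C = 3
GRID = 16
P = H // GRID  # side length in pixels of one spatial patch (4 here)

# Dummy image whose values are just a running index, so the grouping is easy to inspect.
image = np.arange(H * W * C, dtype=np.float32).reshape(H, W, C)

# Spatial split: each word is one PxP patch with all channels kept together.
spatial_words = (
    image.reshape(GRID, P, GRID, P, C)   # (grid_row, row_in_patch, grid_col, col_in_patch, channel)
         .transpose(0, 2, 1, 3, 4)       # (grid_row, grid_col, row_in_patch, col_in_patch, channel)
         .reshape(GRID * GRID, P * P * C)
)

# Channel-wise split: each word is one full colour plane of the image.
channel_words = image.transpose(2, 0, 1).reshape(C, H * W)

print(spatial_words.shape)  # (256, 48)
print(channel_words.shape)  # (3, 4096)
```

With these assumed sizes the spatial split gives 256 short words, while the channel-wise split gives only 3 very long ones.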