stecas OP t1_j28o6xw wrote

I just checked the paper. There are 16 x 16 total words, and the sentence length is standardized, i.e. all images have the same representation length when given to the transformer. It's not that each word corresponds to 16x16x3 pixels.

But you understand my point, right? I'm asking why the images are cut up into words spatially instead of channel-wise.
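To make the spatial cut-up concrete, here is a minimal sketch of the tokenization I'm describing (a hypothetical PyTorch snippet, using an illustrative 256x256 image so the numbers come out to 16 x 16 = 256 words; the shapes are my assumptions, not the paper's exact setup):

```python
import torch

B, C, H, W = 1, 3, 256, 256   # illustrative 256x256 RGB image
P = 16                        # patch ("word") size

x = torch.randn(B, C, H, W)

# Cut the image up spatially into PxP patches, keeping all channels together.
patches = x.unfold(2, P, P).unfold(3, P, P)   # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5)   # (B, H/P, W/P, C, P, P)

# Flatten each patch, channels included, into one token of length C*P*P.
tokens = patches.reshape(B, (H // P) * (W // P), C * P * P)

print(tokens.shape)  # torch.Size([1, 256, 768]): 16x16 = 256 words, fixed length
```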

3

tdgros t1_j28rhcc wrote

Ah, I see what you mean. You're right, my way of seeing it is the one that is not standard. My point is that transformers don't really care about the original modality, or about the order or spatial arrangement of their tokens; ViTs are just transformers over sequences of "patches of pixels" (note: with the channels flattened together!). On top of this, there is work to forcefully bring back locality biases (position embeddings, Swin transformers...), which explains why I don't tend to break tokens into different dimensions.

You can recompose the sequence into an (H/16)x(W/16)xNdims image, whose channels can be visualized separately if you want. More often, it's the attention maps themselves that are used for visualization or interpretation, head per head (i.e. the number of channels here really is the number of heads).
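Here is a rough sketch of both ideas, with random tensors standing in for a real model's outputs (all shapes are illustrative assumptions):

```python
import torch

Hp, Wp, D, heads = 16, 16, 768, 12   # (H/16)x(W/16) patch grid, illustrative dims
n = Hp * Wp                          # 256 tokens

# Transformer output: one D-dim vector per patch token.
feats = torch.randn(1, n, D)

# Recompose the sequence into an (H/16)x(W/16)xNdims "image";
# each of the D channels is then a 16x16 map you could visualize on its own.
grid = feats.reshape(1, Hp, Wp, D)
channel0 = grid[0, :, :, 0]

# Attention maps, head per head: the "channel" axis here really is the head axis.
attn = torch.softmax(torch.randn(1, heads, n, n), dim=-1)
attn_map = attn[0, 3, 0].reshape(Hp, Wp)   # where token 0 attends, for head 3
```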

2