
tdgros t1_j28rhcc wrote

Ah I see what you mean, you're right: my way of seeing it is the one that isn't standard. My point is that transformers don't really care about the original modality or about the order or spatial arrangement of their tokens; ViTs are just transformers over sequences of "patches of pixels" (note that the channels are flattened together within each patch!). On top of this, there is work to forcefully bring back locality biases (position embeddings, Swin transformers...), which explains why I don't tend to break tokens into different dimensions. You can recompose the sequence into an (H/16)x(W/16)xNdims image, whose channels can be visualized separately if you want. More often, it's the attention maps themselves that are used for visualization or interpretation, head by head (i.e. the number of channels here really is the number of heads).
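A rough sketch of what I mean (PyTorch, with made-up shapes and a 16x16 patch size; the tensor names are just for illustration):

```python
import torch

B, C, H, W, P = 1, 3, 224, 224, 16                    # batch, channels, height, width, patch size
img = torch.randn(B, C, H, W)

# Patchify: (B, C, H, W) -> (B, num_tokens, P*P*C); the channels are flattened into each token.
patches = img.unfold(2, P, P).unfold(3, P, P)          # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5)            # (B, H/P, W/P, C, P, P)
tokens = patches.reshape(B, (H // P) * (W // P), C * P * P)   # just a sequence of flat patches

# The transformer only ever sees this sequence; say it outputs Ndims features per token.
Ndims = 768
encoder_out = torch.randn(B, tokens.shape[1], Ndims)   # stand-in for the transformer output

# Recompose the sequence into an (H/16) x (W/16) x Ndims "image" of token features,
# whose channels you can look at one by one.
feature_map = encoder_out.reshape(B, H // P, W // P, Ndims)
channel_0 = feature_map[0, :, :, 0]                    # one 14x14 channel to visualize

# Attention maps are usually looked at per head: for a (B, heads, N, N) attention tensor,
# one query's row can be reshaped back onto the patch grid.
num_heads = 12
attn = torch.softmax(torch.randn(B, num_heads, tokens.shape[1], tokens.shape[1]), dim=-1)
head_map = attn[0, 0, 0].reshape(H // P, W // P)       # attention of token 0, head 0, as a 14x14 map
```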
