
trnka t1_jcalqfm wrote

Converting the text to fixed-size windows is done to make training more efficient. If the inputs are shorter than the window, they're padded up to the correct length with padding tokens; if they're longer, they're truncated. It's done so that you can stack multiple examples into a single batch, which becomes an additional dimension on your tensors. It's a common technique even for LSTMs/CNNs.
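Here's a minimal sketch of what that looks like (the `pad_id` value and helper name are just for illustration, not from any particular tokenizer):

```python
import torch

def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Force a list of token ids to exactly max_len tokens."""
    if len(token_ids) >= max_len:
        return token_ids[:max_len]                      # truncate long sequences
    return token_ids + [pad_id] * (max_len - len(token_ids))  # pad short ones

# Three variable-length examples become one (batch, max_len) tensor.
examples = [[5, 8, 2], [7, 1, 9, 4, 3, 6], [2]]
batch = torch.tensor([pad_or_truncate(x, max_len=4) for x in examples])
print(batch.shape)  # torch.Size([3, 4])
```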

It's often possible to take the trained model and apply it to variable-length test data, so long as you're dealing with a single example at a time rather than a batch. But keep in mind that with transformers, self-attention does N^2 comparisons, where N is the number of tokens, so it doesn't scale well to long texts.
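Roughly like this, assuming the model's positional handling covers the sequence lengths you feed it (the stand-in encoder here is just to show the shapes):

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
model = nn.TransformerEncoder(layer, num_layers=2)

for seq_len in (5, 17, 64):              # different lengths, no padding needed
    x = torch.randn(1, seq_len, 32)      # (batch=1, tokens, embedding dim)
    out = model(x)                       # attention cost grows as seq_len**2
    print(seq_len, out.shape)
```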

It's possible that the positional encoding is specific to the input length, depending on the transformer implementation. For instance, in Karpathy's GPT recreation video he made the positional encoding a learnable embedding per position, so it wouldn't have defined values for positions beyond the training length.
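Something in this spirit (names like `block_size`/`n_embd` are assumptions for the sketch, not his exact code):

```python
import torch
import torch.nn as nn

block_size, n_embd, vocab_size = 8, 32, 100
tok_emb = nn.Embedding(vocab_size, n_embd)
pos_emb = nn.Embedding(block_size, n_embd)   # one learned vector per position

for length in (8, 12):                       # 12 exceeds block_size
    idx = torch.randint(0, vocab_size, (1, length))
    positions = torch.arange(length)
    try:
        x = tok_emb(idx) + pos_emb(positions)
        print(length, "ok", x.shape)
    except IndexError:
        print(length, "fails: no learned embedding for positions >=", block_size)
```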

One common alternative in training is to create batches of examples that are mostly the same text length, then pad only to the max length within each batch. You can get training speedups that way, but it takes a bit of extra code.
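A rough sketch of that bucketing idea, assuming examples are lists of token ids and a placeholder `pad_id`:

```python
import torch

def bucketed_batches(examples, batch_size, pad_id=0):
    ordered = sorted(examples, key=len)                  # group similar lengths
    for i in range(0, len(ordered), batch_size):
        chunk = ordered[i:i + batch_size]
        max_len = max(len(x) for x in chunk)             # pad within the batch only
        yield torch.tensor([x + [pad_id] * (max_len - len(x)) for x in chunk])

examples = [[1, 2], [3, 4, 5, 6, 7], [8], [9, 10, 11], [12, 13, 14, 15], [16, 17]]
for batch in bucketed_batches(examples, batch_size=2):
    print(batch.shape)   # each batch is only padded to its own max length
```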
