
mildresponse t1_j4xjmvw wrote

My interpretation is that a word should get a different embedding value when it appears at a different position (context) in the input. Without positional embeddings, the learned word embedding is forced into a kind of average over every position the word can occur in. The positional offsets give the model the flexibility to resolve the same word differently in different contexts.
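As a minimal sketch of that idea (using the sinusoidal scheme from the original Transformer paper; many models learn the positional table instead, and the names `vocab`/`word_emb` here are just illustrative):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal positional encodings, as in 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model)[None, :]          # (1, d_model)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))  # (seq_len, d_model)

d_model = 8
vocab = {"the": 0, "cat": 1, "sat": 2}
rng = np.random.default_rng(0)
word_emb = rng.normal(size=(len(vocab), d_model))   # stands in for the learned lookup table

tokens = ["the", "cat", "sat"]
tok_vecs = word_emb[[vocab[t] for t in tokens]]     # position-blind: same vector per word
pos_vecs = sinusoidal_positions(len(tokens), d_model)
inputs = tok_vecs + pos_vecs                        # what the first attention layer actually sees

# Same word, two different positions -> two different input vectors:
the_at_pos0 = word_emb[vocab["the"]] + pos_vecs[0]
the_at_pos2 = word_emb[vocab["the"]] + pos_vecs[2]
print(np.allclose(the_at_pos0, the_at_pos2))        # False
```

Without the `+ pos_vecs` step, every occurrence of "the" would enter the model as exactly the same vector, no matter where it sits in the sequence.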

Because the embeddings are high dimensional vectors of floats, I'd guess the risk of degeneracy (i.e. that the embeddings could start to overlap with one another) is virtually 0.
