
mildresponse t1_j4xjmvw wrote

My interpretation is that a word should get a different embedding value when it appears at a different position (context) in the input. Without positional embeddings, the learned word embedding is forced into a kind of average over every position the word can occur in. The positional offsets give the model the flexibility to resolve the same word differently in different contexts.
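As a minimal sketch of that idea (using the sinusoidal scheme from the original Transformer paper; many models learn the positional table instead, and the names `vocab`/`word_emb` here are just illustrative):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal positional encodings, as in 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model)[None, :]          # (1, d_model)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))  # (seq_len, d_model)

d_model = 8
vocab = {"the": 0, "cat": 1, "sat": 2}
rng = np.random.default_rng(0)
word_emb = rng.normal(size=(len(vocab), d_model))   # stands in for the learned lookup table

tokens = ["the", "cat", "sat"]
tok_vecs = word_emb[[vocab[t] for t in tokens]]     # position-blind: same vector per word
pos_vecs = sinusoidal_positions(len(tokens), d_model)
inputs = tok_vecs + pos_vecs                        # what the first attention layer actually sees

# Same word, two different positions -> two different input vectors:
the_at_pos0 = word_emb[vocab["the"]] + pos_vecs[0]
the_at_pos2 = word_emb[vocab["the"]] + pos_vecs[2]
print(np.allclose(the_at_pos0, the_at_pos2))        # False
```

Without the `+ pos_vecs` step, every occurrence of "the" would enter the model as exactly the same vector, no matter where it sits in the sequence.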

Because the embeddings are high dimensional vectors of floats, I'd guess the risk of degeneracy (i.e. that the embeddings could start to overlap with one another) is virtually 0.
