Submitted by Simusid t3_11okrni in MachineLearning
Simusid OP t1_jciguq5 wrote
Reply to comment by deliciously_methodic in [Discussion] Compare OpenAI and SentenceTransformer Sentence Embeddings by Simusid
"words to numbers" is the secret sauce of all the models including the new GPT-4. Individual words are tokenized (sometimes into "word pieces") and a mapping from the tokens to numbers via a vocabulary is made. Then the model is trained on pairs of sentences A and B. Sometimes the model is shown a pair where B correctly follows A, and sometimes not. Eventually the model learns to predict what is most likely to come next.
"he went to the bank", "he made a deposit"
B probably follows A
"he went to the bank", "he bought a duck"
Does not.
That is one type of training for learning valid/invalid text. Another is "leave one out" training, where the input is a full sentence with one word (typically) removed.
"he went to the convenience store and bought a gallon of _____"
and the model should learn that the most common answer will probably be "milk"
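The fill-in-the-blank idea can be sketched with toy statistics (the counts below are made up; a real model learns these from massive corpora, and conditions on the whole sentence rather than a lookup table):

```python
from collections import Counter

# Hypothetical counts of how often each word fills
# "bought a gallon of ___" in some corpus.
completions = Counter({"milk": 120, "gas": 40, "paint": 5})

def predict_blank(counts):
    # Return the most frequent completion as the model's best guess.
    return counts.most_common(1)[0][0]

print(predict_blank(completions))  # milk
```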
Back to your first question: in 3D, your first two embeddings should be close together because the sentences are similar, and both should be "far" from the third embedding.
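A quick sketch of what "close" and "far" mean here, using cosine similarity on made-up 3D embeddings (the vectors are invented for illustration; real sentence embeddings have hundreds of dimensions):

```python
import math

def cosine_sim(u, v):
    # Cosine similarity: 1.0 = same direction, 0.0 = orthogonal.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3D embeddings for the three sentences above.
a = [0.9, 0.1, 0.2]   # "he went to the bank"
b = [0.8, 0.2, 0.1]   # "he made a deposit"
c = [0.1, 0.9, 0.8]   # "he bought a duck"

print(cosine_sim(a, b) > cosine_sim(a, c))  # True: a and b are closer
```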