[Discussion] Compare OpenAI and SentenceTransformer Sentence Embeddings (submitted by Simusid in r/MachineLearning)
deliciously_methodic wrote:
Thanks, very informative. Can we dumb this down further? What would a 3-dimensional embedding table look like for the following sentences? And how do we go from words to numbers? What is the algorithm?
- Bank deposit.
- Bank withdrawal.
- River bank.
Simusid (OP) wrote:
"words to numbers" is the secret sauce of all the models including the new GPT-4. Individual words are tokenized (sometimes into "word pieces") and a mapping from the tokens to numbers via a vocabulary is made. Then the model is trained on pairs of sentences A and B. Sometimes the model is shown a pair where B correctly follows A, and sometimes not. Eventually the model learns to predict what is most likely to come next.
"he went to the bank", "he made a deposit"
B probably follows A
"he went to the bank", "he bought a duck"
Does not.
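To make the "words to numbers" step concrete, here is a minimal toy sketch of the vocabulary lookup. The vocabulary below is invented for illustration; real subword vocabularies are learned from data and contain tens of thousands of entries.

```python
# Toy "words to numbers": map each token to an integer ID via a vocabulary.
vocab = {"[UNK]": 0, "bank": 1, "deposit": 2, "withdrawal": 3, "river": 4}

def tokenize(sentence):
    # Lowercase and split on whitespace; real tokenizers also break rare
    # words into "word pieces" instead of falling back to [UNK].
    return [vocab.get(word.strip("."), vocab["[UNK]"]) for word in sentence.lower().split()]

print(tokenize("Bank deposit."))     # [1, 2]
print(tokenize("Bank withdrawal."))  # [1, 3]
print(tokenize("River bank."))       # [4, 1]
```

The model never sees the strings, only these ID sequences; everything downstream, embeddings included, is learned on top of them.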
Sentence-pair prediction is one type of training, teaching the model to distinguish valid from invalid continuations. Another is "leave one out" training, where the input is a full sentence minus one word (typically):
"he went to the convenience store and bought a gallon of _____"
and the model should learn that the most likely answer is probably "milk".
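You can try this masked-word prediction with an off-the-shelf model. Here's a quick sketch using the Hugging Face transformers library; the model name is just one common choice, and the first run downloads the weights:

```python
# Masked-word ("leave one out") prediction with a pretrained BERT model.
# Requires: pip install transformers torch
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# BERT marks the missing word with its [MASK] token.
for pred in fill("he went to the convenience store and bought a gallon of [MASK]."):
    print(f"{pred['token_str']:>10}  score={pred['score']:.3f}")
```

The top candidates should be things commonly sold by the gallon, with "milk" near the top.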
Back to your first question: in 3D, the first two embeddings should be close together because the sentences are similar, and both should be far from the third. You can check this with real sentence embeddings, as in the sketch below.
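A quick check using the sentence-transformers library (the model name is just one common choice; note it produces 384-dimensional vectors, not 3, so you would project with something like PCA to actually plot them in 3D):

```python
# Compare the three "bank" sentences with a sentence-embedding model.
# Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["Bank deposit.", "Bank withdrawal.", "River bank."]
emb = model.encode(sentences)  # shape (3, 384)

# Cosine similarity matrix: expect the (deposit, withdrawal) pair to score
# noticeably higher than either pairing with "River bank."
print(util.cos_sim(emb, emb))
```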