[Discussion] Compare OpenAI and SentenceTransformer Sentence Embeddings (submitted by Simusid in r/MachineLearning)
deliciously_methodic wrote:
Thanks, very informative. Can we dumb this down further? What would a 3-dimensional embedding table look like for the following sentences? And how do we go from words to numbers? What is the algorithm?
- Bank deposit.
- Bank withdrawal.
- River bank.
Simusid (OP) wrote:
"words to numbers" is the secret sauce of all the models including the new GPT-4. Individual words are tokenized (sometimes into "word pieces") and a mapping from the tokens to numbers via a vocabulary is made. Then the model is trained on pairs of sentences A and B. Sometimes the model is shown a pair where B correctly follows A, and sometimes not. Eventually the model learns to predict what is most likely to come next.
"he went to the bank", "he made a deposit"
B probably follows A
"he went to the bank", "he bought a duck"
Does not.
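To make the "words to numbers" step concrete, here is a minimal toy sketch of the vocabulary lookup. The vocabulary below is invented for illustration; real subword vocabularies are learned from data and contain tens of thousands of entries.

```python
# Toy "words to numbers": map each token to an integer ID via a vocabulary.
vocab = {"[UNK]": 0, "bank": 1, "deposit": 2, "withdrawal": 3, "river": 4}

def tokenize(sentence):
    # Lowercase and split on whitespace; real tokenizers also break rare
    # words into "word pieces" instead of falling back to [UNK].
    return [vocab.get(word.strip("."), vocab["[UNK]"]) for word in sentence.lower().split()]

print(tokenize("Bank deposit."))     # [1, 2]
print(tokenize("Bank withdrawal."))  # [1, 3]
print(tokenize("River bank."))       # [4, 1]
```

The model never sees the strings, only these ID sequences; everything downstream, embeddings included, is learned on top of them.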
Sentence-pair prediction is one type of training, teaching the model to distinguish valid from invalid continuations. Another is "leave one out" training, where the input is a full sentence minus one word (typically):
"he went to the convenience store and bought a gallon of _____"
and the model should learn that the most likely answer is probably "milk".
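You can try this masked-word prediction with an off-the-shelf model. Here's a quick sketch using the Hugging Face transformers library; the model name is just one common choice, and the first run downloads the weights:

```python
# Masked-word ("leave one out") prediction with a pretrained BERT model.
# Requires: pip install transformers torch
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# BERT marks the missing word with its [MASK] token.
for pred in fill("he went to the convenience store and bought a gallon of [MASK]."):
    print(f"{pred['token_str']:>10}  score={pred['score']:.3f}")
```

The top candidates should be things commonly sold by the gallon, with "milk" near the top.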
Back to your first question: in 3D, the first two embeddings should be close together because the sentences are similar, and both should be far from the third. You can check this with real sentence embeddings, as in the sketch below.
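A quick check using the sentence-transformers library (the model name is just one common choice; note it produces 384-dimensional vectors, not 3, so you would project with something like PCA to actually plot them in 3D):

```python
# Compare the three "bank" sentences with a sentence-embedding model.
# Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["Bank deposit.", "Bank withdrawal.", "River bank."]
emb = model.encode(sentences)  # shape (3, 384)

# Cosine similarity matrix: expect the (deposit, withdrawal) pair to score
# noticeably higher than either pairing with "River bank."
print(util.cos_sim(emb, emb))
```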