utopiah t1_jbtx8iv wrote on March 11, 2023 at 6:06 PM

Reply to comment by Simusid in [Discussion] Compare OpenAI and SentenceTransformer Sentence Embeddings by Simusid

> What we want is a model that can represent the "semantic content" or idea behind a sentence

We do, but is that what embeddings actually provide, or do they rather provide some kind of distance between items, i.e. how they might or might not relate to each other? I'm not sure most people would consider that sufficient to capture the "idea" behind a sentence; it only captures relatedness. I'm not saying it's not useful, just arguing against the semantic aspect here, at least from my understanding of that explanation.

Simusid OP t1_jbu0bkv wrote on March 11, 2023 at 6:28 PM

> We do, but is that what embeddings actually provide, or do they rather provide some kind of distance between items,

A single embedding is a single vector, encoding a single sentence. To identify a relationship between sentences, you need to compare their vectors. Typically this is done with the cosine distance between the vectors. The expectation is that if you have a collection of sentences that all talk about cats, the vectors that represent them will lie close together in the same neighborhood of the embedding space.
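To make the comparison concrete, here is a minimal sketch of cosine distance in Python with NumPy. The three 4-dimensional vectors are made up for illustration; real sentence embeddings have hundreds or thousands of dimensions and would come from a model, not be typed in by hand.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a, b):
    """Cosine distance, defined here as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)

# Hypothetical toy "embeddings" (assumed values, not model output)
cat_1 = [0.9, 0.1, 0.0, 0.2]   # e.g. a sentence about cats
cat_2 = [0.8, 0.2, 0.1, 0.3]   # another sentence about cats
stocks = [0.0, 0.9, 0.8, 0.1]  # a sentence about something unrelated

print("cat_1 vs cat_2: ", cosine_distance(cat_1, cat_2))
print("cat_1 vs stocks:", cosine_distance(cat_1, stocks))
```

With these toy values the two "cat" vectors end up much closer to each other than either is to the unrelated one, which is the neighborhood behavior described above.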

utopiah t1_jbu0qpa wrote on March 11, 2023 at 6:31 PM

Still says absolutely nothing if you don't know what a cat is.

Simusid OP t1_jbu2n5w wrote on March 11, 2023 at 6:44 PM

That was not the point at all.

Continuing the cat analogy, I have two different cameras. I take 20,000 pictures of the same cats with both. I have two datasets of 20,000 cats. Is one dataset superior to the other? I will build a model that tries to predict cats and see if the "quality" of one dataset is better than the other.

In this case, the OpenAI dataset appears to be slightly better.
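A rough sketch of that kind of comparison, under loud assumptions: the two "embedding sets" below are synthetic random vectors (not OpenAI or SentenceTransformer output), the labels are invented, and the nearest-centroid classifier is just a simple stand-in, not the model used in the original post. The only point is the procedure: same labels, same classifier, two different representations, compare held-out accuracy.

```python
import numpy as np

def nearest_centroid_accuracy(X_train, y_train, X_test, y_test):
    """Fit one centroid per class, then score held-out vectors by nearest centroid."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    # Distance from every test vector to every class centroid
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    preds = classes[dists.argmin(axis=1)]
    return float((preds == y_test).mean())

def make_fake_embeddings(labels, noise, rng, dim=16):
    """Synthetic stand-in for an embedding model: class center plus Gaussian noise."""
    centers = np.where(labels[:, None] == 1, 1.0, -1.0)
    return centers + rng.normal(scale=noise, size=(len(labels), dim))

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)            # invented binary labels
emb_a = make_fake_embeddings(labels, 0.5, rng)    # hypothetical "model A" (less noisy)
emb_b = make_fake_embeddings(labels, 3.0, rng)    # hypothetical "model B" (noisier)

split = 800  # same train/test split for both embedding sets
acc_a = nearest_centroid_accuracy(emb_a[:split], labels[:split], emb_a[split:], labels[split:])
acc_b = nearest_centroid_accuracy(emb_b[:split], labels[:split], emb_b[split:], labels[split:])

print("accuracy with embedding set A:", acc_a)
print("accuracy with embedding set B:", acc_b)
```

Whichever set gives the higher held-out accuracy is the "better" representation for that particular task, which says nothing about whether it is better in general.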
