Viewing a single comment thread. View all comments

utopiah t1_jbtx8iv wrote

> What we want is a model that can represent the "semantic content" or idea behind a sentence

We do but is it what embedding actually provide or rather some kind of distance between items, how they might relate or not between each other? I'm not sure that would be sufficient for most people to provide the "idea" behind a sentence, just relatedness. I'm not saying it's not useful but arguing against the semantic aspect here, at least from my understanding of that explanation.

2

Simusid OP t1_jbu0bkv wrote

>We do but is it what embedding actually provide or rather some kind of distance between items,

A single embedding is a single vector, encoding a single sentence. To identify a relationship between sentences, you need to compare vectors. Typically this is done with cosine distance between the vectors. The expectation is that if you have a collection of sentences that all talk about cats, the vectors that represent them will exist in a related neighborhood in the metric space.

2

utopiah t1_jbu0qpa wrote

Still says absolutely nothing if you don't know what a cat is.

−2

Simusid OP t1_jbu2n5w wrote

That was not the point at all.

Continuing the cat analogy, I have two different cameras. I take 20,000 pictures of the same cats with both. I have two datasets of 20,000 cats. Is one dataset superior to the other? I will build a model that tries to predict cats and see if the "quality" of one dataset is better than the other.

In this case, the OpenAI dataset appears to be slightly better.

5