
Simusid OP t1_jbt91tb wrote

My main goal was just to visualize the embeddings to see if they are grossly different. They are not. That is just a qualitative view. My second goal was to use the embeddings with a trivial supervised classifier. The dataset is labeled with four labels, so I made a generic network to see if there was any consistency in the training. Regardless of hyperparameters, the OpenAI embeddings seemed to always outperform the SentenceTransformer embeddings, slightly but consistently.

This was not meant to be rigorous. I did this to get a general feel of the quality of the embeddings, plus to get a little experience with the OpenAI API.
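For concreteness, here is a minimal sketch of that kind of comparison. The dataset (a four-category slice of 20 Newsgroups), the model names, and the logistic-regression classifier are all illustrative assumptions, not the OP's exact pipeline:

```python
# Hedged sketch: compare OpenAI vs. SentenceTransformer embeddings with a
# trivial supervised classifier. Dataset, models, and classifier are
# assumptions for illustration, not the OP's actual setup.
from openai import OpenAI
from sentence_transformers import SentenceTransformer
from sklearn.datasets import fetch_20newsgroups
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# A small four-label dataset, standing in for the OP's labeled data.
cats = ["sci.med", "sci.space", "rec.autos", "talk.politics.misc"]
data = fetch_20newsgroups(subset="train", categories=cats,
                          remove=("headers", "footers", "quotes"))
texts = [t[:2000] for t in data.data[:400]]  # truncate to stay under token limits
labels = data.target[:400]

# SentenceTransformer embeddings (all-MiniLM-L6-v2 is one common choice).
st_emb = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)

# OpenAI embeddings (the client reads OPENAI_API_KEY from the environment).
client = OpenAI()
resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
oai_emb = [d.embedding for d in resp.data]

# The same trivial classifier on both embedding sets.
for name, emb in [("sentence-transformers", st_emb), ("openai", oai_emb)]:
    scores = cross_val_score(LogisticRegression(max_iter=1000), emb, labels, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```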

30

quitenominal t1_jbtr6g7 wrote

FWIW this has also been my finding when comparing these two embeddings for classification tasks. The OpenAI ones were better, but not by enough to justify the cost

8

polandtown t1_jbu2zqe wrote

Learning here, but how are your axes defined? Some kind of factor(s) or component(s) extracted from each individual embedding? Thanks for the visualization, as it made me curious and interested! Good work!

6

Simusid OP t1_jbu3q8m wrote

Here is some explanation about UMAP axes and why they should usually be ignored: https://stats.stackexchange.com/questions/527235/how-to-interpret-axis-of-umap

Basically it's because UMAP is a nonlinear reduction: distances and directions along the axes don't carry a consistent meaning the way PCA components do, so the coordinates themselves aren't interpretable.
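As a concrete illustration (my own sketch, not from the linked answer): two UMAP runs on the same data with different seeds produce very different coordinates, even though much of the local neighbor structure survives, which is why the axes themselves shouldn't be read literally.

```python
# Illustrative sketch (assumed example): UMAP coordinates are arbitrary
# across runs, but local neighborhoods are largely preserved.
import numpy as np
import umap  # pip install umap-learn
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=500, n_features=50, centers=5, random_state=0)

# Same data, two different random seeds.
emb_a = umap.UMAP(random_state=1).fit_transform(X)
emb_b = umap.UMAP(random_state=2).fit_transform(X)

def knn_sets(emb, k=10):
    # Each point's k nearest neighbors in the 2-D layout (excluding itself).
    idx = NearestNeighbors(n_neighbors=k + 1).fit(emb).kneighbors(
        emb, return_distance=False)[:, 1:]
    return [set(row) for row in idx]

overlap = np.mean([len(s & t) / 10
                   for s, t in zip(knn_sets(emb_a), knn_sets(emb_b))])
# Coordinates disagree between runs, yet neighborhoods largely agree.
print(f"x-coordinate correlation: {np.corrcoef(emb_a[:, 0], emb_b[:, 0])[0, 1]:+.2f}")
print(f"mean 10-NN overlap:       {overlap:.2f}")
```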

12

onkus t1_jbwftny wrote

Doesn’t this also make it essentially impossible to compare the two figures you’ve shown?

6

Thog78 t1_jbyh4w1 wrote

What you're looking for when comparing UMAPs is whether the local relationships are the same. Try to recognize clusters and check their neighbors, and whether they stay distinct. A finer-grained clustering computed on another reduction (typically PCA) and used to color both plots helps with that; a sketch follows below. Without clustering, you can only try to recognize landmarks by their size and shape.
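A rough sketch of that workflow (the synthetic embeddings and all parameters are assumptions): cluster once in a PCA-reduced space, then color both UMAP layouts with the same cluster labels so the local structure can be compared by eye.

```python
# Hedged sketch: one shared PCA-based clustering used to color two UMAP
# layouts for visual comparison. The random matrices below are stand-ins
# for the two embedding sets being compared.
import matplotlib.pyplot as plt
import numpy as np
import umap
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
emb_a = rng.normal(size=(1000, 384))                     # stand-in: embeddings A
emb_b = emb_a + rng.normal(scale=0.1, size=emb_a.shape)  # stand-in: embeddings B

# Cluster once, in a PCA space, and reuse the labels for both plots.
pca = PCA(n_components=20).fit_transform(emb_a)
clusters = KMeans(n_clusters=15, n_init=10, random_state=0).fit_predict(pca)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, emb, title in zip(axes, (emb_a, emb_b), ("embeddings A", "embeddings B")):
    xy = umap.UMAP(random_state=0).fit_transform(emb)
    ax.scatter(xy[:, 0], xy[:, 1], c=clusters, s=4, cmap="tab20")
    ax.set_title(title)
plt.show()
```

Because both panels share one set of cluster labels, a cluster that stays contiguous in one layout but fragments in the other is immediately visible, which is exactly the kind of local comparison UMAP supports.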

2