I just launched searchthearxiv.com, a simple semantic search engine over virtually all ML papers published on arXiv since 2012. The site uses OpenAI's `text-embedding-ada-002` model to match the embedding of your query against each of the paper embeddings, retrieving the ones with the highest cosine similarity. It also allows you to insert an arXiv link to find similar papers.
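For anyone curious what the retrieval step looks like, here is a minimal sketch of cosine-similarity search over precomputed embeddings. It is an illustration rather than the site's actual code: the names `cosine_similarity`, `top_k`, and the toy 3-dimensional vectors are assumptions made for the example, and real embeddings would come from the embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_embedding, paper_embeddings, k=10):
    """Return the k (paper_id, similarity) pairs with the highest similarity."""
    scored = [
        (paper_id, cosine_similarity(query_embedding, embedding))
        for paper_id, embedding in paper_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data (hypothetical): real embeddings have far more dimensions.
papers = {
    "paper_a": [1.0, 0.0, 0.0],
    "paper_b": [0.0, 1.0, 0.0],
    "paper_c": [0.7, 0.7, 0.0],
}
results = top_k([1.0, 0.1, 0.0], papers, k=2)
```

A brute-force scan like this is fine for a toy example; a larger collection would typically use a vector index instead.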

This was mostly meant as a fun side project. However, if people find it useful, I'm happy to maintain it and keep the database up to date. I'd love to know what you think! ❤️

Update: Thanks to u/ml-research for pointing out that some papers were excluded from search results regardless of the search query. This was due to a bug in the way the database was queried, and should now be fixed.

Comments

coumineol t1_j3hm4t4 wrote on January 8, 2023 at 5:18 PM

#1,309,955

Is that meaningfully better than just googling?

dreaming_geometry t1_j3hnyvd wrote on January 8, 2023 at 5:29 PM

#1,310,018

Replying to coumineol (#1,309,955)

Data not yet collected. Why don't you try some side-by-side comparisons and report back?

jakderrida t1_j3hsusc wrote on January 8, 2023 at 6:00 PM

#1,310,186

I looked up clown porn and I didn't find anything useful.

[deleted] t1_j3i3rwu wrote on January 8, 2023 at 7:06 PM

#1,310,508

[deleted]

universal_explainer OP t1_j3if5d5 wrote on January 8, 2023 at 8:15 PM

#1,310,865

Replying to coumineol (#1,309,955)

Might be in some cases, maybe not in others. Anecdotally, a query like "model using only attention mechanism site:arxiv.org" on Google doesn't bring up "Attention Is All You Need", while it does here. Aside from that, it might be a useful resource for finding similar papers based on an arXiv link.

universal_explainer OP t1_j3ij9jz wrote on January 8, 2023 at 8:40 PM

#1,310,980

Replying to [deleted] (#1,310,508)

Hey, thanks for trying it out!

First, do you mind sharing an example of different queries that return the same results? I have not been able to reproduce that (unless, of course, the queries are semantically similar, in which case that would be expected).

Also, of course exact search is far superior if you know the title of the paper you are looking for! In that regime, Google Scholar wins every time. However, semantic search might be better if you either a) can't remember the title but do remember some of the content or b) are simply looking to explore papers based on a handful of keywords.

Finally, the size of the database has no bearing on the quality of the embeddings, since I'm using the pretrained model by OpenAI. There is no notion of "popularity", except that the 10 papers with the highest cosine similarity to the query embedding are then ranked according to a citation score (if it's available).
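To make that last step concrete, here is a rough sketch of re-ranking the top candidates by citation score. The function name `rerank_by_citations` and the sample numbers are made up for illustration, and placing papers without a score at the end is an assumption of this sketch, not a description of how the site handles them.

```python
def rerank_by_citations(candidates, citation_scores):
    """Reorder (paper_id, similarity) pairs by citation score, highest first.

    `candidates` is expected to be sorted by similarity already. Papers with
    no citation score go last (an assumption of this sketch). Python's sort
    is stable, so ties and unscored papers keep their similarity order.
    """
    def sort_key(pair):
        score = citation_scores.get(pair[0])
        return (score is None, -(score or 0))

    return sorted(candidates, key=sort_key)


# Hypothetical candidates, already sorted by cosine similarity.
candidates = [("paper_a", 0.91), ("paper_b", 0.88), ("paper_c", 0.85)]
citation_scores = {"paper_b": 120, "paper_c": 5}  # paper_a has no score
reranked = rerank_by_citations(candidates, citation_scores)
```

Because the re-ranking only touches the candidates that similarity search already selected, citation counts affect the order of results but not which papers are retrieved.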

[deleted] t1_j3inll7 wrote on January 8, 2023 at 9:05 PM

#1,311,111

Replying to universal_explainer (#1,310,980)

[deleted]

ml-research t1_j3l36cj wrote on January 9, 2023 at 8:18 AM

#1,314,571

Does it omit some papers if it fails to parse them? Because I cannot find some arXiv papers.

universal_explainer OP t1_j3les1h wrote on January 9, 2023 at 11:00 AM

#1,314,979

Replying to ml-research (#1,314,571)

Are you talking about when inserting an arXiv link to find similar papers? In that case, it is important that the paper being referenced is already stored in the database. If it's a very recent paper (as in less than a week or two old), it won't work. This should be easy to fix, though, by simply scraping the abstract from arxiv.org and using it as the query.
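A sketch of what that fallback could look like, assuming new-style arXiv IDs (such as `1706.03762`) only. Everything here is hypothetical: `stored_abstracts` stands in for the database, and `fetch_abstract` is a placeholder for whatever code would actually retrieve the abstract from arxiv.org.

```python
import re

# Matches new-style IDs in abs/ or pdf/ links; a version suffix is ignored.
ARXIV_ID = re.compile(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?")


def extract_arxiv_id(link):
    """Return the arXiv ID found in a link, or None if there is none."""
    match = ARXIV_ID.search(link)
    return match.group(1) if match else None


def query_text_for_link(link, stored_abstracts, fetch_abstract):
    """Pick the text to embed as the query for a 'similar papers' search.

    Uses the stored abstract when the paper is already in the database and
    otherwise falls back to `fetch_abstract`, a caller-supplied function
    (hypothetical here) that retrieves the abstract for an arXiv ID.
    """
    arxiv_id = extract_arxiv_id(link)
    if arxiv_id is None:
        raise ValueError(f"not a recognised arXiv link: {link!r}")
    if arxiv_id in stored_abstracts:
        return stored_abstracts[arxiv_id]
    return fetch_abstract(arxiv_id)
```

Passing the fetcher in as an argument keeps the lookup logic testable without any network access.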

If you're talking about searching for specific papers, I'd be interested to know the queries and the desired result. Feel free to post it here or in a DM 🙂

fakesoicansayshit t1_j47vdrd wrote on January 13, 2023 at 7:49 PM

#1,353,591

Replying to universal_explainer (#1,310,980)

Man, connect it to a chatbot after fine-tuning it with the citation counts as a human feedback input and you've got yourself an uncensored, local ML assistant!

Willing to share embeddings?