
qalis t1_j5ukjvr wrote

ChatGPT does NOT retrieve any data from the internet. It merely remembers statistical patterns of which words follow one another in typical text. It has no knowledge of facts and no means of getting them. Its training data also ends in 2021, so it has seen nothing written after that. There was an older attempt at web access with WebGPT, but it did not get anywhere AFAIK.

What you need is a semantic search model, which encodes the semantic content of texts as vectors and then performs a vector search against your query. You can use a transformer-based model for text vectorization, of course, which may work reasonably well. For specific searches, however, I am pretty sure that regexes will be just fine in your use case.
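To make the two options concrete, here is a minimal sketch of both. The `embed` function is a toy bag-of-words stand-in for a real transformer encoder (an assumption for illustration only, so it matches shared words rather than meaning), and the documents and query are made up. The flow is the same as with a real model: vectorize the texts, vectorize the query, rank by cosine similarity.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for a transformer encoder: a sparse vector of word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search(query, docs, k=2):
    """Score every document against the query and return the k best (score, doc) pairs."""
    q = embed(query)
    scored = [(cosine(q, embed(d)), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Hypothetical documents, for illustration only.
docs = [
    "Protein folding predicted with deep learning",
    "Regex patterns for parsing log files",
    "Deep learning for protein structure prediction",
]

top = search("protein deep learning", docs, k=2)

# The regex alternative: an exact keyword match, no vectors involved.
pattern = re.compile(r"\bprotein\b", re.IGNORECASE)
regex_hits = [d for d in docs if pattern.search(d)]
```

Note that `search` scores every document on every query, which is fine for a small collection but is a full scan.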

If you are sure that you need semantic search, use a domain-specific model like SciBERT for best results, or fine-tune a pretrained model from Hugging Face.

7

Kacper-Lukawski t1_j5um27o wrote

Moreover, to run semantic search at scale you need a proper vector database, so that you avoid a kNN-like full scan for every query. Qdrant (https://qdrant.tech) is one of the options, probably the fastest according to benchmarks.

1

keisukegoda3804 t1_j5v6c7q wrote

How does Qdrant compare to other offerings with regard to filtered search?

1

Kacper-Lukawski t1_j5xp10a wrote

Each vector may have a payload object (https://qdrant.tech/documentation/payload/). Payload attributes can be used to put additional constraints on the search results (https://qdrant.tech/documentation/filtering/). The unique feature is that filtering is built into the vector search phase, so there is no need to pre- or post-filter the results.
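A rough sketch of why this matters, in plain Python rather than Qdrant's API. The points, the `year` payload field, and both function names are hypothetical, and the linear scan below only stands in for the index traversal a real engine would do. Post-filtering takes the top k first and then drops non-matching hits, so it can return fewer than k results; checking the payload condition while candidates are being scored does not have that problem.

```python
import heapq
import math

# Hypothetical points: each has an id, a vector, and a payload.
points = [
    {"id": 1, "vector": [1.0, 0.0], "payload": {"year": 2019}},
    {"id": 2, "vector": [0.9, 0.1], "payload": {"year": 2019}},
    {"id": 3, "vector": [0.7, 0.3], "payload": {"year": 2022}},
    {"id": 4, "vector": [0.0, 1.0], "payload": {"year": 2022}},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def post_filtered(query, k, keep):
    """Take the k nearest points first, then drop those that fail the filter."""
    nearest = sorted(points, key=lambda p: cosine(query, p["vector"]), reverse=True)[:k]
    return [p["id"] for p in nearest if keep(p["payload"])]

def filtered_during_search(query, k, keep):
    """Check the payload condition as each candidate is visited, keeping the k best matches."""
    best = []  # min-heap of (score, id), never larger than k
    for p in points:
        if not keep(p["payload"]):
            continue
        heapq.heappush(best, (cosine(query, p["vector"]), p["id"]))
        if len(best) > k:
            heapq.heappop(best)
    return [pid for _, pid in sorted(best, reverse=True)]

def is_recent(payload):
    return payload["year"] == 2022

query = [1.0, 0.0]
after = post_filtered(query, 2, is_recent)            # both nearest points are from 2019
during = filtered_during_search(query, 2, is_recent)  # still finds two matching points
```

Here `after` comes back empty even though two matching points exist, while `during` returns both of them.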

1

keisukegoda3804 t1_j5xy9x3 wrote

Do you happen to know how fast it is compared to other services that build filtering into their vector search (Pinecone, Milvus, etc.)?

1

Kacper-Lukawski t1_j5yineq wrote

I do not know of any benchmark that measures that. It would also be quite challenging to compare against a SaaS like Pinecone (both would have to run on the same infrastructure for the results to be comparable). When it comes to Milvus, as far as I know, it uses prefiltering for filtered search (https://github.com/milvus-io/milvus/discussions/12927). So it needs to store the ids of matching entries somewhere during the vector search phase, possibly even all the ids if your filtering criteria do not exclude anything.
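To illustrate the cost of prefiltering in general terms, here is a plain-Python sketch. It is not Milvus code; the data, the `lang` payload field, and the function names are made up. The filter runs first and produces a set of allowed ids, and the vector search is then restricted to that set. With a filter that excludes nothing, the set holds every id in the collection.

```python
# Hypothetical collection, for illustration only.
points = [
    {"id": 10, "vector": [0.2, 0.8], "payload": {"lang": "en"}},
    {"id": 11, "vector": [0.9, 0.1], "payload": {"lang": "de"}},
    {"id": 12, "vector": [0.6, 0.4], "payload": {"lang": "en"}},
]

def prefilter_ids(points, keep):
    """Step 1: evaluate the filter up front and materialize the matching ids."""
    return {p["id"] for p in points if keep(p["payload"])}

def search_allowed(points, query, allowed, k):
    """Step 2: rank by dot product, looking only at points whose id is allowed."""
    candidates = [p for p in points if p["id"] in allowed]
    candidates.sort(
        key=lambda p: sum(q * v for q, v in zip(query, p["vector"])),
        reverse=True,
    )
    return [p["id"] for p in candidates[:k]]

english = prefilter_ids(points, lambda payload: payload["lang"] == "en")
everything = prefilter_ids(points, lambda payload: True)  # filter excludes nothing

hits = search_allowed(points, [1.0, 0.0], english, k=1)
```

The `everything` set is as large as the collection itself, which is the worst case described above.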

1