Comments

qalis t1_j5ukjvr wrote

ChatGPT does NOT retrieve any data at all from the internet. It merely remembers statistical patterns of which words tend to follow one another in typical text. It has no knowledge of facts, and no means to get them whatsoever. It was also trained with data up to 2021, so there is no training data after that whatsoever. There was an older attempt with WebGPT, but it did not get anywhere AFAIK.

What you need is a semantic search model, which encodes the semantic content of texts as vectors and then performs a vector search based on your query. You can use a transformer-based model for text vectorization, of course, which may work reasonably well. For specific searches, however, I am pretty sure that regexes will be just fine in your use case.
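To make the "texts as vectors, then vector search" idea concrete, here is a minimal sketch in plain Python. The `embed` function is a toy bag-of-words stand-in and NOT a real semantic model (it only matches shared words); in practice you would swap in a transformer-based encoder. The note texts and function names are made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a sparse bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, notes, k=2):
    """Brute-force scan: score every note against the query, return the top k."""
    q = embed(query)
    ranked = sorted(notes, key=lambda n: cosine(q, embed(n)), reverse=True)
    return ranked[:k]


# Hypothetical notes, just for the example.
notes = [
    "Enzyme kinetics: Michaelis-Menten constants for the assay",
    "Grocery list: eggs, milk, coffee",
    "Assay protocol: incubate the enzyme at 37 C for ten minutes",
]
top = search("enzyme assay", notes, k=2)
```

Note that `search` scores every note on every query, which is fine for a personal notes folder but is exactly the kind of full scan that stops scaling for large collections.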

If you are sure that you need semantic search, use a domain-specific model like SciBERT for best results, or fine-tune a pretrained model from Hugging Face.

7

Kacper-Lukawski t1_j5um27o wrote

Moreover, to run semantic search at scale you need a proper vector database, so that you avoid a kNN-style full scan for every query. Qdrant (https://qdrant.tech) is one of the options, probably the fastest according to benchmarks.

1

keisukegoda3804 t1_j5v6c7q wrote

How does qdrant compare to other offerings with regards to filtered search?

1

Kacper-Lukawski t1_j5xp10a wrote

Each vector may have a payload object: https://qdrant.tech/documentation/payload/ Payload attributes can be used to put additional constraints on the search results: https://qdrant.tech/documentation/filtering/ The unique feature is that filtering is built into the vector search phase, so there is no need to pre- or postfilter the results.
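A rough illustration of why it matters where the filter is applied. This is a plain-Python brute-force scan, not Qdrant's index or API; the point layout (`id`, `vector`, `payload`) and all function names are assumptions made up for the example. In a brute-force scan the difference between filtering during the search and prefiltering collapses, so this only shows the failure mode of postfiltering.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    return dot / norm if norm else 0.0


# Hypothetical points, each with a vector and a payload.
points = [
    {"id": 1, "vector": [1.0, 0.0], "payload": {"year": 2019}},
    {"id": 2, "vector": [0.9, 0.1], "payload": {"year": 2019}},
    {"id": 3, "vector": [0.6, 0.4], "payload": {"year": 2022}},
    {"id": 4, "vector": [0.0, 1.0], "payload": {"year": 2022}},
]


def is_recent(payload):
    """Example payload condition."""
    return payload["year"] >= 2021


def post_filtered(query, k, keep):
    """Take the top k by similarity first, then drop points failing the filter."""
    top = sorted(points, key=lambda p: cosine(query, p["vector"]), reverse=True)[:k]
    return [p["id"] for p in top if keep(p["payload"])]


def filtered_during_search(query, k, keep):
    """Check the payload condition while scoring, so k slots go to matching points."""
    candidates = (p for p in points if keep(p["payload"]))
    top = sorted(candidates, key=lambda p: cosine(query, p["vector"]), reverse=True)[:k]
    return [p["id"] for p in top]


query = [1.0, 0.0]
after = post_filtered(query, 2, is_recent)
during = filtered_during_search(query, 2, is_recent)
```

Here the two nearest points both fail the filter, so postfiltering returns nothing even though two matching points exist, while applying the condition during the search still fills both slots.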

1

keisukegoda3804 t1_j5xy9x3 wrote

Do you happen to know how fast it is compared to other services that build filtering into their vector search (Pinecone, Milvus, etc.)?

1

Kacper-Lukawski t1_j5yineq wrote

I do not know of any benchmark that measures that. It would also be quite challenging to compare against a SaaS like Pinecone (it would have to run on the same infrastructure for the results to be comparable). When it comes to Milvus, as far as I know, they use prefiltering for filtered search (https://github.com/milvus-io/milvus/discussions/12927). So they need to store the ids of matching entries somewhere during the vector search phase, possibly even all the ids if your filtering criteria do not exclude anything.

1

EmmyNoetherRing t1_j5ulz6t wrote

So-- a few things

ChatGPT doesn't currently have access to the internet, although it's obviously working with data it scraped in the recent past, and I expect searching Wikipedia from 2021 is sufficient to answer a wide array of queries, which is why it feels like it has internet access when you ask it questions.

ChatGPT is effective because it's been trained on an unimaginably large set of data, and because an unknown but large number of human hours has gone into supervised/interactive/online/reinforcement/(whatever) learning, in which an army of contractors has taught it to deal well with arbitrary human prompts. You don't really want an AI trained just on your data set by itself.

But ChatGPT (or just plain GPT3) is great for summarizing bodies of text as it is right now. I expect you should be able to google how to nicely ask GPT3 to summarize your notes or answer questions with respect to them.

7

dancingnightly t1_j5v5zwe wrote

The internet isn't accessed live by most of these models, as others have said.

You can fine-tune language models, but that doesn't add knowledge or facts to them as such; it biases them to output words in an order similar to your sample data.

One approach you can take, though, is semantic search through your notes for a given topic/search query. You basically collect the relevant notes whose meaning is similar to your topic/search query. Then you populate a prompt with that text. The answer will use that information and any facts in it, if the model is big enough and RLHF-tuned (like ChatGPT/Instruct/text-00x models from OpenAI).
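The "populate a prompt with that text" step could look roughly like this. It is a sketch only: `build_prompt`, the instruction wording, and the character budget are all assumptions, the retrieval step is taken as already done, and no model is actually called.

```python
def build_prompt(question, retrieved_notes, max_chars=2000):
    """Pack retrieved notes into a prompt until the character budget is used up."""
    context = []
    used = 0
    for i, note in enumerate(retrieved_notes, start=1):
        entry = f"[{i}] {note}"
        if used + len(entry) > max_chars:
            break  # stop before overflowing the (assumed) context budget
        context.append(entry)
        used += len(entry)
    parts = [
        "Answer the question using only the notes below.",
        "",
        "Notes:",
        *context,
        "",
        f"Question: {question}",
        "Answer:",
    ]
    return "\n".join(parts)


# Hypothetical retrieved notes, most relevant first.
prompt = build_prompt(
    "At what temperature is the enzyme incubated?",
    ["Assay protocol: incubate the enzyme at 37 C for ten minutes"],
)
```

The resulting string is what you would send to the language model; putting the most relevant notes first means the budget cut-off drops the least relevant ones.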

An open source module for this is GPTIndex. I also work on a commercial solution, which covers videos etc. too and has some optimisations. You can also add data/facts from the internet to the prompt (context) at generation time, using an approach like WebGPT.

3

waterstrider123 t1_j5zdw3p wrote

Thanks, but I guess I should also mention I was looking for a free solution

1

stargazer1Q84 t1_j5vazpd wrote

If you want to go the semantic search route, make sure to check out the deepset.ai Haystack framework in conjunction with a sentence-transformer. They make semantic document retrieval very easy to set up, and there are many high-performing pre-trained models for semantic search on Hugging Face.

1