Submitted by darkbluetwilight t3_123j77g in MachineLearning

Step 1 in my efforts to have a robot do my job for me :P has led to a successful implementation of LlamaIndex. I used "GPTSimpleVectorIndex" to read a folder of 140 procedures (1 million tokens) into a single json, which I can then query with "index.query". It works flawlessly, giving me excellent responses. However, it costs quite a bit: anywhere from 0 to 30c per query. I think this comes down to it using Davinci-003 rather than GPT-3.5 Turbo, which does not appear to be implemented in LlamaIndex yet. It also appears to always use the full whack of 4,096 tokens.

Just wondering if there is a way of keeping the price down without imposing a smaller max-token limit? I was thinking of maybe using some form of lemmatization or POS tagging to condense the context down as much as possible, but I'm not sure if this would harm the accuracy. Any suggestions appreciated!
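For illustration, a rough sketch of the kind of condensing I had in mind, using NLTK (completely untested, and as said above it may well hurt accuracy):
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('punkt')  # tokenizer models
nltk.download('stopwords')  # stopword list
nltk.download('wordnet')  # lemmatizer data
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
def condense(text):
    # drop stopwords and lemmatize what's left, to shave tokens off the context
    tokens = word_tokenize(text)
    kept = [lemmatizer.lemmatize(t) for t in tokens if t.lower() not in stop_words]
    return ' '.join(kept)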

Update: thanks to @supreethrao, GPT-3.5-Turbo is in fact implemented in llama-index. The price per request was instantly cut to one tenth. Just use these lines in Python when building your index:
from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader, LLMPredictor
from langchain.llms import OpenAIChat
data = SimpleDirectoryReader('database').load_data()  # 'database' is the folder that contains your documents
llm_predictor = LLMPredictor(llm=OpenAIChat(temperature=0.7, model_name="gpt-3.5-turbo"))  # set the model parameters
index = GPTSimpleVectorIndex(data, llm_predictor=llm_predictor)  # create the index
response = index.query("How to create an engineering drawing?")  # query the index
print(response)
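To avoid re-embedding the whole folder on every run I also persist the index and reload it later; a minimal sketch (the save/load method names follow the llama-index version I'm on, so they may differ in newer releases):
index.save_to_disk('index.json')  # write the whole index out as a single json file
index = GPTSimpleVectorIndex.load_from_disk('index.json', llm_predictor=llm_predictor)  # reload later instead of rebuilding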
Update 2: After using the robot for a while, I've found that the responses from GPT-3.5-Turbo have been very basic and unhelpful. It often says "yes, the context contains the information you are asking about". Other times it just says "the context does not have the information to answer that question", which is untrue, as I have the program print the context to the console and it always contains very apt information for answering the query. Not sure if it's just not getting enough tokens to answer my query or if there is something in GPT-3.5's architecture that is just not well suited to this task. Will have to do a bit more trial and error to figure it out.
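For reference, this is roughly how I print the retrieved context alongside the answer (the attribute names follow the llama-index version I'm using, so they may differ elsewhere):
response = index.query("How to create an engineering drawing?")
print(response)  # the generated answer
print(response.get_formatted_sources())  # the context chunks that were actually retrieved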

Comments

supreethrao t1_jdv1whe wrote

Hi, there's already support for 'gpt-3.5-turbo' in llama-index; the examples can be found in the git repo. You can also switch from SimpleVectorIndex to a TreeIndex; this could lower your cost.
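Something along these lines, as a minimal sketch (GPTTreeIndex is the class name in the llama-index docs of this era; note that building the tree makes LLM calls of its own, so weigh the one-off build cost against the cheaper queries):
from llama_index import GPTTreeIndex, SimpleDirectoryReader
documents = SimpleDirectoryReader('database').load_data()
index = GPTTreeIndex(documents)  # builds a summary tree over the documents
response = index.query("How to create an engineering drawing?")  # a query walks down the tree, touching fewer nodes per call
print(response)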

darkbluetwilight OP t1_jdv9560 wrote

You are a gentleman! There doesn't appear to be any documentation in the llama-index docs yet, but support is added via the langchain module. It looks like I can "from langchain.llms import OpenAIChat" and then use this class to build a new index with the "gpt-3.5-turbo" model. I will give this a go and see if it works. I will look into TreeIndex too; reading the docs around these different indexing tools was getting a bit too complex for me.

Smallpaul t1_jdwt4ao wrote

So I guess LlamaIndex has nothing to do with Meta's LLaMA except that they both have "LLM" in their names? They switched from one confusing name to another!

supreethrao t1_jdy0xtd wrote

Hi, to address Update 2: I think you'll have to change your prompt to GPT-3.5-Turbo significantly. Llama-index also has a cost estimator function that assumes a dummy LLM backend and calculates the expected cost. You can also use OpenAI's tokenizer, "tiktoken", which is available on GitHub, to calculate the exact number of tokens your text produces.
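A minimal tiktoken sketch for the counting part:
import tiktoken
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")  # pick the encoding that matches the model
num_tokens = len(enc.encode("your prompt or context text here"))
print(num_tokens)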

rshah4 t1_jdy111n wrote

How about using embeddings from open-source models, like those on Hugging Face? That would save on your embedding costs.
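For example, roughly like this (the LangchainEmbedding wrapper and the default sentence-transformers model are taken from the llama-index/langchain docs of the time, so treat it as a sketch):
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index import GPTSimpleVectorIndex, LangchainEmbedding, SimpleDirectoryReader
embed_model = LangchainEmbedding(HuggingFaceEmbeddings())  # defaults to a sentence-transformers model that runs locally
documents = SimpleDirectoryReader('database').load_data()
index = GPTSimpleVectorIndex(documents, embed_model=embed_model)  # embeddings now computed locally, no API cost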

darkbluetwilight OP t1_jdy5v9e wrote

I think you are correct. I started being more specific in my prompts, even telling it what level of detail I wanted back, and it is showing a lot more promise now; the responses are much more useful. Makes me a little concerned for when I'm asking it about things I'm less familiar with, though; I might need to fall back to Davinci.
I wonder why Davinci comes across as more intelligent than GPT-3.5? Maybe the reduced cost has something to do with it: less compute power behind it, maybe?
With regard to the token count, my program is a lot more complex than the code I provided in the OP, with a lot of context and token management features in there already, so I was able to rule out potential token availability issues.
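In case it helps anyone else, one way to bake those instructions in is a custom QA prompt; a rough sketch (QuestionAnswerPrompt and the text_qa_template argument come from the llama-index docs of the version I'm on, so check against your own):
from llama_index import QuestionAnswerPrompt
QA_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Using only the context above, give a detailed, step-by-step answer to: {query_str}\n"
)
qa_prompt = QuestionAnswerPrompt(QA_TEMPLATE)
response = index.query("How to create an engineering drawing?", text_qa_template=qa_prompt)  # reuses the index from the OP snippet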

darkbluetwilight OP t1_jdy6eu6 wrote

Nice suggestion, thanks! Llama-index currently uses an embedding version of Ada, which has negligible pricing ($0.0002/1,000 tokens, I think). The once-off index creation (1.3 million tokens) cost about 40c.
It was the AI text generation costs that were killing me.

machineko t1_je888dc wrote

Why not use open-source models? Especially since it seems like you are not trying to sell the model for commercial purposes, you can easily replace it with an open-source model. Also, for retrieval-augmented generation, smaller models can be very effective.

darkbluetwilight OP t1_je99g95 wrote

Correct, it's for personal use only. I did look into a few different options (Hugging Face, Alpaca, BERT, Chinchilla, Cerebras) but they all appear to have charges too, with the exception of Alpaca, which was taken down. I already had OpenAI nicely implemented in my GUI, so I wasn't really drawn to any of them.
Can you suggest a model that is free or cheaper than OpenAI that I could integrate into my Python GUI?
On the database side I tried MongoDB and Atlas but found these very difficult to use. Since I only need to generate the database once, Llama-index was fine to use.

machineko t1_jecw2v4 wrote

Cerebras-GPT models are Apache-2.0. You should be able to use them for free. Not sure what you mean by charges. Are you referring to using the hosted APIs?

Btw, you should use the ones that are instruction fine-tuned.
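A minimal sketch of pulling one of the published checkpoints from the Hugging Face Hub (the 1.3B model name here is just one of the available sizes):
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "cerebras/Cerebras-GPT-1.3B"  # one of the Apache-2.0 checkpoints on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
inputs = tokenizer("How do I create an engineering drawing?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))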
