Submitted by _underlines_ t3_zstequ in MachineLearning

Edit: Found LAION-AI/Open-Assistant, a very promising project open-sourcing the idea of ChatGPT. Video here

TL;DR: I found GPU compute to be generally cheap: spot or on-demand instances with over 100 GB of VRAM can be launched on AWS for a few USD per hour. So I thought it would make sense to run your own SOTA LLM, like a BLOOMZ 176B inference endpoint, whenever you need it for a few questions. That still seems more sensible than shoving money into a closed walled garden like "not-so-OpenAI" once they make ChatGPT or GPT-4 available for $$$. But I'm struggling due to a lack of tutorials/resources.

Therefore, I carefully checked benchmarks, model parameters and sizes as well as training sources for all SOTA LLMs here.

Since reading the Chinchilla paper I've known that OpenAI's original take on model scaling was wrong and that more params != better quality generation. So I was looking for the best-performing openly available LLM, in terms of quality and broadness, to use for multilingual everyday questions/code completion/reasoning, similar to what ChatGPT provides (minus the fine-tuning for chat-style conversations).

My choice fell on BLOOMZ, because it handles multilingual questions well and has good zero-shot performance for instructions and Q&A-style text generation. Confusingly, Galactica seems to outperform BLOOM on several benchmarks, but since Galactica had a very narrow training set consisting only of scientific papers, I guess its usefulness is limited for answers on non-scientific topics.

Therefore I tried running the original BLOOM 176B, and alternatively BLOOMZ 176B, on AWS SageMaker JumpStart, which should be a one-click deployment. It fails after 20 minutes. On Azure ML I tried DeepSpeed-MII, which also supports BLOOM, but it fails as well, I guess due to the maximum instance size of 12 GB VRAM.

From my understanding, to save costs on inference it's probably possible to use one or more of the following solutions:

  • Precision: int8 instead of fp16 (see the sketch after this list)
  • Microsoft/DeepSpeed-MII for up to a 40x reduction in inference cost on Azure; it also supports int8 and fp16 BLOOM out of the box, but it fails on Azure due to instance size.
  • facebookresearch/xformers: not sure, but if I remember correctly this brought inference requirements down to 4 GB VRAM for Stable Diffusion, and DreamBooth fine-tuning down to 10 GB. No idea if this is useful for BLOOM(Z) inference cost reduction though.
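
For the int8 route, this is roughly what I'd expect the transformers + bitsandbytes call to look like (a sketch based on their docs; I haven't gotten it to run on the 176B checkpoint myself, and even in int8 it still needs several large GPUs):

    # Sketch: load BLOOMZ with int8 weights via bitsandbytes
    # (assumes transformers, accelerate and bitsandbytes are installed)
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "bigscience/bloomz"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",   # shard layers across all visible GPUs
        load_in_8bit=True,   # int8 weights, roughly halves VRAM vs fp16
    )

    inputs = tokenizer("Translate to French: I love open source.", return_tensors="pt").to(0)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=30)[0]))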

I have a CompSci background but I'm not familiar with most of this stuff, except that I've been running Stable Diffusion since day one on my RTX 3080 under Linux and doing fine-tuning with DreamBooth. But that was all just following YouTube tutorials. I can't find a single post or YouTube video of anyone explaining a full BLOOM / Galactica / BLOOMZ inference deployment on cloud platforms like AWS/Azure using one of the optimizations mentioned above, let alone deployment of the raw model. :(

I still can't figure it out by myself after 3 days.

TL;DR2: Trying to find like-minded people who are interested in running open-source SOTA LLMs, for when ChatGPT becomes paid or just for fun.

Any comments, inputs, rants, counter-arguments are welcome.

/end of rant

326

Comments


londons_explorer t1_j1a3zrf wrote

I've got a feeling ChatGPT benefits massively from its human-curated fine-tuning feedback loop.

That's hard to reproduce without tens of thousands of man-hours upvoting/downvoting/editing the bot's responses.

178

satireplusplus t1_j1afqub wrote

This ^^

Compared to GPT3, ChatGPT is a huge step up. There is basically an entire new reward network, as large as the LM, that is able to judge the quality of the answers. See https://cdn.openai.com/chatgpt/draft-20221129c/ChatGPT_Diagram.svg
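
Conceptually the reward model is just an LM backbone with a scalar head that scores a (prompt, answer) pair; something like this toy sketch (my illustration, obviously not OpenAI's actual code, and a real one would be trained on human preference comparisons):

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token

    # single scalar "how good is this answer" output
    reward_model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
    reward_model.config.pad_token_id = tokenizer.pad_token_id

    inputs = tokenizer(
        "Q: Why is the sky blue?\nA: Because of Rayleigh scattering.",
        return_tensors="pt",
    )
    with torch.no_grad():
        score = reward_model(**inputs).logits[0, 0]
    print(float(score))  # untrained here, so the value itself is meaningless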

That said, I'd welcome a community effort to build an open source version of this.

81

sanman t1_j1b9tun wrote

Do we know when ChatGPT itself will cease to be free, or cease to be available to the general public? I kind of like using this thing - I find it really convenient, so I'd like to know when I'm going to lose access to it.

10

amhotw t1_j1bnmw5 wrote

I mean, it is pretty cheap. You probably couldn't spend more than $10/month if it is priced similarly to GPT-3.

11

ktpr t1_j1cb1nd wrote

I suspect they’ll move towards paid tiers when the popularity goes down. Right now they’re getting a ton of interesting and rich data for free from going viral. But when that eventually fades they’ll want to continue generating some kind of value from it.

7

EthansWay007 t1_j1w05nk wrote

I'm curious, how do they use the data from it being asked questions to improve it? Does it flag questions it couldn't answer and then the team updates it?

1

Nextil t1_j1zqxp9 wrote

You can rate the responses up or down and provide an "ideal" response.

2

gelukuMLG t1_j23znll wrote

I think it saves the highly rated responses and feeds them into a dataset, then uses reinforcement learning by giving them a positive reward.

1

f10101 t1_j1cm39r wrote

Step 1 definitely explains why its responses often feel so similar to SEO waffle-farm content. I had been wondering where that aspect was coming from.

3

maxToTheJ t1_j1c6ut5 wrote

Yup. The training techniques have got a lot better since that first GPT-3 paper.

0

pilibitti t1_j1ai82j wrote

It can be crowdsourced once we have something up and running. This stuff will be commoditized eventually.

19

IWantAGrapeInMyMouth t1_j1b13kx wrote

It really does, but there's a point where OpenAI is going to want to cash in. Virtually all of their models could keep benefiting from reinforcement learning after the initial training, but we've seen how GPT-3 and DALL-E 2 were ultimately shipped as a sort of finished product that gets updates like any shipped app might, with costs attached. I don't see why ChatGPT will be any different after some amount of time, unless Stable Diffusion is really eating into their DALL-E 2 profitability and they need to find new ways of monetization that don't charge the ChatGPT user.

9

sanman t1_j1bacsy wrote

Well, remember when Youtube was totally free without any ads whatsoever? And of course we all wondered how they were going to continue offering their service for free. Then one day the ads crept in, and we knew.

I'm thinking OpenAI hasn't made this thing free out of generosity. They're using us as free beta-testers to shake out the product for them, so they can iron out the kinks and bugs. Once that process has run its course, they'll just cut off our access and only allow paying customers to use it.

13

jrkirby t1_j1bnhkx wrote

Why do you think they'll make us pay, when they could instead use the treasure trove of personal information to sell to advertisers and train the AI to subliminally (or explicitly) advertise to us?

6

sanman t1_j1bqgvb wrote

I wonder if there'll be a new budding industry for SEO with GPT, just like there is for SEO with Google search? I'm not sure how that would work though, since it might be harder to integrate spam/ads into GPT responses.

2

KimmiG1 t1_j1bf7bb wrote

I'm curious whether they'll keep a free version that sneaks in ads as natural-sounding conversation where it fits.

3

slashtom t1_j1bf92g wrote

Well, they're also getting feedback, and the model is only improved by human interaction. I'd bet they keep a free tier in order to get access to a broader pool, and charge companies/people a subscription fee if they want unlimited access or something.

1

lucidrage t1_j1c5jxp wrote

Imagine if chatgpt was ad supported... You just invented a new business model!

1

harharveryfunny t1_j1d5m40 wrote

Yes - not sure if everyone understands this. ChatGPT took GPT-3.5 as a starting point, but then added a reinforcement learning stage on top that aligned its output to what humans want from a question-answering chatbot. It's basically the next-generation InstructGPT.

https://arxiv.org/abs/2203.02155

From a quick scan of the BLOOMZ link, that seems to be just an LLM (i.e. more like GPT-3), not an instruction/human-aligned chatbot. There's a huge qualitative difference.

2

the-z t1_j1b8h6i wrote

To be fair, that's roughly how natural minds are trained, too.

1

meyerhot t1_j1cg6jj wrote

Anyone have any ideas about how they assigned rewards? Somehow take the sum of the prob(logits) from each token in the sentence and multiply that by the reward?
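
Something like this naive REINFORCE-style sketch is what I had in mind (purely my guess; OpenAI reportedly uses PPO with a KL penalty, so this is just the basic idea):

    import torch
    import torch.nn.functional as F

    def reinforce_loss(logits, generated_ids, reward):
        # logits: (seq_len, vocab_size) for the generated tokens
        # generated_ids: (seq_len,) token ids that were sampled
        # reward: scalar score for the whole sentence
        log_probs = F.log_softmax(logits, dim=-1)
        token_log_probs = log_probs.gather(1, generated_ids.unsqueeze(1)).squeeze(1)
        return -(reward * token_log_probs.sum())  # minimizing this ascends on reward

    # toy usage with random values
    logits = torch.randn(5, 100, requires_grad=True)
    ids = torch.randint(0, 100, (5,))
    loss = reinforce_loss(logits, ids, reward=0.7)
    loss.backward()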

1

maizeq t1_j1cj523 wrote

Tens of thousands of hours split across thousands of people does not seem too significant.

1

x246ab t1_j1utz7f wrote

Very true, but it only needs one good data dump hack

1

step21 t1_j19vr7h wrote

"Should be a one click deployment" lol, famous last words

140

artsybashev t1_j1c7pzh wrote

a lot of stuff can be run locally with git clone ... and docker compose up

13

Jonno_FTW t1_j1ctwja wrote

docker run ... in the ideal world, assuming someone made the Docker image properly.

1

lolorenz t1_j1cztpc wrote

Docker Compose is a tool that lets you control multiple Docker containers and handle their interactions. So docker compose up is already the ideal world :P

5

Jonno_FTW t1_j1gjx2f wrote

We use Docker Compose so much at work that we have alias dc='docker compose' on most of our cloud deployments.

1

Charuru t1_j1a8a71 wrote

I don't think the quality of most of these open-sourced models is usable; they really need another generation of improvement.

17

SirReal14 t1_j1axbn1 wrote

Another option is to work with or contribute to a distributed implementation of large language models. The Petals project is running BLOOM over a decentralized network of small workers (minimum 8 GB VRAM per worker).
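
Client usage looks roughly like this (adapted from the Petals README; the exact import path and model name may have changed since, so treat it as a sketch):

    from transformers import BloomTokenizerFast
    from petals import DistributedBloomForCausalLM

    model_name = "bigscience/bloom-petals"
    tokenizer = BloomTokenizerFast.from_pretrained(model_name)
    model = DistributedBloomForCausalLM.from_pretrained(model_name)  # joins the public swarm

    inputs = tokenizer("A cat in French is", return_tensors="pt")["input_ids"]
    outputs = model.generate(inputs, max_new_tokens=5)
    print(tokenizer.decode(outputs[0]))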

17

Soc13In t1_j1bdyxw wrote

Can Radeon cards work or is it Nvidia only?

1

coolbreeze770 t1_j1a8mo0 wrote

Or just pay .004c per API query? And OpenAI will let you fine-tune their model to your own needs.

Edit: I don't know the precise cost, I just pulled that number out of my ass.

15

judasblue t1_j1af1ug wrote

That's high by an order of mag :)

5

pilibitti t1_j1aimsd wrote

I think they price by generated token in their other products? If so, there should be a way to make ChatGPT less verbose out of the box.

Also, this stuff will be a lot more popular than the other products, but the hardware isn't really there for that kind of demand at the old prices, I assume. So it might be a bit more expensive than their other offerings.

3

judasblue t1_j1akvas wrote

Oh, I was just pointing out that 1000 tokens in their base model for other services is $0.0004, so an order of mag lower than u/coolbreeze770 was guessing. In other words, pretty friggin cheap for most uses, since a rough way to think about it is three tokens equaling two words on average.
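
Back of the envelope with those numbers (the price is the base-model rate mentioned above, the word count is just an assumption):

    price_per_1k_tokens = 0.0004   # USD, base model rate
    words = 100_000                # say, a heavy month of usage
    tokens = words * 3 / 2         # ~3 tokens per 2 words
    print(tokens / 1000 * price_per_1k_tokens)  # -> 0.06 USD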

edited for clunky wording

3

f10101 t1_j1cmsob wrote

Just in case you miss my other comment: ChatGPT actually seems to be particularly expensive to run compared to their other APIs. Altman says "single digit cents per chat".

1

caedin8 t1_j1asist wrote

As soon as we can fine tune it to our problem space, we are 100% putting it as a help bot in our commercial software. It’s ready, it just needs tuning.

2

IWantAGrapeInMyMouth t1_j1b1vl8 wrote

I imagine there'll be open-source versions of ChatGPT in the near future given its wild popularity. I'll probably just use that for personal projects, and in a business setting I would run a dedicated instance of that open-source version. .004 cents per 1000 tokens (or much less) is a hell of an ask if you're doing anything where users generate the tokens.

2

sanman t1_j1bakvl wrote

Open Source is only free when it's running off your own computer. Otherwise, if it's running off some infrastructure, then that has to be paid for - typically with ads or something like that.

12

IWantAGrapeInMyMouth t1_j1bzofo wrote

Usually inference on Hugging Face for large models is free for individuals making a reasonable number of API calls as part of their offerings, and I assume an open-source version of this would be on there. I realize that it costs money.

3

Cryptheon t1_j1cq8q3 wrote

Hi, I'm a high performance machine learning consultant working on this. I've run BLOOM on a cluster (not exactly aws/azure).

You could, if you have a large enough GPU, run BLOOM on one GPU by loading it one layer at a time; this can be done simply and naively using huggingface. I've tested this, for instance, using 4x 40 GB NVIDIA A100s (160 GB VRAM in total). Inference for 50 tokens still took 40 minutes out of the box, using bf16. If you want to bring this down and make it cost-effective, you need at least 8x 80 GB A100s (640 GB VRAM). Int8 will cut this requirement in half, but that means sacrificing inference time due to the nature of the int8 method. On top of that, there are still cluster-level optimizations you have to do if you really want to bring inference time down to a few milliseconds per generated token. This is probably how OpenAI does it: they keep the models continuously loaded on their GPUs, with highly optimized methods, so we can all use them en masse.
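
For reference, the naive huggingface route is roughly this (my sketch of the general approach, not the exact code used above):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
    model = AutoModelForCausalLM.from_pretrained(
        "bigscience/bloom",
        torch_dtype=torch.bfloat16,  # bf16 as mentioned above
        device_map="auto",           # accelerate spreads the layers over the available GPUs
    )

    inputs = tokenizer("The capital of Switzerland is", return_tensors="pt").to(0)
    out = model.generate(**inputs, max_new_tokens=10)
    print(tokenizer.decode(out[0]))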

Point being, this is not trivial to do and will cost money, expertise and time. Besides, BLOOM is not the best model performance-wise because it's a multi-language model. As others have mentioned, OpenAI's ChatGPT has additionally been trained using RL (PPO) on data we don't have access to.

11

Evoke_App t1_j1cwajo wrote

>run BLOOM on one GPU by running it one layer at a time, this can simply and naively be done using huggingface
>
>I've tested this, for instance, using 4 40GB VRAM NVIDIA A100s (160GB Vram in total)

Is it possible to also load it one layer at a time using 24x32GB V100s as well? And would that save on costs (compared to 8x80 A100s) without sacrificing throughput too much?

I'd just like to see if this is worth it before delving too deep into it haha.

1

Cryptheon t1_j1g9v0r wrote

You won't need to load it one layer at a time with enough VRAM. 24x32GB V100s should be enough to load the whole model and do inference. The main bottleneck is GPU-GPU communication and the speed of the GPUs for inference.

In theory you can use a single 16 GB+ GPU and load the model one layer at a time, but this takes way too long for generation. During my tests, each layer load + inference took ~1.2 s. BLOOM 175B has 72-ish layers, so just one token prediction can take roughly 1.5 minutes with this method. That's waaaay too slow.

2

gettheflyoffmycock t1_j1bzvsv wrote

I've had to deploy a lot of deep learning models, and there will not be a simple, easy, slap-on deployment for something like this. Furthermore, it is not going to be cheaper. First of all, I'm not sure whether it requires a graphics card, but on AWS there is a one-hour minimum unless you use a more expensive contract. So when you make an API request, it's going to charge you the full three-dollar minimum, or up to $20 depending on what instance you are using.

Furthermore, the cold-start time: if you shut it down when not in use, it takes at least 5 to 10 minutes for a model of this size to get up and running. The only way this is cost-effective is if it can run on CPU only, in which case it could fit on an extremely cheap or free AWS instance. But my guess is that models like this are not going to run fast enough on CPU alone to make it worth it.

Can anyone chime in on whether state-of-the-art text generation models like this can run on CPU only?

9

maxToTheJ t1_j1c75bp wrote

You are 100% right. However, people will do what they did with DALL-E: make a budget Mickey Mouse version and pretend it's the exact same thing, without measuring any quantitative metrics between the original implementation and theirs.

1

gettheflyoffmycock t1_j1chv31 wrote

Yeah, funny how many people have been advertising their new ChatGPT application on all the machine learning subreddits. Which is funny because ChatGPT doesn't even have an API yet.

Kinda funny, AI is ending up like drop shipping: the art of advertising shitty AliExpress products as if they're actually better products, upcharging people 500 or 1000%, then just ordering the AliExpress product and having it mailed to their house. It's like people are doing that with AI now. Just say it's this or that, then put a super lightweight model like OpenAI's Davinci on a free AWS instance and call it ChatGPT. Business models built on "if Davinci charges you four cents per API credit, just charge the user eight cents"; what will they know?

4

mrcschwering t1_j1easrz wrote

I have only deployed a few models (smaller, BERT-like) and was able to fit some of them into a Lambda function (loaded from S3).

Otherwise, if we don't care about start-up time, a Lambda function that starts a spot instance works.
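
Something in this direction (a minimal sketch; the instance ID and region are placeholders, you'd also need IAM permissions and something to stop the instance again, and for actual spot capacity you'd create a spot request instead of starting a stopped instance):

    import boto3

    def lambda_handler(event, context):
        ec2 = boto3.client("ec2", region_name="eu-central-1")
        # start a pre-configured, stopped inference instance on demand
        ec2.start_instances(InstanceIds=["i-0123456789abcdef0"])
        return {"statusCode": 200, "body": "inference instance starting"}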

1

race2tb t1_j1bw4zv wrote

OpenAI is better off with lower profits and higher engagement, since the engagement is what fuels their models' progress. I can't say for sure what they will do, but right now is not the time to be trying to be exclusive. They should work on some kind of feedback/reputation credit system that lets you earn credit by helping them fine-tune.

7

kamalilooo t1_j1c3l20 wrote

That would be a new era of publishing: a new content ecosystem and a complete redesign of how revenue is shared. Google isn't releasing LaMDA because they don't have the answer. SEO is on its deathbed, and no one knows how to make a sustainable ecosystem, because the rise of ChatGPT will eliminate most of the current incentives to publish the very content that will eventually be needed to keep the LLMs up to date.

3

_underlines_ OP t1_j1c8324 wrote

Brands/advertisers pay money to the LLM platform to run highly targeted ads in the LLM interface (for example chatGPT, lawGPT, medGPT etc.). The LLM platform then pays a share of that ad revenue to the content creators whose work it uses for training and fine-tuning.

3

kamalilooo t1_j1ce0ss wrote

I think this is the most sensible take I've heard on the future of written content, but how feasible do you think it is in terms of computation? Sounds like you'd need a whole new AI just for the ads to pull it off, and then somehow integrate it with the LLM.

Sorry if it's stupid, I know nothing about AI. I'm a content writer with existential dread and severe whiplash from all this hype.

Ultimately, we need a system to incentivise human writers, otherwise I don't see LLMs scaling.

2

_underlines_ OP t1_j1cgmie wrote

There's no need to integrate the ads into the LLM itself. Just integrate them into the UI that users use to converse with the AI. Between answers you can either inject ads, or you can alter answers to mention certain brands.

Very unethical, and that's why I hope this becomes detached from big corps like OpenAI that do this behind a locked-down API...

5

kamalilooo t1_j1ckyk7 wrote

So OpenAI gets all the revenue from online advertising, and ends up removing the incentive to publish new content, limiting the usefulness of the LLM because it will be 'stuck in time' in a sense. (Not sure if this is a fair assessment.)

Do you think the influx of data they get from our interactions with ChatGPT can make up for human writers updating Google (and the web) with new data/information as it emerges in real life?

How will the AI add anything to the conversation if it's stuck in time?

1

PrinceOfLies0 t1_j1amh6l wrote

Hey, I would gladly join your effort. I have a similar background and certain concerns regarding OpenAI's direction in their approach to censorship. I'm currently still mostly inexperienced with machine learning, with a mediocre understanding of the linear algebra behind it. I intend to use (and currently partially use) ML for image generation, for improving formal software verification by possibly generating SMT conditions, and for aiding procedural generation algorithms...

I would not be too concerned with ChatGPT costing a bit of money, but rather with the API or functionality being neutered for being "too powerful". As such, I'd rather have control over the whole AI stack.

Long term, I would also like to investigate the possibility of massive GPU-based distributed training, similar to Folding@home but for generating models.

Discord/ Element/ Telegram - I am free to talk :)

6

fqrh t1_j1avi8q wrote

I worry about missing out when ChatGPT censors itself. Ideally, if someone pays for a chatbot themselves, they can get uncensored responses from it.

6

maizeq t1_j1ckd3o wrote

I would be interested in helping. (Currently in AI research but not focussed on LLMs).

I don't like that the user feedback OpenAI is accumulating from ChatGPT is deepening their moat (I highly doubt they will release all that data publicly).

For a company founded on principles of openness to be working directly against the democratisation of AI, some serious criticism is warranted, I think.

I could perhaps understand if profitability were needed to cover the cost of their research, but the models they are commercialising are by and large based on the research of other labs, which are far more open with releasing their work. Their closed approach will simply incentivise other research labs to make their research more closed as well, further increasing the likelihood of AI being concentrated in the hands of very few.

6

blose1 t1_j1apar1 wrote

Even with int8 you need at least 175 GB of VRAM to run one model instance; the time to launch and load it on demand will be higher than using the OpenAI API, and your performance will be lower. Forget about running the current generation of LLMs like OPT/BLOOM in the cloud for real-world use cases. They are crap; I've tested them. They loop all the time and they can't match ChatGPT's results. You will not get ChatGPT-level performance from them without the human-assisted RL step that OpenAI did. So wait for the next generation of open-source models, or just use ChatGPT.

5

Bartmoss t1_j1aqxem wrote

I think playing around with a nice encoder-decoder like T5 is a great start. The original model is already nice, and the newer flan-T5 can be better for some few-shot tasks. The base models are already pretty good; even the small models perform well. I haven't tried t5-tiny yet, but it is on my list to play with.

Of course, if you have specific text generation tasks, you can fine-tune T5. You can even fine-tune the same model on several tasks using different prompts. I have found that for some tasks (especially where a sequence-to-sequence model has advantages), a fine-tuned T5 (or some variant thereof) can beat a zero-shot, few-shot, or even fine-tuned GPT-3 model.

It can be surprising what such encoder-decoder models can do with prompt prefixes and few-shot learning, and they can be a good starting point for playing with large language models.
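
A quick way to poke at it (just a sketch; the model size and prompt prefix are arbitrary examples):

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
    model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

    # instruction-style prompt prefix; flan-t5 responds to these zero-shot
    prompt = "Answer the question: Why is the sky blue?"
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(out[0], skip_special_tokens=True))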

4

cajmorgans t1_j1cj3is wrote

We seriously need to create an open-source model; it's important that no single company gets the whole market share in these powerful tools.

4

jbreezeai t1_j1cdo0r wrote

I'm interested. I have played around with GPT models and BERT, but haven't gotten into BLOOMZ yet. I have trained custom GPT-3 models on OpenAI, and my team has worked with a lot more.

My concerns with OpenAI are that there is no clarity on whether my data will be reused or adapted into their general models, and that training GPT-3 is very cumbersome and not flexible.

Advantage of OpenAI: training the models and deploying them is all API-based, so there's no infra or DevOps/MLOps overhead.

I think ultimately cost will be close to parity across all clouds, with a 10-20% delta. The automation will be the extra cost: do you pay OpenAI or AWS for automation, or hire someone to do it?

3

pan_berbelek t1_j1cpkth wrote

I'm trying to do basically the same thing and yes, running BLOOM does require a lot of memory. I managed to run it on:

  • an ordinary computer with no GPU and 16 GB of RAM, by loading parts of the model (divided into 73 parts) for every single token. This is painfully slow: 2-3 minutes per token produced (rough sketch of this idea below)
  • a VM in Azure with no GPU but with lots of RAM (600+ GB). This can generate a single token in 2-3 seconds, still way too slow for my use case

Now I'm trying to run it on an Azure VM with 8 A100 GPUs, as recommended by the BLOOM authors, but this is of course significantly more expensive: the right-sized VM costs $35 per hour. From what I read, this setup could generate a single token in less than 1 millisecond, and if that's really true then it is actually the cheapest option for my use case despite the high VM cost. But I first need to validate whether I can really achieve this speed.
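
For the first approach, the closest off-the-shelf equivalent I know of is accelerate's offloading (a sketch; still very slow, it just avoids needing all of the weights in memory at once):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
    model = AutoModelForCausalLM.from_pretrained(
        "bigscience/bloom",
        torch_dtype=torch.bfloat16,
        device_map="auto",         # fill GPU (if any) and CPU RAM first ...
        offload_folder="offload",  # ... and spill the remaining layers to disk
    )

    inputs = tokenizer("Hello, my name is", return_tensors="pt")
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=10)[0]))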

3

tripple13 t1_j1cu0yj wrote

God I love this post.

More genuine passion in this sub, please!

Keep us updated on your progress, would be great to follow.

2

ShowerVagina t1_j1d0u61 wrote

My biggest pet peeve with chatGPT is how sanitized it is. I want a chat bot i can experiment with. I want a chat bot that will argue why the earth should be destroyed by an asteroid. Can SOTA LLM's do that?

In terms of GPU compute, I'd highly recommend Paperspace's $40 a month Pro plan. You get access to their GPUs at no extra charge, and your instances live for up to 6 hours with your files and storage persisting between runs. Capacity is limited on higher-end GPUs, but you can reliably get at least an A5000 at most times. So I'm happy to help with processing power.

2

yahma t1_j1dulgw wrote

Based on my testing, none of the open-source models are anywhere near as good as ChatGPT (or even davinci-003, the latest GPT-3 snapshot).

I think open-source models need more fine-tuning and some RL techniques applied to get anywhere close.

2

meyerhot t1_j1cfyrj wrote

I am really interested in this and have been looking into doing some sort of fine-tuning on an LLM like GLM or BLOOM. I had this idea for human-in-the-loop training back in grad school, but I couldn't figure out how to assign rewards to sentences when the text is generated token by token.

1

ztapper t1_j1chmq7 wrote

I would participate here. We have some use cases in the always-on/semi-supervised learning space that might be helpful.

1

rodio346 t1_j1cpn1y wrote

Saving this post to come back to it after my exams

1

Delicious-View-8688 t1_j1ctzdn wrote

Isn't Microsoft Azure AI's Davinci sort of what this is?

1

canttouchmypingas t1_j1gq2th wrote

Someone will post their implementation on GitHub soon if they haven't already. All we really need is an open-source dataset and we'd be good to go, the barrier to entry being only setting up the AWS instance to train your model.

This would allow different communities to develop their own datasets: if programmers pulled together with the ChatGPT hype to make a large programming dataset, we'd have a much more capable GitHub Copilot relatively soon.

We just need the open-source datasets and an implementation; the latter usually comes, but the former is elusive.

1

BestSentence4868 t1_j1bjak3 wrote

run int8 instead of fp16 gg rip

0

Last-Caterpillar-112 t1_j1cvgx3 wrote

So you are saying that when ChatGPT, which we are all perfectly happy with, starts charging a few dollars a month, we (or you, or someone else) should spend a TON of money AND unknown effort to roll our own hastily trained, half-assed LLM in a couple of months, with mixed results? And this potential ChatGPT-killer will be altruistic and free forever?

0

yellowbrowntwo2 t1_j1bpzoe wrote

Thanks for the information and the share. It seems that GPT-4 (and beyond) and competitors will have an advertisement-supported model like current search engines (e.g. Google).

I am sure everyone will agree that AI will eventually score 100% on many IQ tests, as compared in your Google doc.

AI has shown worthwhile results on common sense, the physical world, and reasoning comparable to adult humans at present (2022).

It seems that these AI chat engines have less memory, though.

−3