Submitted by minimaxir t3_11fbccz in MachineLearning

https://openai.com/blog/introducing-chatgpt-and-whisper-apis

> It is priced at $0.002 per 1k tokens, which is 10x cheaper than our existing GPT-3.5 models.

This is a massive, massive deal. For context, the reason GPT-3 apps took off over the past few months, before ChatGPT went viral, is that a) text-davinci-003 was released and offered a significant performance increase, and b) the cost was cut from $0.06/1k tokens to $0.02/1k tokens, which made consumer applications feasible without a large upfront cost.

A much better model at 1/10th the cost warps the economics completely, to the point that it may be better than in-house finetuned LLMs.

I have no idea how OpenAI can make money on this. This has to be a loss-leader to lock out competitors before they even get off the ground.

574

Comments


Educational-Net303 t1_jair4wf wrote

Definitely a loss-leader to cut off Claude/Bard; electricity alone would cost more than that. Expect a price rise in 1 or 2 months.

68

harharveryfunny t1_jairuhd wrote

It says they've cut their costs by 90% and are passing that saving on to the user. I'd have to guess that they are making money on this, not just treating it as a loss-leader for their other, more expensive models.

The way the API works is that you have to send the entire conversation each time, and the tokens you are billed for include both those you send and the API's response (which you are likely to append to the conversation and send back, getting billed again and again as the conversation progresses). By the time you've hit the 4K token limit of this API, there will have been a bunch of back and forth - you'll have paid a lot more than 4K × $0.002/1K for the conversation. It's easy to imagine chat-based APIs becoming very widespread and the billable volume becoming huge. OpenAI runs on Microsoft Azure compute, so Microsoft may see a large spike in usage/profits out of this.
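
For a concrete sense of how that compounds, here's a minimal sketch (the per-message token count is an assumption for illustration):

```python
# Hypothetical illustration: cumulative billing when the full history is
# resent on every turn. Assumes 100-token messages on both sides and the
# announced $0.002 per 1K tokens.
PRICE_PER_1K = 0.002
TOKENS_PER_MESSAGE = 100  # assumption, just for illustration

history = 0   # tokens accumulated in the conversation so far
billed = 0    # total tokens billed across all API calls
for turn in range(1, 11):
    history += TOKENS_PER_MESSAGE   # user message joins the history
    billed += history               # the entire history is sent as the prompt
    billed += TOKENS_PER_MESSAGE    # ...plus the assistant's reply is billed
    history += TOKENS_PER_MESSAGE   # the reply joins the history too
    print(f"turn {turn:2d}: ${billed * PRICE_PER_1K / 1000:.4f} total")
# Billed tokens grow quadratically with conversation length, so a long chat
# costs far more than its final token count suggests.
```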

It'll be interesting to see how this pricing, and that of competitors, evolves. Also interesting are some of OpenAI's annual price plans outlined elsewhere, such as $800K/yr for their 8K-token-limit "DV" model (DaVinci 4.0?) and $1.5M/yr for the 32K-token-limit "DV" model.

69

JackBlemming t1_jaisvp4 wrote

Definitely. This is so they can become entrenched and collect massive amounts of data. It also discourages competition, since competitors won't be able to match these artificially low prices. This is not good for the community. It's the equivalent of opening a restaurant and giving away food for free, then jacking up prices once the adjacent restaurants go bankrupt. OpenAI are not good guys.

I will rescind my comment and personally apologize if they release ChatGPT code, but we all know that will never happen, unless they have a better product lined up.

68

lostmsu t1_jaj0dw2 wrote

I would love an electricity estimate for running GPT-3-sized models with optimal configuration.

According to my own estimate, the lifetime (~5y) electricity cost of a 350W GPU is between $1k and $1.6k. That means that for enterprise-class GPUs, electricity is dwarfed by the cost of the GPU itself.
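
As a sanity check on that range, a quick back-of-the-envelope calculation (the $/kWh rates are assumptions):

```python
# 5-year electricity cost of a 350W GPU running continuously.
watts = 350
kwh = watts / 1000 * 24 * 365 * 5        # ~15,330 kWh over ~5 years
for rate in (0.07, 0.10):                # assumed $/kWh range
    print(f"${rate:.2f}/kWh -> ${kwh * rate:,.0f}")
# ~$1,073 at $0.07/kWh and ~$1,533 at $0.10/kWh, matching the $1k-$1.6k range
```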

14

LetterRip t1_jaj1kp3 wrote

> I have no idea how OpenAI can make money on this.

Quantizing to mixed int8/int4 - 70% hardware reduction and 3x speed increase compared to float16 with essentially no loss in quality.

A × 0.3 / 3 = 0.1 × A, i.e. 10% of the cost.

Switch from quadratic to memory efficient attention. 10x-20x increase in batch size.

So we are talking about roughly 1% of the resources against a 10x price reduction - they should be 90% more profitable per token than when they introduced GPT-3.

edit - see MS DeepSpeed-MII, which shows a 40x per-token cost reduction for BLOOM-176B vs. the default implementation:

https://github.com/microsoft/DeepSpeed-MII

Also, there are additional ways to reduce cost not covered above: pruning, graph optimization, and teacher-student distillation. I think teacher-student distillation is extremely likely, given reports that the model has difficulty with more complex prompts.
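
For a flavor of the simplest of these techniques, here's a minimal int8 sketch using PyTorch's built-in dynamic quantization (a stand-in only; the mixed int8/int4 schemes and custom inference kernels described above go well beyond this):

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer block's feed-forward layers.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

# Dynamic quantization: weights are stored as int8 and activations are
# quantized on the fly, cutting weight memory roughly 4x vs float32.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 1024)
err = (model(x) - quantized(x)).abs().max().item()
print(f"max output deviation after int8 quantization: {err:.4f}")
```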

252

jturp-sc t1_jaj2w4j wrote

Glad to see them make ChatGPT accessible via API and go back and update their documentation to be clearer about which model is which.

I had an exhausting number of conversations with confused product managers, engineers, and marketing managers along the lines of "No, we're not using ChatGPT".

19

Timdegreat t1_jaj3gpr wrote

Will we be able to generate embeddings using the ChatGPT API?

9

jturp-sc t1_jaj45ek wrote

The entry costs have always been so high that LLMs as a service was going to be a winner-take-most marketplace.

I think the best hope is to see other major players enter the space, either commercially or as FOSS. I think the former is more likely, and I was really hoping we would see PaLM on GCP, or even something crazier like a Meta-Amazon partnership for LLaMA on AWS.

Unfortunately, I don't think any of those orgs will pivot fast enough until some damage is done.

27

harharveryfunny t1_jaj8bk2 wrote

Could you put any numbers to that?

What are the FLOPs per token of inference for a given prompt length (for a given model)?

What do those FLOPs translate to in terms of runtime on Azure's GPUs (V100s?)?

What are the GPU power consumption and data center electricity costs?

Even with these numbers, can we really relate this to their $/token pricing? The pricing page mentions the 90% cost reduction being for the "gpt-3.5-turbo" model vs. the earlier text-davinci-003 (?) one - do we even know the architectural details needed to get the FLOPs?

4

WarProfessional3278 t1_jaj9nnt wrote

Rough estimate: with one 400W GPU and $0.14/hr electricity, you are looking at ~$0.00016/sec here. That's the price of running the GPU alone, not accounting for server costs etc.
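
For anyone who wants to plug in their own numbers, a rough framework (every figure here is an assumption: the standard ~2 FLOPs/parameter/token approximation, a 175B-parameter model, nominal A100 specs, and it ignores multi-GPU memory constraints entirely):

```python
# Order-of-magnitude electricity cost per generated token.
params = 175e9                     # assumed GPT-3-sized model
flops_per_token = 2 * params       # ~2 FLOPs per parameter per token
sustained_flops = 312e12 * 0.30    # A100 fp16 peak at an assumed 30% utilization

sec_per_token = flops_per_token / sustained_flops
kwh_per_token = 0.400 * sec_per_token / 3600   # assumed 400W draw
print(f"{sec_per_token * 1e3:.2f} ms/token")
print(f"${kwh_per_token * 0.14 * 1000:.5f} electricity per 1K tokens")
# -> a few ms/token and well under a tenth of a cent per 1K tokens before
#    batching; hardware amortization, not electricity, dominates the cost.
```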

I'm not sure if there are any reliable estimates of FLOPs per token for inference, though I will be happy to be proven wrong :)

3

luckyj t1_jajaz53 wrote

But that (sending all or part of the conversation history) is exactly what we had to do with text-davinci if we wanted to give it some kind of memory. It's the same thing with a different format, at 10% of the price... And having tested it, it's more like ChatGPT ("I'm sorry, I'm a language model"-type replies), which I'm not very fond of. But the price... hard to resist. I've just ported my bot to the new model and will play with it for a few days.

24

Purplekeyboard t1_jajcnb5 wrote

> This is not good for the community.

When GPT-3 first came out and prices were posted, everyone complained about how expensive it was, and that it was prohibitively expensive for a lot of uses. Now it's too cheap? What is the acceptable price range?

6

badabummbadabing t1_jajdjmr wrote

Honestly, I have become a lot more optimistic about avoiding monopolies in this space.

When we were still in the phase of 'just add even more parameters', the future seemed to be headed that way. With Chinchilla scaling (and looking at the results of e.g. LLaMA), things look quite a bit more optimistic. Consider that ChatGPT is reportedly much lighter than GPT-3. At some point, the availability of data will be the bottleneck (which is where an early entry into the market helps with collecting said data), whereas compute will become cheaper and cheaper.

The training costs lie in the low millions ($10M was the cited number for GPT-3), which is a joke compared to the startup costs in many, many industries. So while this won't be something that just anyone can train, I think it's more likely that there will be a few big players (rather than a single one) going forward.

I think one big question is whether OpenAI can leverage user interaction for training purposes -- if that is the case, they can gain an advantage that will be much harder to catch up to.

24

LetterRip t1_jajezib wrote

June 11, 2020 is the date the GPT-3 API was introduced. There was no int4 support, and the Ampere architecture with int8 support had been introduced only weeks prior. So the pricing was set based on a float16 architecture.

Memory efficient attention is from a few months ago.

ChatGPT was just introduced a few months ago.

The question was how OpenAI could be making money on this: if they were making a profit at GPT-3's 2020 pricing, then they should be making 90% more profit per token at the new pricing.

52

JackBlemming t1_jajg4dz wrote

It's not about the price, it's about the strategy. The Google Maps API was dirt cheap, so nobody competed; then they cranked prices up 1400% once they had years of advantage and market lock-in. That's not OK.

If OpenAI keeps prices stable, nobody will complain, but this is likely a market capturing play. They even said they were losing money on every request, but maybe that's not true anymore.

18

bmc2 t1_jajjjvd wrote

Training based on submitted data is going to be curtailed according to their announcement:

“Data submitted through the API is no longer used for service improvements (including model training) unless the organization opts in”

5

VertexMachine t1_jajjq8b wrote

Yeah, but one thing is not adding up. It's not like I can go to a competitor and get access to an API of similar quality.

Plus, if it's a price war... with Google... that would be stupid. Even with Microsoft's money, Alphabet Inc. is not someone you want to fight in a price-undercutting war.

Also, they updated their policies on using user data, so the data-gathering argument doesn't seem valid either (if you trust them).


Edit: ah, btw, I'm not saying there is no ulterior motive here. I haven't really trusted "Open"AI since the "GPT-2-is-too-dangerous-to-release" BS (and the corporate restructuring). I just don't think it's that simple.

13

farmingvillein t1_jajtmly wrote

> Plus if it's a price war... with Google.. that would be stupid

If it is a price war strategy...my guess is that they're not worried about Google.

Or, put another way: if it ends up Google versus OpenAI, OpenAI is pretty happy with the resulting duopoly. Crushing everyone else in the womb, though, would be valuable.

11

farmingvillein t1_jajw0yj wrote

> The training costs lie in the low millions (10M was the cited number for GPT3), which is a joke compared to the startup costs of many, many industries. So while this won't be something that anyone can train, I think it's more likely that there will be a few big players (rather than a single one) going forward.

Yeah, I think there are two big additional unknowns here:

  1. How hard is it to optimize inference costs? If--for the sake of argument--for $100M you can drop your inference unit costs by 10x, that could end up being a very large and very hidden barrier to entry.

  2. How much will SOTA LLMs really cost to train in, say, 1-2-3 years? And how much will SOTA matter?

The current generation will, presumably, get cheaper and easier to train.

But if it turns out that, say, multimodal training at scale is critical to leveling up performance across all modes, that could jack up training costs really, really quickly--e.g., think of the cost to suck down and train against a large subset of public video. Potentially layer in synthetic data from agents exploring worlds (basically, videogames) as well.

Now, it could be that the incremental gains to, say, language are not that high--in which case the LLM (at least as these models exist right now) business probably heavily commoditizes over the next few years.

9

caedin8 t1_jakcasg wrote

It's exciting to see that ChatGPT's cost is 1/10th that of GPT-3 API, which is a huge advantage for developers who are looking for high-quality language models at an affordable price. OpenAI's commitment to providing top-notch AI tools while keeping costs low is commendable and will undoubtedly attract more developers to the platform. It's clear that ChatGPT is a superior option for developers, and OpenAI's dedication to innovation and affordability is sure to make it a top choice for many in the AI community.

−15

MonstarGaming t1_jakqs01 wrote

>I have no idea how OpenAI can make money on this.

Personally, I don't think they can. What is the main use case for chatbots? How many people are going to pay $20/month to talk to one? I mean, chatbots aren't exactly new... anybody who wanted to chat with one before ChatGPT could have, and yet there wasn't an industry for it. Couple that with it not being possible to know whether its answers are fact or fiction, and I just don't see the major value proposition.

I'm not overly concerned one way or another, I just don't think the business case is very strong.

−14

xGovernor t1_jaksctz wrote

I've been tinkering with DaVinci, but even with turbo/premium, using the gpt-3.5-turbo API requires a credit card on the account. Excited to fool with it; however, I typically use 2048-4000 tokens on DaVinci 3.

3

LetterRip t1_jal4vgs wrote

Yep, or a mix between the two.

GLM-130B quantized to int4; OPT and BLOOM to int8:

https://arxiv.org/pdf/2210.02414.pdf

Often you'll want to keep the first and last layers at int8 and can do everything else at int4. You can quantize based on each layer's sensitivity, etc. I also (vaguely) recall a mix of 8-bit for weights and 4-bit for biases (or vice versa?).

Here is a survey of quantization methods; for mixed int8/int4, see Section IV, "ADVANCED CONCEPTS: QUANTIZATION BELOW 8 BITS":

https://arxiv.org/pdf/2103.13630.pdf

Here is a talk on Auto48 (automatic mixed int4/int8 quantization):

https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41611/
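
To make the sensitivity trade-off concrete, a minimal symmetric absmax quantization sketch comparing the two bit widths (illustrative only; production schemes add calibration, outlier handling, and custom kernels):

```python
import numpy as np

def quantize_dequantize(w, bits):
    """Symmetric absmax quantization: round weights onto a signed integer
    grid of the given bit width, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1          # 127 for int8, 7 for int4
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=10_000).astype(np.float32)   # toy weight tensor

for bits in (8, 4):
    err = np.abs(w - quantize_dequantize(w, bits)).mean()
    print(f"int{bits}: mean abs error = {err:.4f}")
# int4 error is over an order of magnitude larger than int8's, which is
# why the most sensitive layers (often first and last) are kept at int8.
```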

11

Lychee7 t1_jalbr7l wrote

What are the criteria for tokens? Is it that the more complex and the longer the prompt, the more tokens it'll use?

1

fmai t1_jalcs0x wrote

AFAIK, FlashAttention is just a very efficient implementation of attention, so it's still quadratic in the sequence length. Can this be a sustainable solution for when context windows go to the hundreds of thousands?

14

Hsemar t1_jalp8as wrote

But does FlashAttention help with auto-regressive generation? My understanding was that it avoids materializing the large query-key attention matrix during training. At inference (one token at a time) with KV caching, this shouldn't be that relevant, right?

0

WarAndGeese t1_jalq339 wrote

Don't let it demotivate competitors. They are making money somehow, and planning to make massive amounts more. Hence the space is ripe for tons of competition, and those other companies would also be on track to make tons of money. So jump in, competitors - the market is waiting for you.

−1

Stakbrok t1_jam0bpq wrote

You can, of course, edit what it replied (and then hope it builds off of that and keeps that specific vibe going, which always worked in the playground), but damn, they locked it down tight. 😅

Even when you edit the primer/setup into something crazy (you are a grumpy or deranged or whatever assistant) and change some of the things it said into something crazy, it overrides the custom mood you set and goes right back to its ever-serious ChatGPT mode - sometimes even apologizing for saying something out of character (meaning the thing you 'made it say' by editing, which it believes it said).

5

Smallpaul t1_jam83rb wrote

I guess you haven’t visited any B2C websites in the last 5 years.

But also: there is a world model behind the chatbot which can translate between human languages, between computer languages, can compose marketing copy, summarise text...

4

londons_explorer t1_jam8409 wrote

It was an interesting business decision to announce two rather different products (the ChatGPT API and Whisper) in the same blog post...

ChatGPT is a best-in-class, or even only-in-class, chatbot API... while Whisper is one of many hosted speech-to-text solutions.

4

harharveryfunny t1_jamab7m wrote

The two pair up very well, though - now that there's a natural-language API, you could leverage it for speech -> text -> ChatGPT. From what I've seen of the Whisper demos, it seems to be the best out there by quite a margin. Does anything else perform as well?

4

ShowerVagina t1_jamiqb4 wrote

> I had an exhausting number of conversations with confused product managers, engineers and marketing managers on “No, we’re not using ChatGPT”.

They use your conversations for further training, which means that if you use it to help with proprietary code or documentation, you're effectively disclosing it.

1

Dekans t1_jamokhr wrote

> We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method.

...

> FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).

In the paper, the bolded results use the block-sparse version. The Path-X (16K length) result uses regular FlashAttention.

4

lucidraisin t1_jamtx7b wrote

it cannot, the compute still scales quadratically although the memory bottleneck is now gone. however, i see everyone training at 8k or even 16k within two years, which is more than plenty for previously inaccessible problems. for context lengths at the next order of magnitude (say genomics at million basepairs), we will have to see if linear attention (rwkv) pans out, or if recurrent + memory architectures make a comeback.

14

ShowerVagina t1_jamyp12 wrote

I might be in the minority, but I strongly believe in unfiltered AI (or a minimal filter, only blocking things like directions to cook drugs or make weapons). I know they filter it for liability reasons, but I wish they didn't.

3

Timdegreat t1_jan7sel wrote

You can use the embeddings to search through documents. First, create embeddings of your documents. Then create an embedding of your search query. Compute the similarity between the document embeddings and the search embedding, and surface the top N documents.
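
For illustration, a minimal sketch of that flow with the openai Python library of the era (the ada embedding model is an assumption here; whether ChatGPT-derived embeddings would even be offered is exactly what's debated below):

```python
import numpy as np
import openai  # pre-1.0 interface; assumes openai.api_key is set

def embed(texts, model="text-embedding-ada-002"):
    resp = openai.Embedding.create(input=texts, model=model)
    return np.array([d["embedding"] for d in resp["data"]])

docs = ["How to cancel a subscription", "Refund policy", "Shipping times"]
doc_vecs = embed(docs)
query_vec = embed(["I want my money back"])[0]

# Cosine similarity between the query and every document, then top-N.
sims = doc_vecs @ query_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
for i in np.argsort(sims)[::-1][:2]:
    print(f"{sims[i]:.3f}  {docs[i]}")
```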

1

sebzim4500 t1_jan85s7 wrote

Yeah, I get that embeddings are used for semantic search, but would you really want to use a model as big as ChatGPT to compute the embeddings? (Given how cheap and effective Ada is.)

2

Timdegreat t1_jangbi7 wrote

You got a point there! I haven't given it too much thought, really - I definitely need to check out Ada.

But wouldn't the ChatGPT embeddings still be better? Given that they're cheap, why not use the better option?

1

Sea_Alarm_4725 t1_janmlir wrote

I can't seem to find anywhere what the token limit per request is. With davinci it's something like 4k tokens - what about this new ChatGPT API?

1

MonstarGaming t1_jap3jzc wrote

>I guess you haven’t visited any B2C websites in the last 5 years.

I have, and that is exactly my point. The main use case is B2C websites, NOT individuals, and there are already very mature products in that space. OpenAI needs to develop a lot of bells, whistles, and integration points with existing technologies (Salesforce, ServiceNow, etc.) before it can be competitive in that market.

>can translate between human languages

Very valuable, but Google and Microsoft both offer this for free.

>between computer languages

This is niche, but it does seem like an untapped, albeit small, market.

>can compose marketing

Also niche. That being said, would it save time? Marketing materials are highly curated.

>summarise text...

Is this a problem a regular person would pay to have solved? The maximum input size is 2048 tokens / ~1,500 words / three pages. Assuming an average person pastes in the maximum input, they're summarizing material that would take them 6 minutes to read (Google says the average person reads 250 words per minute). Mind you, it isn't saving 6 minutes - they still need to read all of the content ChatGPT produces. Wouldn't the average person just skim the document if they wanted to save time?

To your point, it is clearly a capable technology, but that wasn't my argument. There have been troves of capable technologies that were ultimately unprofitable. While I believe it can be successful in the B2C market, I don't think the value proposition is nearly as strong for individuals.

Anyhow, only time will tell.

−3

MonstarGaming t1_jap8605 wrote

That seems to be the gist of this entire thread. This is the first API most of r/MachineLearning has heard of, so it must be the best on the market. /s

To your point, there are companies that have been developing speech-to-text for decades. The capability is so unremarkable that most (all?) cloud providers already have a speech-to-text offering that integrates easily with their other services.

I know this is a hot take, but I don't think OpenAI has a business strategy. They're deploying expensive models that directly compete with entrenched, big tech companies. They can't be thinking they're going to take market share away from GCP, AWS, Azure with technologies that all three offer already, right? Right???

1

fasttosmile t1_japaes4 wrote

To be fair, they are technically very competent and the pricing is very cheap. And their marketing is great.

But yeah, dealing with B2B customers (where the money is) and integrating their feedback is a very different thing from what they've been doing so far. They might be angling to serve as a platform for AI companies that then have to deal with average customers. That way they only deal with people who understand the limitations of AI. Could work. It will change the company to be less researchy, though.

1

MonstarGaming t1_japbd46 wrote

>They are making money somehow

Extremely doubtful. Microsoft went in for $10B at a $29B valuation. We have seen pre-revenue companies IPO for far more than that. Microsoft's $10B deal is probably the only thing keeping them afloat.

>Hence the space is ripe for tons of competition

I think you should look up which big tech companies already offer chatbots. You'll find the space is already very competitive. Sure, they aren't large, generative language models, but they target the B2C market that ChatGPT is attempting to compete in.

1

MonstarGaming t1_japjnn4 wrote

Nice, nothing demonstrates the Dunning-Kruger effect quite like a string of insults.

For whatever it's worth, that argument is exceedingly weak. I'll let you brainstorm on why that might be. I have no interest in debating someone who so obviously lacks tact.

−2

soobardo t1_japo5w5 wrote

Yes, they pair up perfectly. Whisper picks up anything I babble at it, English or French, and it's surprisingly fast. I've wrapped it in a loop that:

listens to the mic -> Whisper STT -> ChatGPT -> language detect -> Google TTS -> speaker

With noise/silence detection, it's a completely hands-off experience, like chatting with a real person. The delay is ~5s across all the calls. Gluing the APIs together is straightforward and intuitive.
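
A hedged sketch of what such a loop can look like (the microphone capture and playback helpers are hypothetical placeholders; the openai calls use the pre-1.0 Python library):

```python
import openai                    # assumes openai.api_key is set
from gtts import gTTS            # Google TTS wrapper
from langdetect import detect    # lightweight language detection

messages = [{"role": "system", "content": "You are a helpful assistant."}]

while True:
    # record_from_mic() is a hypothetical helper: capture audio with
    # noise/silence detection and return the path to a WAV file.
    audio_path = record_from_mic()
    with open(audio_path, "rb") as f:
        text = openai.Audio.transcribe("whisper-1", f)["text"]

    messages.append({"role": "user", "content": text})
    reply = openai.ChatCompletion.create(
        model="gpt-3.5-turbo", messages=messages,
    )["choices"][0]["message"]["content"]
    messages.append({"role": "assistant", "content": reply})

    gTTS(reply, lang=detect(reply)).save("reply.mp3")
    play_audio("reply.mp3")      # hypothetical playback helper
```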

2

farmingvillein t1_japqcq1 wrote

> But wouldn't the ChatGPT embeddings still be better? Given that they're cheap, why not use the better option?

Usually, to get the best embeddings, you need to train them somewhat differently than you do a "normal" LLM. So ChatGPT may not(?) be "best" right now, for that application.

2

Bluebotlabs t1_jar58e4 wrote

Doesn't the number of tokens increase exponentially with chat history?

1

xGovernor t1_jasx7r9 wrote

You needed the secret API key, which came with the Plus edition. Prior to Whisper, I don't believe you could obtain a secret key. Plus also gave early access to new features and got me turbo on day one. I've used it much more and got turbo working with my Plus subscription.

I had to find a workaround, but I don't feel scammed. Plus, I've been having too much fun with it.

1

CellWithoutCulture t1_javhjpc wrote

I mean... why were they not doing this already? They would have to code it, but it seems like low-hanging fruit.

> memory efficient attention. 10x-20x increase in batch size.

That seems large - which paper has that?

1

LetterRip t1_javpxbv wrote

> I mean... why were they not doing this already? They would have to code it but it seems like low hanging fruit

GPT-3 came out in 2020 (they had their initial price, then a modest price drop early on).

FlashAttention is from June of 2022.

Quantization is something we've only figured out how to do fairly losslessly recently (especially int4). Tim Dettmers' LLM.int8() is from August 2022:

https://arxiv.org/abs/2208.07339

> That seems large, which paper has that?

See

https://github.com/HazyResearch/flash-attention/raw/main/assets/flashattn_memory.jpg

>We show memory savings in this graph (note that memory footprint is the same no matter if you use dropout or masking). Memory savings are proportional to sequence length -- since standard attention has memory quadratic in sequence length, whereas FlashAttention has memory linear in sequence length. We see 10X memory savings at sequence length 2K, and 20X at 4K. As a result, FlashAttention can scale to much longer sequence lengths.

https://github.com/HazyResearch/flash-attention

1

earslap t1_jb0qamw wrote

When you feed messages into the API, there are different "roles" to tag each message with ("assistant", "user", "system"), so you provide content and say which "role" it comes from. The model continues from there in the "assistant" role. There is a token limit (set by the model), so if your context exceeds it (the combined token count across all roles), you'll need to inject the salient context from the conversation using the appropriate role.
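
Concretely, a call looks roughly like this with the pre-1.0 openai Python library (a minimal sketch; the history-trimming strategy is up to you):

```python
import openai  # assumes openai.api_key is set

messages = [
    {"role": "system", "content": "You are a terse assistant."},
    {"role": "user", "content": "Summarize attention in one line."},
]
resp = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
reply = resp["choices"][0]["message"]["content"]

# The reply comes back under the "assistant" role; append it and continue.
# When the combined history nears the model's token limit (4K for
# gpt-3.5-turbo), drop or summarize older messages before the next call.
messages.append({"role": "assistant", "content": reply})
```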

2

bdambrosio94563 t1_jb2ct4n wrote

I've spent the last week exploring gpt-3.5-turbo, then went back to text-davinci. (1) gpt-3.5-turbo is incredibly heavily censored - for example, good luck getting anything medical out of it other than 'consult your local medical professional'. It is also much more reluctant to play a role. (2) As is well documented, it is much more resistant to few-shot prompting. Since I use it in several roles, including Google-search information extraction and response composition, I find it very disappointing.

Luckily, my use case is as my personal companion/advisor/coach, so my usage is low enough that I can afford text-davinci. Sure wish there was a middle ground, though.

1

Akbartus t1_jbs0hkp wrote

Cannot agree. It is not a deal at all. Such a pricing strategy is very profitable for its creators in the long term, but it doesn't help those who would like to use it yet, due to their financial situation, cannot afford such APIs over a longer period (think of people beyond rich countries). Moreover, 1k tokens can be burned through in one bit of small talk, in a matter of a few seconds...

1