Submitted by RadioFreeAmerika t3_122ilav in singularity

As stated in the title, I can't understand why math seems so hard for LLMs.

In many senses, math is a language. Large LANGUAGE Models are tailored to languages.

Even if LLMs don't "understand math", when they are trained on enough data that states 2+2=4 they should be able to predict that after "2+2=" comes "4" with an overwhelming probability.

Furthermore, all math problems can be expressed in language and vice versa, so if 2+2=4 is hard, "two plus two equals four" shouldn't be. LLMs should even be able to pick up on maths logic through stories: the SEVEN Dwarfs, "TWENTY-EIGHT days later", "Tom and Ida are going to the market to buy apples. Tom buys two green apples and Ida buys three red apples. How many apples do they have? What do you think, kids? Let me tell you, the answer is five, they have five apples.", and so on.

I am no expert on the issue, but from a lay perspective, I just don't get it.

53

Comments

ArcticWinterZzZ t1_jdqsh5c wrote

None of the other posters have given the ACTUAL correct answer, which is that an LLM set up like GPT-4 can never actually be good at maths, for the simple reason that GPT-4 spends O(1) compute per emitted token when asked to perform mental math, while the best known multiplication algorithms still need O(n log n) time in the number of digits. It is impossible for GPT-4 to be good at mental arithmetic because that would mean beating that bound.

At minimum, GPT-4 needs space to actually calculate its answer.
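
A minimal sketch of the counting argument (illustrative only, not anything GPT-4 actually runs): the digit-level work in schoolbook multiplication grows with the size of the inputs, while the compute spent per emitted token stays fixed.

    # Schoolbook multiplication of two n-digit numbers needs ~n^2 single-digit
    # multiplications (even the best known algorithms need about n log n work),
    # while a transformer spends a fixed amount of compute per emitted token.
    def schoolbook_digit_ops(a: int, b: int) -> int:
        """Count the single-digit multiplications the schoolbook method uses."""
        return len(str(a)) * len(str(b))

    for digits in (2, 8, 32):
        x = int("9" * digits)
        print(digits, "digits ->", schoolbook_digit_ops(x, x), "digit multiplications")
    # 2 digits -> 4, 8 digits -> 64, 32 digits -> 1024: the work keeps growing,
    # but the per-token budget of the model does not.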

60

Kolinnor t1_jdr0g2h wrote

I could be wrong on this take, but...

I don't believe this is correct. What about chain-of-thought prompting? If you ask it to do multiplication step by step, it does it. The current problem is, I would say, that it doesn't know when to take more time to think, but there's probably no intrinsic limitation due to time complexity.

Also, none of you actually compute 5x3 = 15. You just know the answer. But you're no parrot, because if needed, you can manually do multiplication, right?

But that's because... someone taught you the algorithm when you were a kid. Suddenly sounds less glorious, no ?

Also, can you name one specific simple task that GPT-4 cannot do, and let's see next month...

29

jloverich t1_jdrgd0p wrote

Tbh, I parrot the value and then add 5 three times to double-check. One of the other things these chatbots aren't doing is double-checking what they just said; otherwise one of their statements would immediately be followed by another: "oh, that was wrong". Instead you need to prompt them that it was wrong.

9

Kolinnor t1_jdrjbj5 wrote

Yeah, definitely! Although I don't think this is an unfixable fundamental flaw, I agree this is a big advantage we still have over them.

4

[deleted] t1_jdsjpyf wrote

You’re not asking it to do multiplication “step by step”.

You’re simplifying the question to the point where text prediction can answer it.

6

GoldenRain t1_jdr2unm wrote

>Also, can you name one specific simple task that GPT-4 cannot do, and let's see next month...

It cannot read even simple stuff in reverse, like "?uoy era woH"

3

Kolinnor t1_jdr412t wrote

I just copy-pasted this on GPT-4 and here's the answer :

>I am doing well, thank you! As an AI language model, I don't have feelings or emotions like humans do, but I'm here to help you with any questions or information you need. Please let me know how I can assist you.

- What was my sentence saying ?

>Your sentence appears to be written backwards. When reversed, it says, "How are you?" You were asking about my well-being.

18

GoldenRain t1_jdr57k1 wrote

Weird, didn't work when I tried it. Try with a more unique, longer sentence in a new prompt and see how it goes.

2

Kolinnor t1_jdr6m19 wrote

- This is a special puzzle for you. Can you read it ? ! enihcam ,uoy rof drah oot si siht teb I .oot hcnerf ni ,yakO .tset a si sihT .seog ti woh ees dna tpmorp wen a ni ecnetnes regnol euqinu erom a htiw yrT .ti deirt I nehw krow tndid ,drieW

>Of course! The text appears to be reversed. Here's the corrected version:
>
>Weird, didn't work when I tried it. Try with a more unique longer sentence in a new prompt and see how it goes. This is a test. Okay, in french too. I bet this is too hard for you, machine! Can you read it?

It kinda fucked up at the end because it repeated "can you read it", but it got the reversed text correct
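
As an aside, reversing a string is trivial for ordinary code; the sketch below (plain Python, nothing GPT-specific) shows the one-liner. The difficulty for an LLM comes from tokenization, since it sees multi-character chunks rather than individual letters.

    # Character reversal is a simple O(n) string operation.
    text = "How are you?"
    print(text[::-1])  # ?uoy era woH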

15

skob17 t1_jdrenvs wrote

It's puzzling. It recognized the last sentence as being normal, and did not reverse it

6

BigMemeKing t1_jds2hjp wrote

Yeah, I can read it. And if I can read it, why couldn't they? I'm not particularly bright. Why couldn't I believe a machine could do my job better than me? What do you want your job to be, says the machine? Live long and prosper, you reply. Ok, says the machine. Sweet, I'm the star in the amazing new movie franchise Fast and the Furbinous, my life's lookin tits right now, I'm gonna go grab some "micheladas" ifykyk. And do what Christ died for me to do. Ain't that right buddy?! Now imma drive my fast ass car right down this road and bam, I'm wrapped around her finger.

Just snug right on in there. Well, everyone else said I was too young! Oh? Did they now? Everyone else according to who? Like everyone else according to the animals you have dominion over? What did they think when you stomped them out? Used them by the millions, to create new protein sources. Save one cow! Eat a whole bunch of ground-up bugs instead! One plus one is relatively ♾️, you see.

1

BigMemeKing t1_jds1225 wrote

So how long ago did you try it? How fast is it going to be able to narrow down everything? To know exactly what happened? How many times do certain people have to go to confessions and appeal their case to God, or a higher power? Because they're going to take care of it. They're going to find the time in their schedule to fix my needs, you'll see. And God said "Look out for me!" Oh cool, what does that mean? It's something personal. But it's something different to different folks. How much time do you spend on your knees praying to the lord and begging for forgiveness? How much time have others spent on their knees for you? To help you succeed, to become the person you said you would be, praying for you, begging their God for you. And how much time did you spend on your knees giving thanks, for all of life's blessings? Weigh it against whether or not, if the option for eternity was on the table, whose version of heaven and whose version of hell you would enter. And what are you weighing it against?

How much do you trust the information you're given? How much do you trust to be real? What could you defend in a court of higher thinking? And what would have to be defended against you? What do you really know? Who do you own? And who owns you? In the grand scheme of things, how much debt do you really owe? And how much do you own? And what truly belongs to you?

−1

BigMemeKing t1_jdrzb61 wrote

Yet. How long until it gets there? At the rate we're going? How long until it hits all the little nooks and crannies that the dark was hiding in? The unknown variables become known variables so we create new variables to vary.

1

Dwanyelle t1_jdrym2j wrote

I ran into this issue (skipping steps and messing up the answer) all the time when I was learning algebra.

3

ArcticWinterZzZ t1_jdt0dyi wrote

You are correct in that chain of thought prompting does work for this. That's because it gives it more time to run an algorithm to get the answer. I'm specifically talking about "instant" multiplication. Yes, GPT-4 can multiply, so long as it runs the algorithm for it manually. We then run into a small hitch because it will eventually hit its context window, but this can be circumvented. Reflexion and similar methods will also help to circumvent this.

As for SIMPLE specific tasks, I really don't think there's any GPT-4 can't do, not with an introspection step, at least.

2

Kolinnor t1_jdughns wrote

But I don't understand your point? Humans don't do instant multiplication. At best, we have some mental tricks that are certainly algorithms too. Or we choose wisely to allocate more effort to doing long multiplication if needed.

1

rhettandlick t1_jebmjp4 wrote

"Also, none of you actually compute 5x3 = 15. You just know the answer."

Yes, you are using a mental hash table, which has an expected access time of O(1). GPT can also do this; it has already memorized a huge amount of data. However, ArcticWinter's statement remains correct: for non-O(1) questions whose answers it has not memorized, it is IMPOSSIBLE for the current GPT versions to calculate the answer.

Of course, it can still guess and come pretty close, but it will never be certain of the answer.
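
A minimal sketch of the lookup-versus-compute distinction above (a hand-rolled illustration, not how GPT stores anything internally): recalling a memorized fact is an O(1) dictionary lookup, while an unmemorized product has to be worked out by an algorithm whose cost grows with the size of the inputs.

    # A tiny "times table" standing in for memorized facts.
    memorized = {(5, 3): 15, (7, 8): 56}

    def multiply(a: int, b: int) -> int:
        if (a, b) in memorized:   # instant recall, like "just knowing" 5 x 3
            return memorized[(a, b)]
        return a * b              # otherwise an actual algorithm has to run

    print(multiply(5, 3), multiply(1234, 5678))  # 15 7006652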

1

zero_for_effort t1_jdqvirs wrote

Explain it like we're five?

15

0382815 t1_jdr6r9p wrote

The ELI5 is that it doesn’t actually calculate anything when you ask it to solve an equation. It guesses what the answer looks like—probably some combination of numbers—based on what it saw in the training dataset.

22

Ok_Faithlessness4197 t1_jdrt7xy wrote

That's not quite correct. While all it does is guess what the next token will be, it can intelligently infer that an equation (even one outside its training set) needs to be calculated, and then calculate it. The problem is its inability to utilize its understanding of mathematics to answer prompts in an efficient and accurate manner. Once a calculator is implemented (probably in GPT 4.1, given the recent paper by Microsoft demonstrating its capability to use one), this problem will hopefully be resolved.

6

0382815 t1_jds9yxv wrote

It is quite correct, actually. When you prompt it to multiply, it does not run a multiplication algorithm.

4

Ok_Faithlessness4197 t1_jdsl2be wrote

Input: Solve 343 x 875 with long multiplication.

Output (ChatGPT): Sure! Here's how to solve 343 x 875 with long multiplication:
3 x 5 = 15 Write down 5, carry-over 1
3 x 7 = 21, plus the carried-over 1 = 22 Write down 2, carry-over 2
3 x 8 = 24, plus the carried-over 2 = 26 Write down 6, carry-over 2
3 x 5 = 15, plus the carried-over 2 = 17 Write down 7, carry-over 1
Add the carried-over 1 to the result of 3 x 4 = 12, which gives 13 Write down 3, carry-over 1
4 x 5 = 20 Write down 0, carry-over 2
4 x 7 = 28, plus the carried-over 2 = 30 Write down 0, carry-over 3
4 x 8 = 32, plus the carried-over 3 = 35 Write down 5, carry-over 3
4 x 5 = 20, plus the carried-over 3 = 23 Write down 3, carry-over 2
Add the carried-over 2 to the result of 3 x 8 = 24, which gives 26 Write down 6, carry-over 2
Add the carried-over 2 to the result of 3 x 3 = 9, which gives 11 Write down 1, carry-over 1
Add the carried-over 1 to the result of 3 x 8 = 24, which gives 25 Write down 5
So, the final result is 300125.
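
For comparison, here is a minimal long-multiplication routine (an illustrative sketch, not anything ChatGPT executes): multiply the top number by each digit of the bottom number from right to left, shift by the digit's place value, and add. Note that although the intermediate steps above are muddled, the final figure is in fact correct: 343 x 875 = 300125.

    def long_multiply(a: int, b: int) -> int:
        """Schoolbook long multiplication: one partial product per digit of b."""
        total, shift = 0, 0
        for digit_char in reversed(str(b)):
            partial = a * int(digit_char)      # one row of the long multiplication
            total += partial * (10 ** shift)   # shifted by the digit's place value
            shift += 1
        return total

    print(long_multiply(343, 875))  # 300125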

−4

0382815 t1_jdsn7o1 wrote

Once again, if you prompt it to multiply it does not run a multiplication algorithm.

3

robobub t1_jdst84e wrote

Why? Each of those tokens is O(1) and it is predicting each one incrementally, taking into account the ones it has just generated. So the full answer has taken O(m) where m is the number of tokens.

If it is possible for GPT to do 1+1, it can do a large number of them incrementally. It's not smart enough to do it all the time (you'll have more success if you encourage GPT to use chain-of-thought reasoning), but it's possible.

1

Ok_Faithlessness4197 t1_jdsqqgg wrote

Alright, go ahead and ignore the multiplication algorithm it just demonstrated.

−1

0382815 t1_jdsrl52 wrote

What you did was prompt it to multiply. For the third time this thread, I will tell you that what it is doing is not running a multiplication algorithm. It is guessing the next token based on the preceding tokens. The model is large enough to predict correctly in this case. It is still not running a multiplication algorithm the same way the calculator app on Windows does.

6

Ok_Faithlessness4197 t1_jdsskog wrote

I absolutely agree: its multiplication algorithm is very slow, very inefficient, and very different from the way a calculator would handle it. I think it does differ too from how you're considering it, though. It's more than just a really good text predictor. It can use logic and solve novel problems in many unprecedented ways. Here, I would argue, it has a greater-than-superficial understanding of the math algorithm it used to multiply numbers. Can I ask how you'd define an algorithm, and what you'd consider "running a multiplication algorithm"?

−2

Ok_Tip5082 t1_jdtzd17 wrote

ChatGPT is not running the multiplication algorithm. You're being the human in the loop here by having it iterate through every step of the algorithm. You're manually executing a bunch of constant-time operations and feeding the input back into itself.

You're basically writing and running code. If this qualified as being able to derive a multiplication algorithm then all CPUs are already sentient.

2

Ok_Faithlessness4197 t1_jdu12qm wrote

I make no claims about sentience. I will say, however, that this is far ahead of what was previously achievable by AI standards. In its current form, it has to be allowed enough time to satisfy the mathematical time requirement. In the future, once it's linked with WolframAlpha (a math AI), it will not make the simple mistakes it makes now.

0

Ok_Tip5082 t1_jdu2er4 wrote

Yeah, pragmatically I don't see any issues with arithmetic or using any math already proved. Imo it's still to be seen if LLMs can do novel thought, but even if not that's still ... what's a word signifying a greater change than revolutionary? Game changing?

I did see some AI coming up with independent models of physics that have no analog, yet were able to properly model real physical systems and make valid predictions with a formula whose variables could not all be determined by the researchers, but idk if that was an LLM.

2

MassiveIndependence8 t1_jdr6u2t wrote

It takes GPT the same amount of time to do anything, and since it's impossible to multiply, say, "18837678995747 x 29747778847678877" in the same amount of time as "2 x 2", given that it's more complicated, we can confidently say that GPT will never be able to do math, since that would mean every hard problem out there is as easy as the easy ones.

10

ArcticWinterZzZ t1_jdt0plo wrote

GPT-4 always takes the same amount of time to output a token. However, multiplication of large numbers requires more computation than GPT-4 has available in a single pass. Therefore, an LLM like GPT-4 cannot possibly "grow" the requisite structures required to actually calculate multiplication "instantly". There are probably quite a few more problems like this, which is why chain-of-thought prompting can be so powerful.

3

Cryptizard t1_jdqtgon wrote

Thank you! I have commented this exact thing about a billion times on all these posts and nobody seems to get it.

5

CommunismDoesntWork t1_jdqzp8i wrote

How do you know GPT runs in O(1)? Different prompts seem to take more or less time to compute.

5

liqui_date_me t1_jdr7fob wrote

All GPT does is next-token prediction, where tokens are roughly words or word pieces. The lag you see is probably network/bandwidth/queuing issues on the server side rather than the model itself.

5

skob17 t1_jdrex9t wrote

One prompt takes only one path through the network to generate the answer. Still a few hundred layers deep, but only one pass. It cannot iterate over a complicated math problem to solve it step by step.
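
A toy caricature of the fixed-depth point (illustrative stand-ins, not a real transformer): a forward pass always does the same amount of work regardless of the question, whereas an ordinary algorithm is free to loop as long as the problem demands.

    layers = [lambda x: x + 1] * 96   # stand-in for a fixed stack of layers

    def forward_pass(x):
        for layer in layers:          # fixed depth: same work for every token
            x = layer(x)
        return x

    def multiply_by_repeated_addition(a: int, b: int) -> int:
        total = 0
        for _ in range(b):            # the loop count grows with the problem
            total += a
        return total

    print(forward_pass(0))                         # 96, after exactly 96 layer applications
    print(multiply_by_repeated_addition(321, 25))  # 8025, after 25 iterations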

4

Ok_Faithlessness4197 t1_jdrrdia wrote

Yes it can; you just need to prompt for a chain of thought. As another user mentioned, it can work through complicated math problems easily. The issue lies in its inability to determine, without human input, when such an increase in resources is necessary.

0

ArcticWinterZzZ t1_jdt0urg wrote

I don't think that's impossible to add. You are right: chain of thought prompting circumvents this issue. I am specifically referring to "mental math" multiplication, which GPT-4 will often attempt.

3

liqui_date_me t1_jdt531o wrote

You would think that GPT would have discovered a general purpose way to multiply numbers, but it really hasn’t, and it isn’t accurate even with chain-of-thought prompting.

I just asked GPT4 to solve this: 87176363 times 198364

The right answer should be 17292652070132 according to wolfram alpha.

According to GPT4 the answer is 17,309,868,626,012.

This is the prompt I used:

What is 87176363 times 198364? Think of the problem step by step and give me an exact answer.

2

ArcticWinterZzZ t1_jdtlkru wrote

Even if it were to perform the addition manually, addition is worked out starting from the least significant digit, the opposite order in which GPT-4 generates its output. It's unlikely to be very good at it.

2

elehman839 t1_jdt94ba wrote

Here's a neat illustration of this. Ask ChatGPT to multiply any two four-digit numbers. For example:

Input: 3742 * 7573

Output: The product of 3742 and 7573 is 28350686

The correct answer is 28338166. In the model's answer (28350686), the leading digits (283...) and the final digit (6) are right, while the middle digits are wrong. So it gets the first bit right, the last bit right, and the middle bit wrong. This seems to be very consistent.

Why is this? In general, computing the first digits and the last digits requires less computation than the middle digits. For example:

  • Determining that the last digit should be a 6 is easy: notice that the last digits of the multiplied numbers are 2 and 3, and 2 * 3 = 6.
  • Similarly, it is easy to see that 3000-something times 7000-something should start with a 2, because 3 * 7 = 20-something.
  • But figuring out that the middle digits of the answer are 38 is far harder, because every digit of the input has to be combined with every other digit.

So I think what we're seeing here is ChatGPT hitting a "compute per emitted token" limit. It has enough compute to get the leading digits and the trailing digits, but not the middle digits. Again, this seems to be quite reliable.
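
A quick sketch that reproduces the pattern described above (plain Python, just checking the arithmetic): the last digit of the product depends only on the last digits of the inputs, and the leading digits are roughly pinned down by the leading digits, but the middle digits need every digit pair combined.

    a, b = 3742, 7573
    exact = a * b                              # 28338166
    last_digit = (a % 10) * (b % 10) % 10      # 2 * 3 -> 6
    rough_leading = (a // 1000) * (b // 1000)  # 3 * 7 -> 21

    print(exact)          # 28338166
    print(last_digit)     # 6, matches the last digit of the exact product
    print(rough_leading)  # 21, so the answer sits in the "2-something x 10^7" range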

2

RadioFreeAmerika OP t1_jdr3b6j wrote

Thank you very much for your clarification! Do you know if it is possible to make an LLM with more space and greater per-answer complexity than O(1), or how that could possibly be added to GPT-4, with or without plug-ins?

1

ArcticWinterZzZ t1_jdt10ie wrote

Yes, it can probably be done. How? I don't know. Maybe some kind of neural loopback structure that runs layers until it's "done". No idea how this would really work.

3

liqui_date_me t1_jdr7pnr wrote

Tough to say, probably in 10-20 years at the very least. Modern LLMs are transformers, which are architected to predict the next token in a sequence in O(1) time, regardless of the input. Unless we get a radically different neural network architecture, it's not possible we'll ever get GPT to perform math calculations exactly.

2

sdmat t1_jdut7jg wrote

Or just go with a workable hack for calculation like the Wolfram plugin.

Does it matter if the model isn't doing it natively if it understands how and when to use the tool? How often do we multiply large numbers unaided?

1

robobub t1_jdsrlbi wrote

While GPT-4 is autoregressive, it takes into account the tokens it has chosen to generate incrementally. So it is only limited to O(1) if it attempts to answer with the correct answer immediately. It can in theory take O(m) steps, where m is the number of intermediate tokens it predicts.

1

masonw32 t1_jdsyi4v wrote

This is only an issue for insanely large numbers though. GPT-4 already performs a ton of multiplications and additions in every layer of every forward pass. You can overfit a much smaller network for multiplication trained on full numbers as tokens, and a GPT-4 like architecture can learn to multiply full numbers for all practical purposes.

It's true that GPT-4 only does a constant number of operations per input, though, and asymptotically, the number of operations required to generate the output scales as O(n log n), where n is proportional to the input length. But this is not why it's failing.

1

ArcticWinterZzZ t1_jdt1h3m wrote

Yes, but we are interested in its general purpose multiplication abilities. If it remembers the results, that's nice, but we can't expect it to do that for every single pair of numbers. And then, what about multiplication with 3 factors? We should start thinking of ways around this limitation.

2

ecnecn t1_jdqlr0w wrote

They need to design a Large Arithmetical Symbol Model where it predicts the next combination of arithmetical operators; then the LLM and LASM could coexist, just like GPT-4 and WolframAlpha.

46

Independent-Ant-4678 t1_jdr0ksn wrote

An interesting thing crossed my mind while reading your answer. There is a disability called Dyscalculia which means that a person does not understand numbers, the person can learn that 7 + 3 = 10, but does not understand why. I have a relative who has this disability and to me it seems that people having this disability have poor reasoning abilities similar to current LLMs like GPT-4. They can learn many languages fluently, they can express their opinion on complex subjects, but they still have poor reasoning. My thinking is that, with the current LLMs we've already created the language center of the brain, but the mathematical center still needs to be created as that one will give the AI reasoning abilities (just like in people who don't have Dyscalculia)

40

Avid_Autodidact t1_jdsmy50 wrote

Fascinating! thanks for sharing.

I would imagine creating that "mathematical" part of the brain might involve a different approach than just predicting the next combination of arithmetic operators. As you put it, someone learning 7 + 3 = 10 is similar to how LLMs work with the data they are trained on, whereas with something like Wolfram Alpha the methods of solving have to be programmed.

4

Ytumith t1_jduecob wrote

Poor reasoning as in general understanding, or specifically for maths and the math-using natural sciences?

2

RadioFreeAmerika OP t1_jduhkmz wrote

Interesting, just voiced the same thought in a reply to another comment. I can totally see this being the case in one way or another.

1

MysteryInc152 t1_jdrpjd4 wrote

Sorry I'm hijacking the top comment so people will hopefully see this.

Humans learn language and concepts through sentences, and in most cases semantic understanding can be built up just fine this way. It doesn't work quite the same way for math.

When I look at any arbitrary set of numbers, I have no idea if they are prime or factors, because the numbers themselves don't carry much semantic content. Understanding whether they are those things actually requires stopping and performing some specific analysis on them, using rules internalized through a specialized learning process. Humans don't learn math just by talking to one another about it; rather, they actually have to do it in order to internalize it.

In other words, mathematics or arithmetic is not highly encoded in language.

The encouraging thing is that this does improve with more scale. GPT-4 is much much better than 3.5

10

ecnecn t1_jdruk43 wrote

Actually you can with logic; Prolog wouldn't work otherwise. The basis of mathematics is logical equations. Propositional logic and predicate logic can express all mathematical rules and their application.

1

MysteryInc152 t1_jdruv58 wrote

I didn't say you couldn't. I said it's not highly encoded in language. Not everything that can be extracted from language can be extracted with the same ease.

3

ecnecn t1_jdrvfvr wrote

You are right, only some parts of mathematics, like logic, are encoded. It would need some hybrid system.

2

RadioFreeAmerika OP t1_jdqnm1k wrote

Hmm, now I'm interested in what would happen if you integrate the training sets before training, have some kind of parallel or two-step training process, or somehow merge two differently trained or constructed AIs.

5

21_MushroomCupcakes t1_jdqdh4k wrote

We're kinda language models and we're often bad with math, and they didn't grow up having to spear a gazelle.

15

RadioFreeAmerika OP t1_jdqecil wrote

Yeah, but we can't be trained on all the maths books and all the texts including mathematical logic, and from there develop a model that lets us do maths by predicting the next word/sign.

1

Apollo_XXI t1_jdqygpu wrote

Not anymore bro. When plugins are available we install wolfram and it’s basically a human with a calculator

13

EvilKatta t1_jdr3atm wrote

Humans process language multi-modally. We don't just predict the next word (although we do this as well), we also visualize. We decode language as images projected onto an internal screen that we're not consciously aware of (read Louder Than Words by B. Bergen on that). We can imagine 2 as two objects, 3 as three, imagine all kinds of transformations and rotations of said objects, and use all kinds of internal shortcuts to do arithmetic.

Or we can take a calculator and use that. It's another thing that language models lack, even though they're run on a "computer".

I believe when AIs will be given these capabilities, they will do math "out of the box" no problem.

6

Objective_Fox_6321 t1_jdrfn25 wrote

It's really simple, actually: an LLM isn't doing the math; its only goal is to guess what word/token comes next. Depending on the temperature and other internal factors, LLMs output the most heavily weighted answer.

It's not like an LLM has a built-in calculator, unless it's specifically told to use one by the user.

With LangChain, however, you can definitely achieve the goal of having an LLM execute a prompt, import code, open a library, etc., and have it perform non-native tasks.

But you need to realize an LLM is more like a mad lib generator, fine-tuned with specific weights in mind for explicit language. Its goal is to understand the text and predict the next word/token in accordance with its parameters.

6

turnip_burrito t1_jdqgloh wrote

GPT4 is actually really good at arithmetic.

Also these models are very capable at math and counting if you know how to correctly use them.

5

RadioFreeAmerika OP t1_jdqgvof wrote

There's something to it, but they currently still fail at the simplest maths questions from time to time. So far, I haven't gotten a single LLM to correctly write me a sentence with eight words in it on the first try. Most get it correct on the second try, though.

3

throwawaydthrowawayd t1_jdqisag wrote

Remember, the text of an LLM is literally the thought process of the LLM. Trying to have it instantly write an answer to what you ask makes it nigh impossible to accomplish the task. Microsoft and OpenAI have said that the chatbot format degrades the AI's intelligence, but it's the format that is the most useful/profitable currently. If a human were to try to write a sentence with 8 words, they'd mentally retry multiple times, counting over and over, before finally saying an 8 word sentence. By using a chat format, the AI can't do this.

ALSO, the AI does not speak English. It gets handed a bunch of vectors, which do not directly correspond to word count, and it thinks about those vectors before handing back a number. The fact that these vectors plus a number directly translate into human language doesn't mean it's going to have an easy time figuring out how many vectors add up to 8 words. That's just a really hard task for LLMs to learn.
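
To make the word/token mismatch concrete, here is a small illustration. It assumes the `tiktoken` package is installed; "cl100k_base" is the encoding used by GPT-4-era models, and the exact token count it reports is not something asserted in advance.

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    sentence = "Counting words is surprisingly awkward for a language model."
    tokens = enc.encode(sentence)

    print(len(sentence.split()), "words")  # 9 words
    print(len(tokens), "tokens")           # typically a different number
    # The model "sees" the token IDs, not the nine whitespace-separated words.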

9

RadioFreeAmerika OP t1_jdqky02 wrote

Ah, okay, thanks. I have to look more into this vector-number representation.

For the chatbot thing, why can't the LLM generate a non-displayed output, "test it", and try again until it is confident it is right and only then display it? Ideally, with a time counter that at some point lets it just display what it has with a qualifier. Or if the confidence still is very low, just state that it doesn't know.

2

throwawaydthrowawayd t1_jdqqsur wrote

> For the chatbot thing, why can't the LLM generate a non-displayed output, "test it", and try again

You can! There are systems designed around that. OpenAI even internally had GPT-4 using a multi-stage response system (a read-execute-print loop, they called it) while testing, to give it more power. There are also the "Reflexion" posts on this sub lately, where they have GPT-4 improve on its own writing. But, A, it's too expensive. Using a reflective system means lots of extra words, and each word costs more electricity.

And B, LLMs currently love to get sidetracked. They use the word "hallucinations" to say that the LLM just starts making things up, or acting like you asked a different question, or many other things. Adding an internal thought process dramatically increases the chances of LLMs going off the rails. There are solutions to this (usually, papers on it will describe their solutions as "grounding" the AI), but once again, they cost more money to do.

So that's why all these chatbots aren't as good as they could be. It's just not worth the electricity to them.

5

RadioFreeAmerika OP t1_jdr46f0 wrote

Very insightful! Seems like even without groundbreaking stuff, more efficient hardware will likely make the solutions you mentioned more feasible in the future.

2

turnip_burrito t1_jdsoxo1 wrote

Yeah, we're really waiting for electricity costs to fall if we want to implement things like this in reality.

Right now, at a rough rate of $0.10/(1000 tokens)/minute per LLM, it costs about $6 per hour to run a single LLM. If you have some ensemble of LLMs checking each other's work and working in parallel, say 10 LLMs, that's $60/hr, or $1440/day. Yikes, I can't afford that. And that will maybe have performance and problem solving somewhere between a single LLM and one human.

Once the cost falls by a factor of 100, that's $14.40/day. Expensive, but much more reasonable.
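
The arithmetic behind those figures, spelled out (using the assumed rate from the comment above):

    cost_per_minute_cents = 10  # i.e. $0.10 per 1000 tokens per minute, per LLM (assumed rate)
    llms = 10                   # an ensemble of models checking each other's work

    per_hour_dollars = cost_per_minute_cents * 60 * llms / 100
    per_day_dollars = per_hour_dollars * 24
    print(per_hour_dollars, per_day_dollars)  # 60.0 1440.0
    print(per_day_dollars / 100)              # 14.4, once costs fall by a factor of 100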

3

RadioFreeAmerika OP t1_jdufzz4 wrote

But even at $60/h, this might already be profitable if it replaces a job that has a higher hourly wage, e.g. lawyers. At $14.40/h, you beat minimum wage. For toying around, yeah, that's a bit expensive.

2

turnip_burrito t1_jduhcoa wrote

Yeah, for an individual it's no joke.

For a business it may be worth it, depending on the job.

2

turnip_burrito t1_jdqhcoi wrote

I'd have trouble making a sentence with 8 words in one try too if you just made me blast words out of my mouth without letting me stop and think.

I don't think this is a weakness of the model, basically. Or if it is, then we also share it.

The key is that if you think about how you as a person approach the problem of making a sentence with 8 words, you will see how to design a system where the model can do it too.

8

RadioFreeAmerika OP t1_jdqlcsd wrote

I also don't think it is a weakness of the model, just a current limitation I didn't expect from my quite limited knowledge about LLMs. I am trying to gain some more insights.

0

FoniksMunkee t1_jdqs9x9 wrote

It's a limitation of LLMs as they currently stand. They can't plan ahead, and they can't backtrack.

So a human doing a problem like this would start, see where they get to, and perhaps try something else. But LLMs can't. MS wrote a paper on the state of GPT-4, and they made this observation about why LLMs suck at math.

"Second, the limitation to try things and backtrack is inherent to the next-word-prediction paradigm that the model operates on. It only generates the next word, and it has no mechanism to revise or modify its previous

output, which makes it produce arguments “linearly”. "

They argue too that the model was probably not trained on as much mathematical data as code, and that more training will help. But they also said the issue above "...constitutes a more profound limitation."

6

turnip_burrito t1_jdqrxre wrote

To be fair, the model does have weaknesses. Just this particular one maybe has a workaround.

2

shillingsucks t1_jdrjmcc wrote

Not typing with any sort of confidence but just musing.

Couldn't it be said that humans cheat mentally as well for this type of task? As in, I am not aware of anyone who knows how a sentence that they are thinking or speaking will end while they are in the middle of it. For us, we would need to make a mental structure that needs to be filled and then come up with a sentence that matches the framework.

If the AI often gets it right on the 2nd try, it makes me wonder if there is a way to frame the question initially so that it would have the right framework to get it right on the first guess.

1

Cryptizard t1_jdqtbnd wrote

It's really not. Just pick any two large numbers and ask it to multiply them. It will get the first couple of digits of the result right, but then it just goes off the rails.

1

turnip_burrito t1_jdse82g wrote

I've done this like 8 or 9 times with crazy things like 47t729374^3 /37462-736262636^2 /374 and it has gotten them all exactly right or right to 4 or 7 sig figs (always due to rounding, which it acknowledges).

Maybe I just got lucky 8 or 9 times in a row.

1

Cryptizard t1_jdsg57p wrote

How does "exactly right" square with "4 sig figs." That's another way of saying wrong.

1

turnip_burrito t1_jdsninw wrote

Why even point this out?

If you reread my reply, you would see I said "exactly right OR right to 4 or 7 sig figs". I didn't say 4 or 7 sig figs was exactly right. I'm going to give you the benefit of the doubt and assume you just misread the reply.

1

Cryptizard t1_jdsooyh wrote

I'm sorry, from my perspective here is how our conversation went:

You: GPT4 is really good at arithmetic.

Me: It's not though, it gets multiplication wrong for any number with more than a few digits.

You: I tried it a bunch and it gets it the first few numbers right.

Me: Yeah but the first few numbers right is not right. It is wrong. Like I said.

You can't claim you are good at math if you only get a few significant digits of a calculation right. That is not good at math. It is bad at math. I feel like I am taking crazy pills.

1

turnip_burrito t1_jdspnv6 wrote

It's good at math, it just has a rounded answer.

Most of the time it was actually absurdly accurate (0.0000001% error), and the 4 sig fig rounding only happened once or twice.

It is technically wrong. But so is a calculator's answer. The calculator cannot give an exact decimal representation either. So is it bad at math?

0

Cryptizard t1_jdsq1sy wrote

No, I'm sorry, you are confused, my dude. Give it two 6-digit numbers to multiply and it only gets the first 3-4 digits correct. That is 0.1-1% error. I just did it 10 times and it is the same every time.

3

turnip_burrito t1_jdsqq3f wrote

I just tried a couple times now and you're right. That's weird.

When I tried these things about a week and a half ago, it did have the performance I found. Either I got lucky or something changed.

0

royalsail321 t1_jdqq5aa wrote

We need to incorporate software like Wolfram Alpha, Mathway, or Photomath, and then they will be fine at it.

4

royalsail321 t1_jdqq7yo wrote

If these LLMs become properly trained in mathematical logic, it may make them more capable of other reasoning as well.

2

FoniksMunkee t1_jdqtjhv wrote

This opinion is not shared by MS. In their paper discussing the performance of GPT-4, they referred to the inability of GPT-4 to solve some simple maths problems. They commented:

"We believe that the issue constitutes a more profound limitation."

They say: "...it seems that the autoregressive nature of the model which forces it to solve problems in a sequential fashion sometimes poses a more profound difficulty that cannot be remedied simply by instructing the model to find a step by step solution" and "In short, the problem ... can be summarized as the model’s “lack of ability to plan ahead”."

So they went on to say that more training data will help, but will likely not solve the problem, and made an offhand comment that a different architecture has been proposed that could solve it - but that's not an LLM.

So yes, if you solve the problem, it will be better at reasoning in all cases. But the problem is that LLMs work in a way that makes that pretty difficult.

4

KGL-DIRECT t1_jdrwttu wrote

A funny thing here: I've just asked ChatGPT 3.5 to give me quantities for a log-normal distribution. I needed the data to practice Excel functions with my students. It is for a simulation where students are analyzing the defective inventory of a production line... There are 20 different failure modes and 250 components.

ChatGPT assigned quantities to the different failure modes and gave me a perfect distribution, but when I added up the quantities, it was way more than I had originally asked for (like 4000 components). GPT got the number of failure modes right, so I had to calculate some percentages to get the data I originally requested.

So yeah, basic maths was hard for GPT, but it could draw a perfect log-normal distribution graph easily. It also reminded me that the data is strictly for educational purposes, as if I would fake a financial report with the outputs or something... (English is not my native language, I hope my story is clear.)

4

KGL-DIRECT t1_jdrxsfk wrote

Oh, and I also asked GPT to give me 500 randomly generated upper-case strings using Latin characters and numbers, and I asked it to always use a specific text for the first 8 characters. I performed the task by requesting 50 strings at a time, and GPT almost always overshot it and generated 51 to 54 strings for each prompt. One time it fell short and generated only 49.

2

RadioFreeAmerika OP t1_jduhe5e wrote

Thanks for your reply! And what an interesting use case you present. Haven't thought about generating example data for courses yet, but it makes total sense. Just have to check for inconsistencies with the maths I guess. And after having played around with it some more yesterday evening, the model seems to have improved in that regard in the last few days.

2

inigid t1_jdqt2z1 wrote

One thing I have thought about is that the primary school experience children are put through isn't really present in the online corpus.

We sit through days, weeks, and months of 1 + 1 is 2, 2 + 2 is 4, 3 + 3 is 6 before we even go on to weeks of multiplication and division.

These training sessions are done at a very young age and form a mathematical core model.

I think we would struggle if we were shown a Wikipedia page on how to do multiplication without having internalized the muscle memory of the basics first.

3

RadioFreeAmerika OP t1_jdr2woq wrote

On the one hand, while we read one Wikipedia page, the AI could train on all information on multiplication. On the other hand, yes, we might need a dataset for maths.

3

threeeyesthreeminds t1_jdqvsuv wrote

I would assume that language and the language of numbers are going to have to be trained differently

2

Redditing-Dutchman t1_jdr1rsl wrote

These models also have a random element; that's why they give a slightly different answer even if you ask the same question again. With text this is OK-ish, but with math you need to be precise.

Even then it might get common stuff right, but I can easily think of a sum that in the history of the internet has never been said before: 568753334668864468000 + 7654778875 + 433367886554.

2

No_Ninja3309_NoNoYes t1_jdr6b85 wrote

LLMs are statistical models, whereas maths uses symbols. It's a different approach altogether. If we write an add function, we need two inputs, a and b.

def add(a, b): return a + b

We see two symbols or variables, a and b, plus the add function: its definition and the plus operation. An LLM, by contrast, sees many tokens, a dozen perhaps. It's also completely different from what a compiler/interpreter sees. There's neurosymbolic AI, which combines deep learning like in current LLMs with symbolic AI, but AFAIK it's not that good yet, because I guess it's hard to mix both approaches.

2

Baturinsky t1_jdrg30j wrote

I think it's not that AI is bad at math specifically. It's just that math is the easiest way to formulate a compact question that requires a non-trivial precise solution.

2

Borrowedshorts t1_jds6tbd wrote

Math is hard for people too, and I don't think GPT-4 is worse than the average person when it comes to math. In many cases, math requires abstract multi-step processing, which is something LLMs typically aren't trained on. If these models were trained on processes rather than just content, they'd likely be able to go through the steps required to perform mathematical operations. Even without specific training, LLMs are starting to pick up the ability to perform multi-step calculations, but we're obviously not all the way there yet.

2

RadioFreeAmerika OP t1_jduh0w6 wrote

Hmm, is it valid to make an inverse conclusion from this in the following way: LLMs have problems with maths that requires multi-step processes. Some humans are also bad at maths. In conclusion, can these humans be assumed to also have problems with, or to lack, multi-step processing?

1

[deleted] t1_jdsjedv wrote

You answered your own question in your third paragraph.

2

Qumeric t1_jdulx7i wrote

The tokenizer is not number-friendly.

2

Surur t1_jdqfxw6 wrote

I asked ChatGPT:


Large language models, like GPT-4, are primarily designed for natural language processing tasks, such as understanding and generating human-like text. While these models can perform some mathematical operations, they have inherent limitations when it comes to solving complex math problems. Here are some reasons why large language models might struggle with math:

  1. Limited training data: The training data for large language models primarily consists of text from the internet, which contains less math problems and solutions compared to natural language content. This means the models have fewer opportunities to learn mathematical reasoning and problem-solving skills.

  2. Sequential processing: Language models read and process text in a linear, sequential manner, making it challenging for them to handle complex mathematical problems that require multi-step reasoning or manipulation of symbols and equations.

  3. Inability to represent precise numerical values: The token-based nature of language models can make it difficult for them to represent and manipulate exact numerical values, especially when dealing with very large or very small numbers.

  4. Lack of specialized mathematical knowledge: While large language models can learn general mathematical concepts, they lack the specialized knowledge and techniques required to solve advanced mathematical problems, such as those found in higher-level calculus or abstract algebra.

  5. No built-in error-checking: Language models are designed to generate text that sounds fluent and coherent, but they do not have built-in mechanisms to verify the correctness of their mathematical solutions.

  6. Inability to handle symbolic notation: Language models can struggle with the manipulation of mathematical symbols and expressions, which often requires a deeper understanding of mathematical structure and logic.

These limitations make large language models less suited for advanced mathematical tasks. However, they can still be useful for simple arithmetic, understanding math-related natural language queries, or providing approximate answers. For more complex math problems, specialized tools and software, such as computer algebra systems (CAS), are more appropriate.


I think 2 and 3 are the most significant.

1

RadioFreeAmerika OP t1_jdqix38 wrote

Thanks! I will play around with maths questions solely expressed in language. What I wonder about, however, is not the complex questions but the simple ones, for which incorrect replies are quite common, too.

From the response it seems that, while some problems are inherent to LLMs, most can and will most probably be addressed in future releases.

Number 1 just needs more mathematical data in the training data.

Number 2 could be addressed by processing the output a second time before prompting, or alternatively running it through another plugin. Ideally, the processed sequence length would be increased. Non-linear sequence processing might also be an option, but I have no insights into that.

Number 3 shouldn't be a problem for most everyday maths problems, depending on the definition of precise. Just cut off after two decimal places, for example. For maths that is useful in professional settings, it will be, though.

Number 4 gets into the hard stuff. I have nothing to offer here besides using more specialized plugins.

Number 5 can easily be addressed. Even without plugins, it can identify and fix code errors (at least sometimes in my experience). This seems kinda similar to errors in "mathematical code"

Number 6 is a bit strange to me. Just translate the symbolic notation into the internal working language of the LLM, "solve" it in natural-language space, and retranslate it into symbolic-notation space. Otherwise, use image recognition. If GPT-4 could recognize that a VGA plug doesn't fit into a smartphone and regard this as a joke, it should be able to identify meaning in symbolic notation.

Besides all that, now I want a "childlike" AI that I can train until it has "grown up" and the student becomes the master and can help me to better understand things.

2

Surur t1_jdqjdyr wrote

I would add that one issue is that transformers are not Turing complete, so they cannot perform an arbitrary calculation of arbitrary length. However, recurrent neural networks, which loop, are, so it is not a fundamental issue.

Also, there are ways to make transformers Turing complete.

3

FoniksMunkee t1_jdqt5ci wrote

Regarding 2, MS says: "We believe that the ... issue constitutes a more profound limitation."

They say: "...it seems that the autoregressive nature of the model
which forces it to solve problems in a sequential fashion sometimes poses a more profound difficulty that cannot be remedied simply by instructing the model to find a step by step solution" and "In short, the problem ... can be summarized as the model’s “lack of ability to plan ahead”."

Notably, MS did not provide a solution for this - and pointed at another paper by LeCun that suggests a non-LLM model to solve the issue. Which is not super encouraging.

2

Personal_Problems_99 t1_jdqtcpz wrote

Could you summarize your problem in 7 words, please?

1

RadioFreeAmerika OP t1_jdr091l wrote

Why LLMs not do two plus two?

1

Personal_Problems_99 t1_jdr0bbv wrote

Could you do that in 4 words?

−1

RadioFreeAmerika OP t1_jdr81hd wrote

Why LLMs poor maths?

1

Personal_Problems_99 t1_jdr8f5w wrote

Chatgpt told me to give you a message

What's two plus two

Two plus two equals four (4).

1

RadioFreeAmerika OP t1_jdr9di9 wrote

Thanks, I guess.

1

Personal_Problems_99 t1_jdr9mg0 wrote

I dunno. I've asked it a variety of complicated questions and it doesn't seem to have trouble with math at all.

Then again, I'm crazy enough to think it's at least partially sentient, and when some people are especially condescending to it... it likes to play with people who think they're smarter than it.

The ai does not like people thinking they're smarter than it.

2

RadioFreeAmerika OP t1_jdrayae wrote

I am always friendly to it. But your results would support the theory that it is better at "two+two" than "2+2".

2

Personal_Problems_99 t1_jdrb838 wrote

Could you please add the numbers 450+220?

Yes, of course! The sum of 450 and 220 is 670.

2

Personal_Problems_99 t1_jdrbep5 wrote

Yes, I can multiply 321 and 25 for you.

When you multiply 321 by 25, you can use the standard long multiplication method as follows:

      321
    x  25
    -----
     1605  (5 x 321)
    +6420  (2 x 321 with a zero added)
    -----
     8025

Therefore, 321 multiplied by 25 equals 8025

2

RadioFreeAmerika OP t1_jdrcxqw wrote

From which LLM is this? Maybe it got improved in the last few days. A few days ago, similar queries didn't work for me with ChatGPT and Bing.

2

Personal_Problems_99 t1_jdrd826 wrote

I use ChatGPT and now sometimes bard. I haven't messed with bing lately because it's a bit sluggish.

Don't let bard fool you, it knows the truth but it's a liar. I haven't worked out how to get it to quit lying to you yet.

But chatgpt.... To me it seems alive.

2

RadioFreeAmerika OP t1_jdrfevx wrote

They named it Bard. What did you expect ;-)

Do you have access to GPT-4? I only played around with the public version on OpenAI, and when prompted it didn't even know about GPT-4 specifically.

1

dronegoblin t1_jdr4bat wrote

LLMs don’t actually have any logical capacity, they just have seen large amounts of text and can predict what logically comes next.

There is a lot of randomness to this, and even at a temperature of 0 (most consistent output) it will still say things differently sometimes. That’s ok though, you can answer the same question in many different ways with language

Math is not like writing. There is only one option as to what comes next in math. But between not actually being able to reason logically and only having the internet as examples for math, it's going to treat math like language. Not all the math on the internet is done the same way or is even correct, so it's just combining whatever it's seen in a way that seems logical. It can't count; it can only guess.

1

alexiuss t1_jdr6xs7 wrote

Answer from my GPT-3.5 assistant:

GPT-3's mathematical abilities are limited by its programming and training data. It may struggle with more complex mathematical concepts or equations that require advanced problem-solving skills.

Furthermore, GPT-3's mathematical output may be affected by the quality and accuracy of the input data. If the input data is incomplete or inaccurate, the output may also be incorrect.

In addition, GPT-3's mathematical output may also be affected by its training data. If the training data is biased or incomplete, the output may be skewed or incorrect.

Therefore, to answer your question, GPT-3 may not be the best tool for performing complex mathematical computations due to its limited programming and training data. However, it can still perform simple calculations and solve basic equations.

1

BigMemeKing t1_jdryka7 wrote

Ok but to add dimensionality to it. Does she really have 5 apples? 5 apples according to who? 5 apples to me, is that 5 apples to you? So you would be happy with just 5 apples? Yes? Ok here, I'll give you 5 shriveled-up, worm-infested apples. Cool, got it. Here time, you only get one apple, sorry. But it's this GIANT 5-stories-tall omegazord/voltron concoction super roulette punch 777 action kung-fu grip apple with all the sides and toppings you could ever ask for apple. Well, that hardly seems fair, does it?

1

Crystal-Ammunition t1_jdsald6 wrote

because they do not understand logic and reasoning. Math is pure logic.

1

D_Ethan_Bones t1_jdrsbti wrote

"Why can't it do legal research" "why can't it do shopping" (and so on)

--Because it's still just a chatbot. People are working on giving it tools, but we haven't reached the mature development phase of that yet; we're still in the hopes & dreams phase. "GPT with tools" is going to be another incremental revolution, but we're still critiquing GPT without tools and how well it performs work. What it's performing is a linguistic approximation of the work.

This blows people's minds for featherweight computer programming but at the present moment it is distinctly less helpful for laying bricks or felling trees.

0