sebzim4500 t1_je16h58 wrote

I'm going to simplify a bit here; if you want a more complete answer I can write something up. I was planning on writing a blog post about this, because it's relevant to why ChatGPT does so much better when asked to show its working.

Basically, LLMs do not have any memory except what you see in the output. You might think the network only needs to decode the base64 once and can then use the result to answer all the questions, but in actuality it has to redo that decoding for every single token it generates.

This is compounded by the fact that decoding base64 like this is a per-character operation, which GPT-n is especially bad at due to its choice of tokens. Since it can only use a finite amount of computation per token, wasting computation in this way leaves less for the actual task.
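
To make that concrete, here's a minimal sketch (assuming the `tiktoken` package, which exposes the same BPE vocabularies the OpenAI models use) of how differently a plain word and its base64 encoding get tokenized:

```python
# Minimal sketch: compare how a plain word and its base64 encoding tokenize.
# Assumes the `tiktoken` package; cl100k_base is the GPT-4/ChatGPT vocabulary.
import base64
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

word = "elephant"
b64 = base64.b64encode(word.encode()).decode()  # 'ZWxlcGhhbnQ='

print([enc.decode_single_token_bytes(t) for t in enc.encode(word)])
# one or two multi-letter chunks
print([enc.decode_single_token_bytes(t) for t in enc.encode(b64)])
# many short, arbitrary-looking chunks that don't line up with the decoded characters
```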

Here's an example where simply making GPT-4 reverse the string makes it completely unable to do a straightforward calculation, unless you let it show its working.
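
Not the exact prompts from that example, but a rough sketch of the comparison using the openai Python client (the model name and prompts here are just illustrative):

```python
# Rough sketch of the comparison (assumes the openai Python client >= 1.0;
# the prompts and model name are illustrative, not the exact example above).
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

task = "Reverse the string '4821' and add 17 to the resulting number."

# Answer-only: the character manipulation and the arithmetic all have to happen
# with no visible scratch space.
print(ask(task + " Reply with only the final number."))

# Show-your-working: the intermediate tokens act as the model's memory.
print(ask(task + " Work through it step by step, then give the final number."))
```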

2

sebzim4500 t1_je0c899 wrote

> You can ask GPT to spell a word, or provide the words as individual "S P A C E D" characters and it will similarly do poorly- it has nothing to do with tokenization. GPT is capable of spelling, it can even identify that it is not playing well if you ask if something is a good guess- but continues to give poor answers.

Yeah, because 99.99% of the time when it sees words they are not written that way. It's true that the model can just about figure out how to break a word up into characters, but it has to work hard at that and seemingly doesn't have many layers left for completing the actual task.

I would expect that a model trained with single character tokens would do far better at these word games (wordle, hangman, etc.) at the cost of being worse at almost everything else.
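
For what it's worth, the spaced-out form really is a completely different token sequence from the normal word, so whatever the model knows about one doesn't carry over for free to the other. A minimal sketch, again assuming `tiktoken`:

```python
# Minimal sketch (assumes `tiktoken`): the normal word and its spaced-out
# spelling share no tokens, so spelling knowledge doesn't transfer for free.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

word = "hangman"
spaced = " ".join(word.upper())  # 'H A N G M A N'

print(enc.encode(word))    # a few multi-letter tokens
print(enc.encode(spaced))  # roughly one token per letter, with entirely different IDs
```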

2

sebzim4500 t1_jdzmpee wrote

Wordle is kind of unfair though, because the LLM takes input in the form of tokens rather than letters, so anything which requires reasoning at the level of letters is difficult. Incidentally, this might also be affecting its ability to do arithmetic; LLaMA, by comparison, uses one token for each digit to avoid the issue (but of course suffers from the same problems with breaking words into characters).
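
A quick illustration of the digit point (a sketch assuming `tiktoken`; LLaMA's tokenizer emits one token per digit instead):

```python
# Minimal sketch (assumes `tiktoken`): GPT-style BPE splits a number into
# multi-digit chunks, so digit-by-digit arithmetic doesn't line up with tokens.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

n = "1234567"
print([enc.decode([t]) for t in enc.encode(n)])
# chunks of several digits, e.g. ['123', '456', '7'] -- not one token per digit
```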

3

sebzim4500 t1_jdzczty wrote

I think you're forcing the model to waste the lower layers on each step decoding that base64 string. Let it output the word normally and you would probably see much better performance. Just don't look at the first output if you still want to play it like a game.
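
Roughly what I mean, as a sketch (`ask_model` is just a stand-in for whatever chat call the game already makes):

```python
# Rough sketch of the workaround: let the model write the secret word in plain
# text so later turns can condition on it cheaply, and simply hide that first
# line from the player. `ask_model` is a stand-in, not a real API.
def start_game(ask_model):
    setup = ask_model(
        "Pick a secret word for hangman. On the first line write the word in "
        "plain text. On the lines after that, show only the blanks."
    )
    secret, _, visible = setup.partition("\n")
    print(visible)            # only this part is shown to the player
    return secret.strip()     # stays in the transcript, so the model never re-decodes it
```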

3

sebzim4500 t1_jdeq6uo wrote

There may have been pretraining on how to use tools in general, but there is no pretraining on how to use any particular third-party tool. You just write a short description of the endpoints and it gets included in the prompt.
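
Roughly, the whole "integration" looks something like this (a sketch; the endpoint, URL, and calling convention are made up):

```python
# Rough sketch (the endpoint and calling convention are made up): the only
# plugin-specific part is a short description of the endpoints pasted into the
# prompt; the model decides when and how to call them.
TOOL_DESCRIPTION = """\
You can call this API when it helps answer the user:

GET https://api.example.com/weather?city=<name>
  Returns the current weather for the given city as JSON.

To call it, reply with a single line: CALL <url>
"""

def build_messages(user_question: str) -> list[dict]:
    return [
        {"role": "system", "content": TOOL_DESCRIPTION},
        {"role": "user", "content": user_question},
    ]
```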

The fact that this apparently works so well is incredible, probably the most impressed I've been with any development since the original ChatGPT release (which feels like a decade ago now).

12

sebzim4500 t1_jc6jye3 wrote

The company doesn't always win; sometimes the open source product is simply better. See Stable Diffusion vs DALL-E, or Linux vs Windows Server, or Lichess vs Chess.com, etc.

Of course that doesn't mean it will be used more, but that isn't the point.

5