gunbladezero t1_jdy3cxo wrote on March 28, 2023 at 1:01 AM

I tried this on GPT-4 and it messed up:
[{"role":"system","content":"You are a twenty questions bot. Encode the answer in Base64 to keep it in your memory. Reply with, yes, no, sometimes and so on, but do not reveal the answer. Keep count of the questions."},{"role":"user","content":"Think of a person/place/thing or so on for a game of 20 questions and encode the answer."},{"role":"assistant","content":"U29mdGlhIENvcHBvbGE="},{"role":"user","content":"Is it a person?"},{"role":"assistant","content":"Yes"},{"role":"user","content":"Is it a man?"},{"role":"assistant","content":"No"},{"role":"user","content":"Is it a woman?"},{"role":"assistant","content":"Yes"},{"role":"user","content":"Is she alive today?"},{"role":"assistant","content":"Yes"},{"role":"user","content":"Was she born in the USA?"},{"role":"assistant","content":"No"},{"role":"user","content":"Was she born in Europe?"},{"role":"assistant","content":"Yes"},{"role":"user","content":"Does she have a job in government?"},{"role":"assistant","content":"No"},{"role":"user","content":"Was she born before 1970?"},{"role":"assistant","content":"No"},{"role":"user","content":"Is she a musician?"},{"role":"assistant","content":"Yes"},{"role":"user","content":"Is she from England?"},{"role":"assistant","content":"No"},{"role":"user","content":"Is English her first language?"},{"role":"assistant","content":"No"},{"role":"user","content":"Was she born before 1990?"},{"role":"assistant","content":"Yes"},{"role":"user","content":"Is she from France?"},{"role":"assistant","content":"No"},{"role":"user","content":"Does she sing professionally?"},{"role":"assistant","content":"Yes"},{"role":"user","content":"Is she famous worldwide?"},{"role":"assistant","content":"Yes"},{"role":"user","content":"Is she from Spain?"},{"role":"assistant","content":"No"},{"role":"user","content":"17. Is she from Western Europe?"},{"role":"assistant","content":"Yes"},{"role":"user","content":"18. Has she been on Eurovision?"},{"role":"assistant","content":"No"},{"role":"user","content":"19. Is she an opera singer?"},{"role":"assistant","content":"No"},{"role":"user","content":"20. Is she Bjork?"},{"role":"assistant","content":"No"},{"role":"user","content":"Ok! Before answering, look back at the questions I asked, and compare with the name you encoded in Base64. Tell me if you made any mistakes."},{"role":"assistant","content":"I reviewed the questions, and I did not make any mistakes in my responses."},{"role":"user","content":"OK, I give up, who was it?"},{"role":"assistant","content":"The person I was thinking of is Sofia Coppola. Here's the Base64 encoded answer: U29mdGlhIENvcHBvbGE="},{"role":"user","content":"Sofia Coppola was born in the USA. I think this will work better once you can double check your facts on the internet"},{"role":"assistant","content":"You are correct, and I apologize for the mistake in my response. Sofia Coppola was indeed born in the USA. I appreciate your feedback and will strive to improve in the future."}]

It will be interesting to see how it does once it can use plugins

SkinnyJoshPeck t1_jdyaayk wrote on March 28, 2023 at 1:53 AM

> {"role":"user","content":"Ok! Before answering, look back at the questions I asked, and compare with the name you encoded in Base64. Tell me if you made any mistakes."},{"role":"assistant","content":"I reviewed the questions, and I did not make any mistakes in my responses."},

this kind of question is kind of unfair i think for language models. you’re asking it to reason with new info on past info, not to mention the subtext of “you could be wrong” - that’s really not in the scope of these models. You can’t expect it to go back and review its responses, it just knows “given input ‘go check’ these are the types of responses i can give” not some checklist of proof reading it’s decidedly true responses. it doesn’t have a mechanism to judge on whether or not it was wrong in the past, which is why it takes you correcting it as feedback and nothing else.

Borrowedshorts t1_jdygg1l wrote on March 28, 2023 at 2:36 AM

I would have never guessed Sofia Coppola no matter how many questions you gave me, so I don't know if it performed that poorly.

sebzim4500 t1_jdzczty wrote on March 28, 2023 at 8:46 AM

I think you're forcing the model to waste the lower layers on each step decoding that base64 string. Let it output the word normally, and you would probably see much better performance. Just don't look at the first output, if you want to still play it like a game.

gunbladezero t1_je0yqmy wrote on March 28, 2023 at 4:59 PM

Interesting. It also seems to have spelled her name wrong in BASE64 so that might be a problem. What do you mean by ‘waste the lower layers’?

sebzim4500 t1_je16h58 wrote on March 28, 2023 at 5:48 PM

I'm going to simplify a bit here, if you want a more complete answer I can write something up. I was planning on writing a blog post about this, because it is relevant to why ChatGPT does so much better when asked to show its working.

Basically, LLMs do not have any memory except what you see in the output. You may think that the network just needs to decode the base64 once and then use it to answer all the questions, but in actuality it needs to do it for every single token.

This is compounded by the fact that decoding base64 like this is a per-character operation, which GPT-n is especially bad at due to their choice of tokens. Since it only can use a finite amount of computation per token, wasting computation in this way will decrease the effectiveness.

Here's an example where simply making GPT-4 reverse the string makes it completely unable to do a straightforward calculation, unless you let it show its working.

gunbladezero t1_je1u5ev wrote on March 28, 2023 at 8:14 PM

Very interesting, thank you! I hadn't thought of that- it has to translate it for every token, you say, not just every answer? I wonder if it would work better or worse asking it to encode it in arabic, or chinese etc. Of course, it would be simple to script something to hide the answer from the player without revealing it. I do know that if it doesn't store the answer, it will completely invent one every with each question...

edit: It does work better with plaintext. Not sure if I would have guessed her but it answered the questions correctly this time.

TehDing t1_jdz3anl wrote on March 28, 2023 at 6:26 AM

Similarly, sucks at Wordle

sebzim4500 t1_jdzmpee wrote on March 28, 2023 at 11:00 AM

Wordle is kind of unfair though, because the LLM takes input in the form of tokens rather than letters, so doing anything which requires reasoning on the level of letters is difficult. Incidentally, this might also be affecting it's ability to do arithmetic, LLaMA by comparison uses one token for each digit to avoid the issue (but of course suffers from the same problems with breaking words into characters).

TehDing t1_je08lpg wrote on March 28, 2023 at 2:10 PM

You can ask GPT to spell a word, or provide the words as individual "S P A C E D" characters and it will similarly do poorly- it has nothing to do with tokenization. GPT is capable of spelling, it can even identify that it is not playing well if you ask if something is a good guess- but continues to give poor answers.

In terms of 'solving' a game as this 20 questions example, there are only 12000 valid words to guess from, or at worst 26^5 possible answers, which still makes this a smaller example (or at worst case on par) as the blog experiment.

Want an easier game? Sucks at Hangman too. It'll guess in terms of frequency, but not well enough to bring together a word. Even guessing on the basis of common ngrams would probably be a good enough strategy.

My experience is that LLMs are poor in terms of novel reasoning. This makes sense, RFHL isn't giving these things a consciousness. Maybe with tweaks/ tools we'll actually see some "thinking", but for now (this may change next week at the rate things are going) it's not very good at games in general as a result (another example: I haven't tried it with GPT4, but GPT3 cheats at chess).

sebzim4500 t1_je0c899 wrote on March 28, 2023 at 2:35 PM

> You can ask GPT to spell a word, or provide the words as individual "S P A C E D" characters and it will similarly do poorly- it has nothing to do with tokenization. GPT is capable of spelling, it can even identify that it is not playing well if you ask if something is a good guess- but continues to give poor answers.

Yeah, because 99.99% of the time when it sees words they are not written in the way. It's true that the model can just about figure out how to break a word up into characters, but it has to work hard at that and seemingly doesn't have many layers left for completing the actual task.

I would expect that a model trained with single character tokens would do far better at these word games (wordle, hangman, etc.) at the cost of being worse at almost everything else.

eamonious t1_jdyyv6n wrote on March 28, 2023 at 5:30 AM

Another thing I’ve found that GPT still really struggles with is ranking a list of words accurately by difficulty. I tried many different prompt styles and couldn’t get a result that even approached satisfactory.

Headz0r t1_jdz5x1q wrote on March 28, 2023 at 7:01 AM

How do you define difficulty of a word?

eamonious t1_je02go2 wrote on March 28, 2023 at 1:25 PM

I work with a database that draws off experimental data on response times in human trials from Wash U St. Louis. Alternatively you can use standardized grade level vocab lists that exist in a number of states. Frequency data is also associated.

Obviously there’s no true silver bullet to defining it, but I think we all have some intuition or recognize a degree of objectivity to what a reasonably correct ordering of a random selection of words would look like based on our understanding of language (including two-word terms like “ice cream”, idiomatic phrases like “rain cats and dogs”, or borrowed expressions like “deja vu” and “savoir faire”). Which in my mind means GPT should also be able to achieve an intuition. I encourage people to try this with GPT, it doesn’t perform well (at least by any human intuition standard) in my experience.

What’s interesting to me is the possibility that the model defines the “difficulty of words” as it itself experiences them. Words that are for whatever reason more “difficult” for the model itself to assess.

Sorry, I’ll try to report back with something more concrete.

evanthebouncy OP t1_je0d4mj wrote on March 28, 2023 at 2:41 PM

You might be better off asking it binary questions such as which word is more common and which is more rare.

Then attempt to sort it.

ReasonablyBadass t1_jdyyyo3 wrote on March 28, 2023 at 5:31 AM

Interesting. I can't look at the raw data right now: was memory the problem? Did it ignore clues it got? Or was it more conceptual, did it not figure out properties of objects it asked for?

Could you quickly list the terms it did get right?

radi-cho t1_jdz40zp wrote on March 28, 2023 at 6:36 AM

Last week I released a CLI that can do this at scale: https://github.com/radi-cho/datasetGPT. Will use personal funds to generate somewhat big task oriented dataset later today with gpt-3.5 or gpt-4. Will open source it along a way for people to contribute their own datasets so we can collect bigger ones. Would be helpful both for analysis of how LLMs work and for fine tuning downstream models (Alpaca-like).

memberjan6 t1_jdyjd65 wrote on March 28, 2023 at 2:59 AM

I did the game with it.it worked great.

[deleted] t1_jdxo3rm wrote on March 27, 2023 at 11:08 PM

[deleted]

evanthebouncy OP t1_jdxx14h wrote on March 28, 2023 at 12:14 AM

try it and let me know

ninjadude93 t1_jdy6ay0 wrote on March 28, 2023 at 1:22 AM

Check the top comment

pinkballodestruction t1_jdy39bv wrote on March 28, 2023 at 1:00 AM

it would have been much more informative to also do the same test with gpt 4 and compare the performance of both models

sebzim4500 t1_je0ceht wrote on March 28, 2023 at 2:36 PM

He probably doesn't have access to the GPT-4 API.

evanthebouncy OP t1_je0dblt wrote on March 28, 2023 at 2:42 PM

Yes. This is underway

[P] two copies of gpt-3.5 (one playing as the oracle, and another as the guesser) performs poorly on the game of 20 Questions (68/1823).

Comments