
fmai t1_j6hjauf wrote

GPT-3 ranks relatively low on SuperGLUE because it was not finetuned on the SuperGLUE tasks, whereas T5 and the other high scorers were. The remarkable feat of GPT-3 is that it reaches impressive performance with just few-shot prompting, i.e., without any gradient updates, which hadn't been demonstrated before.
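To make that contrast concrete, here is a minimal sketch of few-shot prompting on a SuperGLUE-style entailment task (RTE): no finetuning, just a couple of demonstrations in the prompt. The model name and the OpenAI completions-style call are illustrative assumptions, not a claim about how the GPT-3 paper's evaluations were actually run.

```python
# Few-shot prompting sketch: the frozen model sees two worked examples
# and must continue the pattern for a new premise/hypothesis pair.
import openai  # assumes the (pre-1.0) OpenAI Python client

PROMPT = """Decide whether the premise entails the hypothesis.

Premise: A dog is running in the park.
Hypothesis: An animal is outside.
Answer: yes

Premise: The meeting was cancelled.
Hypothesis: Everyone attended the meeting.
Answer: no

Premise: {premise}
Hypothesis: {hypothesis}
Answer:"""

def few_shot_entailment(premise: str, hypothesis: str) -> str:
    response = openai.Completion.create(
        model="text-davinci-003",  # illustrative model choice
        prompt=PROMPT.format(premise=premise, hypothesis=hypothesis),
        max_tokens=5,
        temperature=0.0,  # greedy decoding for a classification-style task
    )
    return response.choices[0].text.strip()
```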

As to your questions:

  1. AFAIK, OpenAI hasn't published any numbers themselves, and nobody outside of OpenAI has API access to ChatGPT yet, which makes it impractical to run it over the thousands of examples a benchmark typically contains (see the evaluation sketch after this list). So, no, the performance improvement hasn't been quantified so far.

  2. No, there is no quantitative analysis. Anecdotally, most people seem to agree that ChatGPT hallucinates far less than GPT-3, but you can definitely get it to generate bullshit if you keep digging, so it's far from perfect. Depending on what story they want to tell, some people will emphasize one side or the other. Take it all with a grain of salt until we get solid numbers.

  3. AFAIK, LLMs are fantastic at closed-book question answering, where the model can't consult external resources. I believe a T5-based model was the first to show that it can answer trivia questions well using only the knowledge stored in its parameters. For open-book QA you need to augment the LLM with some retrieval mechanism (which ChatGPT doesn't have yet), so you can expect dedicated retrieval-augmented models to be much better in that regard; the second sketch below illustrates the idea.
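For point 1, here is roughly what quantifying the improvement would involve: running the model over a full benchmark split and scoring the outputs. The sketch below uses SuperGLUE's BoolQ task via the Hugging Face `datasets` library; `query_model` is a hypothetical placeholder for the API access that doesn't exist yet.

```python
# Sketch of benchmark evaluation: run the model on every example of a
# SuperGLUE task (BoolQ) and compute accuracy. `query_model` is a
# hypothetical callable that sends a prompt to the model and returns text.
from datasets import load_dataset

def evaluate_boolq(query_model) -> float:
    data = load_dataset("super_glue", "boolq", split="validation")
    correct = 0
    for example in data:
        prompt = (
            f"{example['passage']}\n"
            f"Question: {example['question']}\n"
            f"Answer yes or no:"
        )
        predicted_yes = query_model(prompt).strip().lower().startswith("yes")
        correct += int(predicted_yes == bool(example["label"]))  # label 1 == yes
    return correct / len(data)
```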
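And for point 3, a toy illustration of what "augmenting the LLM with retrieval" means: fetch relevant passages from an external corpus first, then let the model answer conditioned on them. The naive word-overlap retriever here is purely illustrative; real systems use BM25 or dense retrieval (e.g. DPR).

```python
# Toy retrieval-augmented QA: retrieve passages, stuff them into the
# prompt, and answer from the provided context instead of relying on
# parametric knowledge alone. `query_model` is again a hypothetical callable.
from typing import Callable, List

def retrieve(question: str, corpus: List[str], k: int = 3) -> List[str]:
    # Naive word-overlap scoring, for illustration only.
    terms = set(question.lower().split())
    ranked = sorted(corpus, key=lambda doc: -len(terms & set(doc.lower().split())))
    return ranked[:k]

def open_book_qa(question: str, corpus: List[str],
                 query_model: Callable[[str], str]) -> str:
    context = "\n".join(retrieve(question, corpus))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return query_model(prompt)
```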


mettle OP t1_j6im3ap wrote

Thanks for this thoughtful answer.

Re: 2, are there solid numbers we would conceptually even be able to get? Are there any known ongoing efforts?
