
DaBobcat t1_je12b4q wrote

Here, OpenAI and Microsoft were evaluating GPT-4 on medical problems. In section 6.2 they specifically say they found strong evidence that it was trained on "popular datasets like SQuAD 2.0 and the Newsgroup Sentiment Analysis datasets". In appendix section B they explain how they measured whether something appeared in the training data. The point is, I think benchmarks are fairly pointless if the training dataset is private and no one can verify that the model wasn't trained on the test set, which, they specifically say, in many cases it was.
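For what it's worth, the kind of verification that's impossible with a private training set is conceptually simple. Here's a minimal sketch of a naive contamination check via verbatim n-gram overlap (my own toy illustration, not the paper's actual method; the function names and the n=8 threshold are arbitrary assumptions):

```python
# Toy contamination check: flag test examples whose word n-grams
# appear verbatim in the training text. This is only possible if
# you can actually inspect the training corpus.

def ngrams(text, n=8):
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contaminated(test_examples, train_text, n=8):
    train_grams = ngrams(train_text, n)
    return [ex for ex in test_examples if ngrams(ex, n) & train_grams]

train = "the quick brown fox jumps over the lazy dog near the river bank today"
tests = [
    "quick brown fox jumps over the lazy dog near the river",  # overlaps train
    "completely unrelated sentence about machine learning benchmarks",
]
print(contaminated(tests, train))  # flags only the first example
```

Real contamination audits are more involved (fuzzy matching, canaries, memorization probes), but even this trivial check requires access to the training data, which is exactly what's private here.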
