Submitted by Balance- t3_124eyso in MachineLearning
Simcurious t1_jdzatox wrote
That's not correct; the benchmark they used only contained Codeforces problems from after 2021.
From Horace's tweets:

> Considering the codeforces results in the paper (very poor!), they might have only evaluated it on recent problems.
muskoxnotverydirty t1_jdzi41h wrote
It's correct and it's not correct. The article mentions this, but then says it's likely that they weren't able to cleanly separate pre-2021 questions on the non-coding benchmarks.
bjj_starter t1_jdzo3zq wrote
But that's pure speculation. They showed that a problem existed with the training data, and OpenAI had already dealt with that problem and wasn't hiding it at all: GPT-4 wasn't tested on any of that data. Moreover, it's perfectly fine for problems like the ones it will be tested on, i.e. past problems, to be in the training data. What's important is that what it's actually tested on is not in the training data. There is no evidence that it was tested on training data, at this point.
Moreover, the Microsoft Research team was able to repeat some impressive results in a similar domain on tests that didn't exist before the training data cut-off. There isn't any evidence that this is a problem with a widespread effect on performance. It's also worth noting that it seems pretty personal for the guy behind this paper, judging by the way he wrote his tweet.
muskoxnotverydirty t1_je027xh wrote
Yeah it's speculation. I agree.
> There is no evidence that it was tested on training data, at this point.
I think what the author is trying to say is that for some of these tests there's no evidence it was tested on training data, but there's no evidence that it wasn't, and the ability to generalize in the specific domain of the tests depends on that difference. If nothing else, it would be nice for those who publish test results to show to what extent they checked whether the test data was in the training data. It seems to me that they could automate a search over the training set to see whether the exact wording of a test question appears in it.
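Something like an exact-wording / n-gram overlap check would be a start. A rough sketch of the idea (not what OpenAI actually does; `training_docs` and `test_questions` are placeholder iterables, and a real corpus would need an inverted index or Bloom filter rather than one big in-memory set):

```python
# Rough contamination check: flag test questions whose wording overlaps the
# training corpus via shared word n-grams. All names here are placeholders.

def ngrams(text, n=13):
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def find_contaminated(test_questions, training_docs, n=13):
    # Collect every n-gram in the training data (fine for a toy example;
    # a real training set would need an inverted index or similar).
    train_ngrams = set()
    for doc in training_docs:
        train_ngrams |= ngrams(doc, n)

    # A test question is flagged if it shares at least one n-gram verbatim.
    return [q for q in test_questions if ngrams(q, n) & train_ngrams]

# Toy usage:
train = ["In triangle ABC, the angle at A is 90 degrees and AB = 3 ..."]
tests = ["In triangle ABC, the angle at A is 90 degrees and AB = 3, find BC."]
print(find_contaminated(tests, train, n=8))  # flags the test question
```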
bjj_starter t1_je2ckb0 wrote
>If nothing else, it would be nice for those who publish test results to show how much they knew whether test data was in the training data.
Yes, we need this and much more information about how it was actually built, what the architecture is, what the training data was, etc. They're not telling us because of trade secrets, which sucks. "Open" AI.
sb1729 t1_jdzgfff wrote
They mention that in the article.
Simcurious t1_jdzhbyu wrote
The title implies that they evaluated it on data from before 2021, while the source says they didn't.