Submitted by Balance- t3_124eyso in MachineLearning
Nhabls t1_je94xwx wrote
Reply to comment by bjj_starter in [N] OpenAI may have benchmarked GPT-4’s coding ability on its own training data by Balance-
Not misleading. The fact that it performs so differently on easy problems it has seen versus ones it hasn't, especially when it fails so spectacularly on the latter, raises big doubts about how corrupted and unreliable their benchmarks might be.
bjj_starter t1_je98wdx wrote
Okay, but an external team tested it on coding problems that only came into existence after its training finished, and found human-level performance. I don't think your theory explains how that could be the case.
Nhabls t1_je9anrq wrote
Which team is that? The one at Microsoft that made up the human performance figures in a completely ridiculous way? Basically: "We didn't like that human pass rates were too high on the hard problems the model fails completely, so we just divided the number of accepted solutions by the entire user base." Oh yeah, brilliant.
The "human" pass rates are also composed of people learning to code trying to see if their solution works. Its a completely idiotic metric, why not go test randos on the street and declare that represents the human coding performance metric while we're at it