Submitted by Balance- t3_124eyso in MachineLearning
Nhabls t1_je94xwx wrote
Reply to comment by bjj_starter in [N] OpenAI may have benchmarked GPT-4’s coding ability on its own training data by Balance-
Not misleading. The fact that it performs so differently on easy problems it has seen versus ones it hasn't, especially when it fails so spectacularly on the latter, raises big doubts about how corrupted and unreliable their benchmarks might be.
bjj_starter t1_je98wdx wrote
Okay, but an external team tested it on coding problems that only came into existence after its training finished, and found human-level performance. I don't think your theory explains how that could be the case.
Nhabls t1_je9anrq wrote
Which team is that? The one at Microsoft that made up the human performance figures in a completely ridiculous way? Basically: "We didn't like that human pass rates were too high on the hard problems the model fails completely, so we just divided the number of accepted solutions by the entire user base." Oh yeah, brilliant.
The "human" pass rates are also composed of people learning to code trying to see if their solution works. Its a completely idiotic metric, why not go test randos on the street and declare that represents the human coding performance metric while we're at it