
Nhabls t1_je94xwx wrote

Not misleading. The fact that it performs so differently on easy problems it has seen vs. ones it hasn't, especially when it fails so spectacularly on the latter, does raise big doubts about how corrupted and unreliable their benchmarks might be.

6

bjj_starter t1_je98wdx wrote

Okay, but an external team tested it on coding problems that only came into existence after its training finished, and found human-level performance. I don't think your theory explains how that could be the case.

1

Nhabls t1_je9anrq wrote

Which team is that? The one at Microsoft that made up the human-performance figures in a completely ridiculous way? Basically, "We didn't like that pass rates were too high for humans on the hard problems the model fails completely, so we just divided the number of accepted solutions by the entire user base." Oh yeah, brilliant.

The "human" pass rates are also composed of people learning to code trying to see if their solution works. Its a completely idiotic metric, why not go test randos on the street and declare that represents the human coding performance metric while we're at it

1