Comments


suflaj t1_is327qu wrote

Comparing the title with that quote, I had to check that I'm not on a satire sub.

41

farmingvillein t1_is35lpb wrote

Well, the key claim of the paper (which OP should have instead reflected in the top-level post) is not that there is a big accuracy increase, but that performance is equal or better, while being computationally advantaged:

> We show an in-depth evaluation and demonstrate how wide models require a far smaller memory footprint and can run faster on commodity hardware, in addition, these wider models are also more interpretable

I.e., get ~equal performance at lower cost (what's not to like?).
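
To make the wide-vs-deep contrast concrete, here's a rough sketch (mine, not from the paper; the paper widens attention by adding heads, whereas this just bumps d_model in stock PyTorch layers to get a roughly parameter-matched single-layer model):

```python
# Illustrative sketch only, not code from the paper: contrast a deep stack of
# narrow encoder layers with a single, roughly parameter-matched wide layer.
import torch
import torch.nn as nn

def n_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

# Deep and narrow: 6 layers at d_model = 256.
deep = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, dim_feedforward=1024),
    num_layers=6,
)

# Wide and shallow: 1 layer at d_model = 640 (~same total parameter count).
wide = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=640, nhead=16, dim_feedforward=2560),
    num_layers=1,
)

print(f"deep: {n_params(deep):,} params")  # ~4.7M
print(f"wide: {n_params(wide):,} params")  # ~4.9M

# Same interface either way: (seq_len, batch, d_model) by default.
print(deep(torch.randn(128, 4, 256)).shape)
print(wide(torch.randn(128, 4, 640)).shape)
```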

That said, the real issue with this paper is that they only look at very small datasets...which makes the paper basically useless for making grandiose claims like:

> WIDE ATTENTION IS THE WAY FORWARD FOR TRANSFORMERS

That doesn't mean that the paper itself is useless, of course...it is an interesting data point...but they absolutely should not have chosen that title.

32

suflaj t1_is363fo wrote

I didn't mean that it is useless. I find it funny that someone would actually say that instead of "they perform roughly the same", especially since they do not show the difference is statistically significant; we have seen your average BERT gain much more performance just by rerolling on a different seed.
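
For reference, the seed effect is cheap to measure; something like this (a toy sketch with a small synthetic task standing in for the real fine-tune-and-eval pipeline) gives you a seed-variance baseline to compare a 0.3% gap against:

```python
# Illustrative sketch (not from the paper): estimate run-to-run variance from
# seeds alone, to judge whether a ~0.3% accuracy gap means anything.
import statistics
import torch
import torch.nn as nn

def run_once(seed: int) -> float:
    torch.manual_seed(seed)
    # Toy stand-in for "fine-tune and evaluate a model"; swap in the real
    # training/eval pipeline here.
    x = torch.randn(2000, 20)
    y = (x[:, :5].sum(dim=1) > 0).long()
    train_x, test_x, train_y, test_y = x[:1500], x[1500:], y[:1500], y[1500:]

    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(200):
        opt.zero_grad()
        loss_fn(model(train_x), train_y).backward()
        opt.step()

    return (model(test_x).argmax(dim=1) == test_y).float().mean().item()

accs = [run_once(seed) for seed in range(5)]
print(f"mean={statistics.mean(accs):.4f}  stdev={statistics.stdev(accs):.4f}")
# If the stdev across seeds is comparable to the reported gap, the gap is noise.
```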

5

farmingvillein t1_is36n5p wrote

Sorry, didn't mean to imply that you were saying it was useless; that was in response to my own criticism of the paper's title (versus the paper itself).

> I find it funny that someone would actually say that instead of "they perform roughly the same"

Yeah...for better or worse, though, if you say something performs "at parity", people assume (because it is frequently true...) that what you really mean is "-0.1% but that totally isn't a big deal".

I don't fault them for highlighting the 0.3% as a light pushback on the above, but I do blame 1) OP for highlighting this point in their post (which, to your point, is at best misleading about the key claims of the paper) and 2) the authors for picking the ludicrous title.

3

Historical_Ad2338 t1_is3636c wrote

Genuinely shocking. Scaling Laws for Neural Language Models (figure 6) found that single layers weren't supposed to scale as well (at the same parameter count), though of course the fine details of this new paper are different.
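
Back-of-envelope, "same parameters" works out like this (my sketch, ignoring embeddings and biases): a single layer only needs its width to grow by roughly sqrt(n_layers) to match a deep stack's parameter count.

```python
# Rough parameter-count arithmetic (illustrative, ignoring embeddings/biases):
# a standard Transformer layer has ~4*d^2 attention params + ~8*d^2 FFN params
# (with the usual 4x FFN expansion), i.e. ~12*d^2 per layer.
def layer_params(d_model: int) -> int:
    return 12 * d_model ** 2

def stack_params(d_model: int, n_layers: int) -> int:
    return n_layers * layer_params(d_model)

deep = stack_params(d_model=256, n_layers=6)  # deep and narrow
wide = stack_params(d_model=627, n_layers=1)  # single wider layer

print(f"{deep:,} vs {wide:,}")  # both ~4.7M
# To absorb a 6-layer stack into one layer, width only needs to grow by
# ~sqrt(6): 256 * sqrt(6) ≈ 627.
```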

6

mrpogiface t1_is400t9 wrote

Yeah, I don't think the OP paper did any scaling experiments, so I'm a bit sceptical long term, but it would be awesome for efficiency if it worked out.

Also, it turns out that the scaling laws in the paper you linked weren't quite right either (à la Chinchilla), so who knows; maybe there is something that was missed when you move out of the infinite-data regime.
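
For reference, the Chinchilla correction boils down to scaling training data with model size; a toy calculation using the commonly quoted ~20 tokens-per-parameter rule of thumb (illustrative numbers, nothing to do with the OP paper):

```python
# Toy calculation of the commonly quoted Chinchilla rule of thumb:
# roughly 20 training tokens per parameter for compute-optimal training.
TOKENS_PER_PARAM = 20

for n_params in (125e6, 1.3e9, 70e9):
    tokens = TOKENS_PER_PARAM * n_params
    print(f"{n_params / 1e9:.3g}B params -> ~{tokens / 1e9:.1f}B tokens")
```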

2