farmingvillein t1_j8qipd4 wrote

Let's think step by step:

You:

> I don't think the Related Works section of that paper provides any useful references.

Your own response to the question that was posed:

> https://arxiv.org/abs/1805.04623
>
> https://arxiv.org/abs/1702.04521

There is no possible way that you actually read the Related Works section you dismissed, given that the papers you cited are already covered in the same references you dismissed.

E.g., "Sharp Nearby, Fuzzy Far Away" is directly discussed in the cited "Transformer-XL":

> Empirically, previous work has found that LSTM language models use 200 context words on average (Khandelwal et al., 2018), indicating room for further improvement

> Simply comparing RNNs with memory and RNNs without memory doesn't tell you anything about how fast the memory fades out, or that it never winds up being bigger than a Transformer's

I never said this, so I'm not sure what your argument is.

> we know perfectly well that Transformers make excellent use of context windows larger than 50 or 200 tokens (as my two references show)

Neither of the papers you link to (assuming you are talking about your own comment at https://www.reddit.com/r/MachineLearning/comments/1135aew/r_rwkv4_14b_release_and_chatrwkv_a_surprisingly/j8pg3g7/) makes any reference to Transformers.

If your claim is that the papers indicated that RNNs have a small window (sure) and that Transformers have a longer one, then you're arguing (as you seem to be throughout your post) against a strawman. Re-read what I actually wrote:

> in practice, their effective "context window" often doesn't look much different than a reasonable transformer, when we look at performance metrics against long sequences.

My statement here is an empirical one about performance--which, among other things, is why I reference Dai et al., who (among others!) provide a fairly extensive breakdown of the empirical performance differences between RNN- and transformer-type architectures on long text sequences.

The whole point is that the OP said that RNNs were attractive because of their theoretically infinite context--but my response was that 1) we don't really see that in practice when we try to measure it directly (as both of our sources point out), and 2) we don't see evidence of superior long-distance behavior when testing against real-world(ish) datasets that should theoretically reward it. Both of these points are covered if you follow the reference I shared (or, as I noted, most reasonable "long-distance transformer" papers).
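To make concrete what "measure it directly" means here, a toy sketch in the spirit of context-truncation experiments (this is my own illustrative setup, not the actual methodology of any paper cited in this thread): generate a sequence whose only predictive signal sits a fixed number of tokens back, then evaluate a simple lookup-table "model" whose context is truncated to the last k tokens. Accuracy jumps once k reaches the true dependency range and plateaus afterward; the plateau point is the measured "effective context".

```python
import random
from collections import Counter, defaultdict

def make_sequence(n, lag=4, vocab=4, p_copy=0.9, seed=0):
    """Token t copies token t-lag with prob p_copy, else is uniform noise,
    so all predictive information sits exactly `lag` tokens back."""
    rng = random.Random(seed)
    seq = [rng.randrange(vocab) for _ in range(lag)]
    for t in range(lag, n):
        if rng.random() < p_copy:
            seq.append(seq[t - lag])
        else:
            seq.append(rng.randrange(vocab))
    return seq

def accuracy_with_context(seq, k, max_lag=8):
    """Next-token accuracy of a lookup-table 'model' whose context is
    truncated to the last k tokens (fit on first half, eval on second)."""
    half = len(seq) // 2
    tables = [defaultdict(Counter) for _ in range(max_lag + 1)]
    for t in range(max_lag, half):
        for d in range(1, max_lag + 1):
            tables[d][seq[t - d]][seq[t]] += 1

    def train_acc(d):  # how predictive is the single token at lag d?
        hits = sum(c.most_common(1)[0][1] for c in tables[d].values())
        total = sum(sum(c.values()) for c in tables[d].values())
        return hits / total

    best = max(range(1, k + 1), key=train_acc)  # best lag visible within k
    correct = total = 0
    for t in range(half + max_lag, len(seq)):
        pred = tables[best][seq[t - best]].most_common(1)[0][0]
        correct += pred == seq[t]
        total += 1
    return correct / total

seq = make_sequence(20000)
for k in (1, 2, 4, 6):
    print(f"context={k:>2}  accuracy={accuracy_with_context(seq, k):.3f}")
```

With the true dependency at lag 4, accuracy sits near the ~0.25 chance baseline for k < 4 and jumps to roughly p_copy once k >= 4. The same truncate-and-measure logic (with perplexity in place of accuracy, and a trained LM in place of lookup tables) is the kind of procedure behind effective-context estimates like the ~200-token LSTM figure quoted above.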

(As with all things research...someone may come out with a small modification tomorrow that invalidates everything above--but, for now, it represents the broad public (i.e., non-private) understanding of architecture behaviors.)

−1

gwern t1_j8s2du5 wrote

> There is no possible way that you actually read the Related Works section you dismissed, given that the papers you cited are already covered in the same references you dismissed.

Telling someone to read the Related Works section of every one of a dozen papers in the Related Works section of a paper is a ridiculous thing to suggest, and no, I did not recurse n levels deep in a breadth-first search. I read the Related Works of that paper, as I said ("I don't think the Related Works section of that paper"), noted that it was a bunch of memory-related papers which might or might not cite the actually relevant research I had in mind, but life was too short to queue up a dozen papers just to check their RW sections when I already knew some useful references. Giving someone a random reference and telling them to manually crawl the literature is not helpful. In contrast, the two references I provided directly bore on the question; they didn't maybe cite papers which might bury something relevant in a footnote, or cite papers which might someday answer the question...

> I never said this, so I'm not sure what your argument is.

I was pointing out why it was irrelevant to bring up a paper which "compares w/ and w/o memory." Mildly interesting, but such a comparison cannot show what was actually asked about: the effective memory of RNNs. Of course it is better to have (any) memory than none.

> which, among other things, is why I reference Dai et al, who (among others!) do a fairly extensive breakdown of empirical performance differences of RNNs- versus transformer-type architectures against long text sequences.

Dai would in fact have been useful, had you referenced it in your original comment. Unless you mean 'vaguely gestured in the direction of a paper with 50+ references (35 in the RW section alone), any of which could have been relevant, where the benchmarking by Dai was not highlighted in the paper to begin with, and where the relevant context work is not mentioned in Dai's abstract but buried at the end of the paper (with the RNN results hidden inside a table), so you just have to already know it's there' and call that 'referencing it'. Then sure, yeah, that was a useful reference. Thanks for the input.

> If your claim is that the papers indicated that RNNs have a small window (sure) and that Transformers have a longer one, you're arguing (as you seem to be in your entire post) again against a strawman.

It's not a strawman. It's not obvious a priori that Transformers would work so much better, or that RNN histories fade out so fast, which is why it had to be empirically established that the history fades out completely, as opposed to any of the other reasons RNNs could underperform (maybe they have history but can't learn a good algorithm exploiting their memory, say, or they could but are poorly optimized - there are so many ways for NNs to break). People were surprised by how well Transformers work. It is completely understandable that the OP would expect RNN history to work better than it does, and would want some hard, citeable evidence that it works so badly that Transformers, with their apparently brutal hard cutoff, wind up having much closer to 'infinite context' than RNNs themselves.

Thus, it's useful to provide references showing that. (Not references to unspecified references which may or may not show that - gl.)

1

farmingvillein t1_j8s7ygo wrote

This...is pretty astounding. Just have the grace to admit you were wrong, and move on.

> Telling someone to read the Related Works section of every one of a dozen papers in the Related Works section of a paper is a ridiculous thing to suggest

Then how can you possibly say:

> I don't think the Related Works section of that paper provides any useful references.

?

This is hardcore trolling. You can, and frequently do, do better than this.

You are literally pushing posts that are factually incorrect, and that you either know are factually incorrect, or are too lazy to validate either way.

This is the type of thing which blows up post quality in this sub.

> Giving someone a random reference and telling them to manually crawl the literature is not helpful.

This...is ridiculous. This is--traditionally--a very academic-friendly sub. This is how research works. "Here is where you can start a literature review on a bundle of related papers" is an extremely classic response which is generally considered helpful to complex and nuanced questions.

And the underlying issue is actually very complex, as evidenced in part by the fact that your references do not actually answer the question. "Go read the related works" can be obnoxious when there are one or two papers that do answer the question--but that is not the case here.

> In contrast, the two references I provided directly bore on the question

No, they did not. They did not touch on Transformers versus RNNs at all, which was the question. You've chosen to cherry-pick one slice of the problem and declare victory.

> It's not a strawman.

You don't seem to understand what a strawman is. Strawman:

> an intentionally misrepresented proposition that is set up because it is easier to defeat than an opponent's real argument.

I was not making this argument. You were making this argument. QED, this is a strawman.

2