
JackandFred t1_iuzb389 wrote

I feel like if you're going to include transformers, you should include the "Attention Is All You Need" paper.

7

PassionatePossum t1_iv05451 wrote

I would only include it as a historical reference. It is certainly not a "must read" paper. It is written so poorly that you are better off just looking at the code.

1

flaghacker_ t1_iv5jf05 wrote

What's wrong with it? They explain all the components of their model in enough detail (in particular the multi-head attention), provide intuition behind certain decisions, include clear results, and have nice figures... What could have been improved about it?

2
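For readers following the thread, here is a minimal sketch of the multi-head attention mechanism the commenters are referring to, written in plain NumPy. The weight names (`W_q`, `W_k`, `W_v`, `W_o`) and the shape conventions are assumptions for illustration, not the paper's reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    # x: (batch, seq_len, d_model); all weight matrices: (d_model, d_model)
    batch, seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (batch, seq, d_model) -> (batch * heads, seq, d_head)
        t = t.reshape(batch, seq_len, num_heads, d_head)
        return t.transpose(0, 2, 1, 3).reshape(batch * num_heads, seq_len, d_head)

    Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)
    heads = scaled_dot_product_attention(Q, K, V)

    # Concatenate the heads back to (batch, seq, d_model) and apply the output projection
    heads = heads.reshape(batch, num_heads, seq_len, d_head).transpose(0, 2, 1, 3)
    return heads.reshape(batch, seq_len, d_model) @ W_o

# Example usage with random weights (shapes only, no training)
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 64))
W_q, W_k, W_v, W_o = (rng.normal(size=(64, 64)) * 0.1 for _ in range(4))
out = multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads=8)
print(out.shape)  # (2, 5, 64)
```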