Submitted by soraki_soladead t3_zmoxp7 in MachineLearning
soraki_soladead OP t1_j0cmgzs wrote
Reply to comment by Rabrg in [D] Trying to find paper about n-grams in early transformer layers by soraki_soladead
Perfect. Thank you! That explains why I couldn't find it.
EDIT: Spoke too soon. I think this paper covers some of the same ideas, but it isn't the one I'm remembering. It doesn't describe a method for simplifying the earlier layers of the transformer by exploiting the fact that they primarily learn bigrams. I could have sworn I read about it in an arXiv or OpenReview paper.
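For what it's worth, the idea I remember could be sketched roughly like this: replace the earliest layers with a cheap bigram lookup feeding the remaining layers. This is just my rough reconstruction, not the paper's actual method; the class name, the hashed-bucket trick, and all sizes here are my own illustrative choices.

```python
# Hypothetical sketch (not from the paper): approximate early transformer
# layers with a hashed bigram embedding added to the token embedding.
import torch
import torch.nn as nn

class BigramFrontEnd(nn.Module):
    def __init__(self, vocab_size, d_model, n_bigram_buckets=4096):
        super().__init__()
        self.vocab_size = vocab_size
        self.n_buckets = n_bigram_buckets
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # vocab_size**2 entries would be infeasible, so bigrams are
        # hashed into a fixed number of buckets (an assumption on my part).
        self.bigram_emb = nn.Embedding(n_bigram_buckets, d_model)

    def forward(self, ids):  # ids: (batch, seq) of token ids
        h = self.tok_emb(ids)
        # bigram id at position t is built from tokens (t-1, t)
        prev = torch.roll(ids, shifts=1, dims=1)
        bigram_ids = (prev * self.vocab_size + ids) % self.n_buckets
        bg = self.bigram_emb(bigram_ids)
        # position 0 has no previous token, so mask out its bigram term
        mask = (torch.arange(ids.size(1), device=ids.device) > 0)
        bg = bg * mask.view(1, -1, 1)
        return h + bg  # feed this into the remaining transformer layers

model = BigramFrontEnd(vocab_size=100, d_model=32)
out = model(torch.randint(0, 100, (2, 10)))
```

The output here would replace the hidden states that the dropped early layers would otherwise have produced; the later layers train on top of it as usual.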