Comments


soraki_soladead OP t1_j0cmgzs wrote

Perfect. Thank you! That explains why I couldn't find it.

EDIT: Spoke too soon. I think this covers some of the same ideas, but it isn't the one I'm remembering. It doesn't describe a method for simplifying the earlier layers of the transformer by exploiting the fact that they primarily learn bigrams. I could have sworn I read about it in an arXiv or OpenReview paper.


2600_yay t1_j0fc7j4 wrote

"Are Neighbors Enough"'s authors swap out self-attention in a Transformer for a multi-head neural n-gram model? Perhaps that's what you're looking for?

https://arxiv.org/abs/2207.13354
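
For anyone curious what "replacing self-attention with a multi-head neural n-gram model" looks like in practice, here is a minimal PyTorch sketch of the general idea: each position is mixed with a small causal window of preceding tokens using learned per-head weights. The class name, window size, and mixing scheme are my own assumptions for illustration, not the paper's exact module.

```python
# Hedged sketch: a "multi-head neural n-gram"-style mixer that combines each
# position with its n-1 preceding neighbors instead of using full self-attention.
# Illustrative only; window size, head count, and weighting are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadNgramMixer(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 8, ngram: int = 3):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.ngram = n_heads, ngram
        self.head_dim = d_model // n_heads
        # One learned weight per head per offset within the local window.
        self.mix_logits = nn.Parameter(torch.zeros(n_heads, ngram))
        self.proj_in = nn.Linear(d_model, d_model)
        self.proj_out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, t, _ = x.shape
        h = self.proj_in(x).view(b, t, self.n_heads, self.head_dim)
        # Left-pad the time dimension so each position sees only itself
        # and earlier tokens (causal window).
        h = F.pad(h, (0, 0, 0, 0, self.ngram - 1, 0))
        # Stack the n most recent positions for every timestep:
        # (batch, seq_len, ngram, heads, head_dim)
        windows = torch.stack([h[:, i:i + t] for i in range(self.ngram)], dim=2)
        weights = F.softmax(self.mix_logits, dim=-1)  # (heads, ngram)
        mixed = torch.einsum("btnhd,hn->bthd", windows, weights)
        return self.proj_out(mixed.reshape(b, t, -1))

# Usage: drop-in replacement for the attention sublayer in a Transformer block.
layer = MultiHeadNgramMixer(d_model=512, n_heads=8, ngram=3)
out = layer(torch.randn(2, 16, 512))  # -> (2, 16, 512)
```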
