brates09 t1_ix4xxas wrote

Autoregressive next-token prediction is incredibly compute-efficient to train. With causally masked attention, a single forward pass yields a valid prediction for every token in a sequence, so every position contributes a training signal. I imagine this is a large part of why AR models (e.g. GPT) won out in popularity over masked-token-prediction models (e.g. BERT), which only get a loss from the small fraction of tokens that are masked out.
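
Here's a minimal sketch of that single-pass training in PyTorch (toy sizes, with a bare `nn.MultiheadAttention` standing in for a full transformer block, so it's just illustrating the masking trick, not the actual GPT setup):

```python
import torch
import torch.nn.functional as F

vocab_size, d_model, seq_len, batch = 100, 32, 16, 4

embed = torch.nn.Embedding(vocab_size, d_model)
attn = torch.nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
lm_head = torch.nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (batch, seq_len))
x = embed(tokens)

# Causal mask: True entries are blocked, so position i only attends to positions <= i.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

h, _ = attn(x, x, x, attn_mask=causal_mask)
logits = lm_head(h)  # (batch, seq_len, vocab_size): a prediction at every position

# Position t predicts token t+1, so one forward pass gives seq_len - 1
# training targets per sequence instead of just one.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
print(loss.item())
```

Contrast with BERT-style masked-token training, where only the masked positions (typically ~15% of the sequence) produce a loss term per forward pass.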

5