brates09 t1_ix4xxas wrote
Autoregressive next-token prediction is incredibly compute-efficient. By using causally masked attention, you can make a valid prediction for every token in a sequence with a single forward pass during training. I imagine this is a large part of why AR models (e.g. GPT) won out in popularity over masked token prediction models (e.g. BERT).
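A minimal sketch of the point about training signal, using a toy single attention layer standing in for a full transformer (the model, sizes, and variable names here are illustrative, not anyone's actual code): with a causal mask, one forward pass produces a next-token prediction at every position, so the loss covers the whole sequence at once.

```python
import torch
import torch.nn.functional as F

vocab_size, d_model, seq_len = 100, 32, 8
tokens = torch.randint(0, vocab_size, (1, seq_len))  # (batch, seq)

embed = torch.nn.Embedding(vocab_size, d_model)
attn = torch.nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
lm_head = torch.nn.Linear(d_model, vocab_size)

x = embed(tokens)
# Causal mask: position i may only attend to positions <= i
# (True entries are blocked).
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
h, _ = attn(x, x, x, attn_mask=causal_mask)
logits = lm_head(h)  # (1, seq, vocab): a prediction at every position

# Each position is trained to predict the *next* token, so a single
# forward pass yields seq_len - 1 loss terms.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab_size),
                       tokens[:, 1:].reshape(-1))
```

A BERT-style masked LM, by contrast, only gets a loss term at the (typically ~15%) masked positions per pass.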
Nameless1995 t1_ix6gugr wrote
The jury is probably still out on that. Initially, IIRC, BERT-style training was posited to be better than GPT-style training because of its bidirectional modeling, and this was shown empirically too. But GPT-style models won out by scaling up much further. There may be some truth to what you said: because GPT gets much more training signal per iteration, it may ultimately give better results once scaled up. But I am not entirely sure why BERT-style models were not scaled up as much (did people never try, out of a priori hesitancy, or did they try, not get good results, and not report them?). Another issue is the rise of prompting, which is much more in tune with autoregressive unidirectional training and falls out naturally and easily from GPT-style training.
However, T5-style training is closer to BERT to an extent (T5 has a bidirectional encoder and a causal decoder, but the decoder only predicts some masked spans). Recently, this paper showed that you can get on-par performance with a fully causal decoder by using a scaled-up T5-style model through a trick: https://arxiv.org/abs/2209.14500. Again, this may not get much practical purchase given how expensive SAP (the trick in the paper) can be, but the question is open -- and perhaps there is a better middle way somewhere.
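For reference, a rough sketch of what "the decoder only predicts some masked spans" looks like in T5-style span corruption (the sentence and sentinel token names follow the standard T5 example; this is illustrative, not the SAP paper's code):

```python
# Original sequence
tokens = ["Thank", "you", "for", "inviting", "me", "to", "your", "party", "last", "week"]

# Suppose the spans ("for", "inviting") and ("last",) are sampled for corruption.
# Each corrupted span is replaced by a single sentinel token in the encoder input.
encoder_input = ["Thank", "you", "<X>", "me", "to", "your", "party", "<Y>", "week"]

# The decoder only reproduces the dropped spans, delimited by the sentinels,
# rather than the full sequence as a causal LM would.
decoder_target = ["<X>", "for", "inviting", "<Y>", "last", "<Z>"]
```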
blazejd OP t1_ix7jpz2 wrote
This is interesting, but I was thinking a bit more high-level. In essence, BERT and GPT are both self-supervised language models trained on passive data with a similar objective.