
ZestyData t1_jeagwa8 wrote

Thank you for repeating half of what I said back to me; much like ChatGPT, you catch on quickly to new information:

So, let's be clear here then. Contrary to your incorrect first comment: Google Translate is an LLM, it is autoregressive, and it is pretrained, at least by the definition of pre-training given in the GPT paper, which was the parallel I drew in my original comment for OP, who came into this thread knowing only the latest GPT-3+ and ChatGPT products.
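To make that parallel concrete, here's a minimal sketch of the shared autoregressive pretraining objective, next-token prediction with cross-entropy, in PyTorch. The `model` here is a placeholder for any decoder that maps token IDs to vocabulary logits; this illustrates the objective itself, not anyone's production code:

```python
import torch.nn.functional as F

def autoregressive_lm_loss(model, token_ids):
    """Next-token prediction: the pretraining objective from the GPT paper.

    token_ids: LongTensor of shape (batch, seq_len).
    At every position t the model sees tokens 0..t-1 and is trained
    to assign high probability to token t.
    """
    inputs = token_ids[:, :-1]    # x_0 .. x_{T-1}
    targets = token_ids[:, 1:]    # x_1 .. x_T (shifted left by one)
    logits = model(inputs)        # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```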


>It's funny how you mention unrelated stuff, like RLHF

I did so because I had naively assumed you were also a newcomer to the field who knew nothing outside of ChatGPT, given how severely wrong your first comment was. I'll grant that it wasn't related, except to extend an olive branch and a reasonable exit plan if that were the case. Alas.


>LLMs tend to be >>1B parameter models

Again, no. ELMo was ~94 million parameters, GPT-1 was ~117 million, GPT-2 was 1.5 billion, and BERT-large is ~340 million. These have all been called Large Language Models for years. There is no hard definition of what counts as "large"; 2018's "large" is approaching consumer-hardware scale today. Google Translate (and Google Search) are among the most widely used LLMs actually in production.
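Don't take my word on the sizes; counting parameters is a few lines with Hugging Face's `transformers` (these are the standard public checkpoints, and exact totals vary slightly by variant, pooling heads, and weight tying):

```python
from transformers import AutoModel

# Standard public checkpoints; downloads weights on first run.
for name in ["bert-base-uncased", "bert-large-uncased", "gpt2-xl"]:
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```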

Man. Why do you keep talking about things that you don't understand, even when corrected?


>Lastly, modelling p(y|x) is significantly easier and thus less general than modelling p(x).

Sure! It is easier! But that's not what you said. You initially brought up p(y|x) as a justification that translation isn't pretrained, and those are two unrelated concepts. The ultimate modelling goal is p(y|x), but both GPT (Generative Pre-Training) and Google Translate pretrain their decoders to predict p(x | context), just like any hot new LLM of today; hence my correction. The application to the ultimate p(y|x) goal is separate from the pretraining of the decoder.
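To spell out why those are separable concepts: the decoder's training signal is the same next-token cross-entropy either way; the only difference is whether it also attends to an encoder's representation of a source sentence. A minimal sketch, where the `decoder(inputs, context=...)` interface is illustrative rather than any real system's API:

```python
import torch.nn.functional as F

def decoder_lm_loss(decoder, token_ids, encoder_context=None):
    """One objective, two uses.

    encoder_context=None -> plain LM: decoder models p(x_t | x_<t),
                            i.e. GPT-style generative pretraining.
    encoder_context=enc  -> conditional LM: decoder models
                            p(y_t | y_<t, x), i.e. the translation case.
    The autoregressive factorisation and the loss are identical.
    """
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = decoder(inputs, context=encoder_context)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```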
