visarga OP t1_j18n1i1 wrote

An interesting fact: the current dataset size is 1T words. All the skills of language models come from this one TeraWord. We can get to 10 TWords once we finish scraping everything; after that, it depends on finding other sources. Speech data is 10,000 TWords, though.

4

Imaginary_Ad307 t1_j196irp wrote

LLMs are inefficient; better algorithms are going to be developed.

2