
Nameless1995 t1_irkutz6 wrote

> GPT-3 has a prompt limit of about ~2048 "tokens", which corresponds to about 4 characters in text.

What do you mean by 2048 tokens corresponding to 4 characters? The tokens are at the subword level; they can be much bigger than 4 characters.
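
For example, assuming the Hugging Face `transformers` package, you can see the subword splitting and the token-to-character ratio directly with the GPT-2 tokenizer (GPT-3 uses essentially the same BPE vocabulary):

```python
# Quick illustration of subword tokenization (Hugging Face GPT-2 BPE tokenizer).
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

text = "Tokenization splits text into subword units."
ids = tokenizer.encode(text)
pieces = tokenizer.convert_ids_to_tokens(ids)

print(len(text), "characters ->", len(ids), "tokens")
print(pieces)  # subword pieces, e.g. ['Token', 'ization', ' splits', ' text', ...]
```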

> this limitation comes from amount of the input neurons

The 2048 limit comes from using trainable positional embeddings. Otherwise, there is a lookup table that maps each possible token to a corresponding embedding, and the same input neurons are reused for every embedding. Without trainable positional embeddings (for example, if something like relative embeddings is used, or trigonometric functions as in the original Transformer), there is no official "prompt limit" (and some GPT-3-like models probably don't have one). The only limit on how big the input can be in those cases is your GPU (or CPU, depending on what you are using) memory.
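
To make that concrete, here is a minimal PyTorch sketch (the names and sizes are mine, just for illustration): a trainable positional table has a fixed number of rows, so positions past it simply have no learned embedding, while a sinusoidal encoding can be generated for any length.

```python
# Why a trainable positional table caps the context length but sinusoidal encodings don't.
import torch
import torch.nn as nn

VOCAB_SIZE, D_MODEL, MAX_LEN = 50257, 768, 2048

# Token lookup table: the same weights serve every position,
# so it imposes no sequence-length limit by itself.
token_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)

# Trainable positional embeddings: one learned row per position.
# Positions past MAX_LEN have no row -> the "official" 2048-token limit.
pos_emb = nn.Embedding(MAX_LEN, D_MODEL)

def embed_learned(token_ids):  # token_ids: (batch, seq)
    seq_len = token_ids.size(1)
    assert seq_len <= MAX_LEN, "no learned embedding exists past MAX_LEN"
    positions = torch.arange(seq_len, device=token_ids.device)
    return token_emb(token_ids) + pos_emb(positions)

def sinusoidal(seq_len, d_model=D_MODEL):
    # Fixed trigonometric encoding (original Transformer style):
    # can be generated for any length, so only memory limits the prompt size.
    pos = torch.arange(seq_len).unsqueeze(1).float()
    i = torch.arange(0, d_model, 2).float()
    angles = pos / torch.pow(torch.tensor(10000.0), i / d_model)
    enc = torch.zeros(seq_len, d_model)
    enc[:, 0::2] = torch.sin(angles)
    enc[:, 1::2] = torch.cos(angles)
    return enc

def embed_sinusoidal(token_ids):
    return token_emb(token_ids) + sinusoidal(token_ids.size(1)).to(token_ids.device)
```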

But having arbitrarily long input is different from continuous training. With simple continuous training you would only be changing the values of existing weights (and you can already do that, but you will run into the issues described by /u/suflaj). However, with an arbitrarily long input prompt, you would be adding more and more tokens to attend to, increasing the computational complexity massively. Even without an official prompt limit, you will most likely run into a practical limit at around 2K tokens anyway.

One way to resolve that would be to use the model semi-recurrently, with a few hidden states compressing an arbitrarily long past, so that attention remains bounded in complexity. But that would also mean a lot of past information gets lost, since there is a limit to how much you can compress into a bounded budget. That said, you could probably extend this paradigm by building a massive dictionary of memories while compressing past input, and then using sparse & fast top-k retrieval to keep the computation bounded. Someone will probably build something like that someday.
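
The semi-recurrent part might look roughly like this (a rough PyTorch sketch; the module names, memory size, and mean-pooling compression are just my illustration, loosely in the spirit of Transformer-XL / Compressive Transformer-style memories):

```python
# Process a long input chunk by chunk, compressing the past into a bounded memory.
import torch
import torch.nn as nn

D_MODEL, CHUNK, MEM_SLOTS = 768, 512, 64

attn = nn.MultiheadAttention(D_MODEL, num_heads=12, batch_first=True)

def compress(hidden):
    # Squeeze a processed chunk (batch, CHUNK, d) into MEM_SLOTS memory slots
    # by average-pooling groups of timesteps (detail is inevitably lost).
    b, t, d = hidden.shape
    return hidden.view(b, MEM_SLOTS, t // MEM_SLOTS, d).mean(dim=2)

def run_long_sequence(chunks):
    """chunks: list of (batch, CHUNK, D_MODEL) tensors covering a long input."""
    memory, out = None, None
    for x in chunks:
        # Attend over [compressed past, current chunk]; cost per step stays
        # O(CHUNK * (CHUNK + MEM_SLOTS)) instead of growing with the full history.
        context = x if memory is None else torch.cat([memory, x], dim=1)
        out, _ = attn(x, context, context)
        # Fold this chunk into the bounded memory, keeping only the latest slots.
        new_mem = compress(out.detach())
        memory = new_mem if memory is None else torch.cat([memory, new_mem], dim=1)[:, -MEM_SLOTS:]
    return out
```

A retrieval-augmented version would instead keep all the compressed slots in a big external store and pull back only the top-k most relevant ones per step, which keeps the attention cost bounded without throwing the old memories away.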
