juliensalinas OP t1_jcky8ok wrote

You're welcome.

A token is a unique entity that can be a small word, part of a word, or a punctuation mark.
On average, 1 token is made up of 4 characters, and 100 tokens are roughly equivalent to 75 words.
Natural Language Processing models need to turn your text into tokens in order to process it.
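
For example, here is a quick way to see how a piece of text gets split into tokens. This is only a sketch using the Hugging Face tokenizer for GPT-J; other models use different tokenizers, so the exact counts will vary:

```python
# A minimal sketch: counting tokens with the GPT-J tokenizer
# (assumes the transformers library is installed).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

text = "Natural Language Processing models need to turn your text into tokens."
tokens = tokenizer.tokenize(text)    # list of token strings
token_ids = tokenizer.encode(text)   # list of integer IDs fed to the model

print(len(tokens), tokens)
```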

juliensalinas OP t1_jckwtdj wrote

No, if you want such a model to "remember" previous prompts, you will need to prepend them to each request you make.

The output can be up to 2048 tokens, but on a Tesla T4 you might not have enough VRAM, so you may be limited to around 1024 tokens because the GPU will run out of memory above that.
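
As a rough sketch of what "prepending previous prompts" can look like with Hugging Face Transformers (the history handling and the 256-token default here are illustrative assumptions, not a fixed API):

```python
# A minimal sketch: making GPT-J "remember" previous prompts by prepending
# the conversation so far to every new request.
import torch
from transformers import pipeline

# The fp16 revision keeps memory usage low enough for a 16GB GPU.
generator = pipeline(
    "text-generation",
    model="EleutherAI/gpt-j-6B",
    revision="float16",
    torch_dtype=torch.float16,
    device=0,
)

history = ""  # accumulated previous prompts and completions

def ask(prompt: str, max_new_tokens: int = 256) -> str:
    global history
    full_prompt = (history + "\n" + prompt) if history else prompt
    generated = generator(full_prompt, max_new_tokens=max_new_tokens)[0]["generated_text"]
    completion = generated[len(full_prompt):]
    history = full_prompt + completion  # carried into the next request
    return completion
```

Keep in mind that the history itself counts toward the context window, so long conversations eventually have to be truncated or summarized.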

juliensalinas OP t1_jcktk8o wrote

That may be something I'll focus on in the future. For the moment, I find this fp16 version well suited to small budgets: it runs on a 16GB GPU, while the native fp32 version of GPT-J requires at least 24GB of VRAM.

Also, with the bitsandbytes integration in HF Transformers you can use the model in 8 bits: https://huggingface.co/blog/hf-bitsandbytes-integration
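
As a rough sketch (assuming a recent version of Transformers plus bitsandbytes and accelerate for the 8-bit path), loading the fp16 weights or the 8-bit version might look like this; in practice you would load only one of the two:

```python
# A minimal sketch: loading GPT-J in fp16 on a 16GB GPU, or in 8 bits
# via the bitsandbytes integration in Transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

# fp16: roughly halves the memory footprint compared to native fp32.
model_fp16 = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    revision="float16",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to("cuda")

# 8-bit: quantizes the weights with bitsandbytes for an even smaller footprint.
model_int8 = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    device_map="auto",
    load_in_8bit=True,
)
```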

juliensalinas t1_ittw69q wrote

Congrats and thanks a lot u/pommedeterresautee for this amazing project. As usual, your in-depth explanations of low-level machine learning are very insightful.

Transformer Deploy was already very exciting, and this new project seems even more promising!

Can't wait to try it for real and see if we can use it behind NLP Cloud somehow.
