
hysse t1_j2qqwsf wrote

Which tool is best for training a tokenizer? The HuggingFace library seems to be the simplest one, but is it the most computationally efficient? And if so, what are torchtext, NLTK, etc. useful for?


jakderrida t1_j2zxul1 wrote

The Hugging Face library is a popular tool for training a tokenizer and is relatively easy to use. It integrates with the Transformers library, which is built on top of PyTorch, and together they provide a wide range of pre-trained models and tools for natural language processing tasks.
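For illustration, here is a minimal sketch of training a BPE tokenizer with the Hugging Face tokenizers library; corpus.txt and the vocabulary size are placeholder assumptions, not anything specified in the thread:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build a byte-pair-encoding tokenizer; the heavy lifting happens in the Rust backend.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=30_000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # corpus.txt is a hypothetical plain-text file
tokenizer.save("tokenizer.json")
```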

In terms of efficiency, the Hugging Face library should be sufficient for most use cases. However, if you need to train a very large model or want to optimize the training pipeline for maximum efficiency, you may want to consider using PyTorch or TensorFlow directly.

Other natural language processing libraries like NLTK (Natural Language Toolkit) and torchtext are also useful for a variety of tasks, such as text preprocessing, part-of-speech tagging, and language modeling. NLTK is a general-purpose library that provides a wide range of tools for working with human language data, while torchtext is a PyTorch library that provides tools for preprocessing and working with text data in PyTorch.
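As a rough illustration of the difference in scope, NLTK handles classic preprocessing tasks directly in Python (the exact resource names to download may vary by NLTK version):

```python
import nltk
from nltk.tokenize import word_tokenize

# One-time downloads of the tokenizer model and the POS tagger data.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = word_tokenize("The quick brown fox jumps over the lazy dog.")
print(nltk.pos_tag(tokens))  # e.g. [('The', 'DT'), ('quick', 'JJ'), ...]
```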


hysse t1_j30tpx8 wrote

Thanks for the answer. I need to train a relatively large model, so I need an efficient tokenizer.

I don't see how a tokenizer written in PyTorch (or TensorFlow) could be faster than a HuggingFace tokenizer, for example. HuggingFace has a Rust backend that makes its tokenizers faster, and I'd guess torchtext has an optimized backend too.

Given that the tokenizer runs on the CPU rather than the GPU, how could it be faster if I wrote it in PyTorch (or even in plain Python)?


jakderrida t1_j30u1jq wrote

To be clear, ChatGPT provided the answer.

Here's how ChatGPT answered your next question...

It is generally true that a Rust backend can provide better performance than a pure Python implementation, especially for a high-throughput task like tokenization. This is because Rust is a systems programming language that is designed to be fast and efficient. It can also make use of low-level optimization techniques that are not possible in Python, such as manual memory management.

That being said, it is still possible to write a fast and efficient tokenizer in Python. There are several ways to do this:

  1. Use optimized libraries: There are several Python libraries that are optimized for performance, such as NumPy, and alternative interpreters such as PyPy. Using these can help improve the performance of your tokenizer.

  2. Use Cython or Numba: Cython and Numba are tools that can be used to optimize Python code by compiling it to native machine code. This can provide a significant performance boost over pure Python code.

  3. Use multithreading or multiprocessing: If your tokenizer is CPU-bound, you can use multiprocessing to parallelize the work across multiple CPU cores (multithreading in CPython is limited by the GIL for CPU-bound work); see the sketch after this list.

  4. Use a just-in-time (JIT) compiler: JIT compilers, such as Numba, can analyze your Python code at runtime and generate optimized machine code on the fly. This can provide a significant performance boost over pure Python code.

Overall, there are many ways to optimize the performance of a Python-based tokenizer. It is worth considering using a combination of these techniques to achieve the best results.
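As a sketch of option 3 only (not a drop-in replacement for a trained subword tokenizer), a simple regex tokenizer can be parallelized across CPU cores with multiprocessing:

```python
import re
from multiprocessing import Pool

TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    # Naive word/punctuation tokenizer used purely for illustration.
    return TOKEN_RE.findall(text.lower())

def tokenize_corpus(texts, workers=4):
    # Each worker process tokenizes a chunk of documents, sidestepping the GIL.
    with Pool(processes=workers) as pool:
        return pool.map(tokenize, texts, chunksize=1000)

if __name__ == "__main__":
    docs = ["The quick brown fox.", "Hello, world!"] * 10_000
    print(tokenize_corpus(docs)[0])
```

Whether this beats a Rust-backed tokenizer depends on the corpus and the number of cores; the point is just that CPU-bound Python work can be spread across processes.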


hysse t1_j30ub8q wrote

Haha I knew it. Unfortunately, I don't think ChatGPT can give a good answer to that question...


jakderrida t1_j30uieu wrote

Yeah, when I read it, I knew right away that I'd seem like a bigger imbecile if I let you think it was me. The first one was impressive, though.
