Hands-on with Hugging Face’s new tokenizers library

Start exploring and experimenting with NLP using Tokenizers

Omar M’Haimdat
Heartbeat


What a year for natural language processing! We've seen great improvements in accuracy and training speed, and, more importantly, large networks are now more accessible thanks to Hugging Face and their wonderful Transformers library, which provides a high-level API to work with BERT, GPT, and many more language model variants.

Julien Chaumond (left) and Clément Delangue (right), co-founders of Hugging Face

Recently, Hugging Face released a new library called Tokenizers, which is primarily maintained by Anthony Moi, Pierric Cistac, and Evan Pete Walsh.

With the advent of attention-based networks like BERT and GPT, and the WordPiece subword tokenizer popularized by Wu et al. (2016), we saw a small revolution in the world of NLP that changed how words are represented and opened the door to more advanced neural networks.

Tokenization and normalization

Before we can start modeling or using any advanced neural network, we need to go through two important steps — tokenization and normalization.

A tokenizer is a tool that performs segmentation: it cuts text into pieces called tokens. Each token corresponds to a linguistically meaningful, easily manipulated unit. Tokens are language dependent and are part of the process of normalizing the input text so that it can be manipulated and its meaning extracted later in the training process.

When you work with a dataset, you're never 100% sure that the text is clean and normalized. Using a good tokenizer ensures that the text that gets fed to the network is clean and consistent.
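As a rough mental model, here's a minimal sketch (plain Python, not the library's actual implementation) of the kind of normalization a tokenizer applies before splitting: Unicode normalization, lowercasing, and whitespace cleanup.

import unicodedata

def normalize(text):
    # Unicode normalization so visually identical characters share one representation
    text = unicodedata.normalize("NFKC", text)
    # Lowercase and collapse runs of whitespace into single spaces
    return " ".join(text.lower().split())

print(normalize("We  need small Heroes\u00a0so that big heroes can shine"))
# we need small heroes so that big heroes can shine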

In some cases, it's too difficult to capture meaningful units with just a handful of rules (vocabulary, in particular), so a learning approach can be used instead: an annotated corpus makes it possible to learn the relevant tokens and generalize them to new, unseen text.

Thus, using tokenizers pre-trained on large datasets makes it possible to avoid incorrectly splitting compound and rare words such as "bow tie" or "father-in-law".
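To see why this matters, compare a naive rule-based split with what a vocabulary-aware tokenizer tries to preserve. This is just an illustrative sketch, not the library's behavior:

import re

text = "My father-in-law wore a bow tie"

# A naive rule: keep only runs of word characters
print(re.findall(r"\w+", text))
# ['My', 'father', 'in', 'law', 'wore', 'a', 'bow', 'tie']
# The hyphens are gone, so "father-in-law" can no longer be recovered as one unit.
# A tokenizer trained on a large corpus keeps track of offsets into the original
# string, so compounds and rare words can still be reconstructed or split sensibly.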

When building neural networks, you have to choose what kind of data the network will be trained on. Most of the time, existing tokenizers will do the job, but in some cases you want to have the freedom to create your own tokenizer from your own dataset, or maybe your own technique of splitting words. That’s where Hugging Face’s new tokenizer library comes in handy.


Hugging Face’s Tokenizer Library

The Hugging Face team chose to write the library in pure Rust. That's a smart move, since Python isn't known for its speed, and a surprisingly bold one, given that they could have gone with C or C++.

But since the ML community loves Python, and it's still the king of the field, they provide Python bindings to the Rust core.

Setting up the environment

  • Create a directory and cd into it:
mkdir happyTokenizing
cd happyTokenizing
  • Create a virtual environment:
python -m venv env
source env/bin/activate  # activate the environment
  • Install Rust [macOS and Linux]:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"
  • Install the package using pip:
pip install tokenizers

After installing the package, run pip freeze. You should see tokenizers listed among the installed packages, along with its version.

We’re all set now.

Start tokenizing

Let's start by importing the main tokenizers that are already implemented in the package and define a sentence that we'll use as our main test input:

  • Import functions:
from tokenizers import ByteLevelBPETokenizer, BPETokenizer, SentencePieceBPETokenizer, BertWordPieceTokenizer
  • Instantiate a sentence:
sentence = "We need small heroes so that big heroes can shine"
  • Download the various vocabularies into the current directory using the terminal:
# BERT base uncased vocabulary
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
# BERT base cased vocabulary
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt
# BERT large cased vocabulary
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt
# BERT large uncased vocabulary
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt
# GPT-2 vocabulary and merges
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
# GPT-2 medium vocabulary
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-vocab.json
# GPT-2 large vocabulary
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-vocab.json
  • Start tokenizing:
from tokenizers import ByteLevelBPETokenizer, BPETokenizer, SentencePieceBPETokenizer, BertWordPieceTokenizer

# My arbitrary sentence
sentence = "We need small heroes so that big heroes can shine"
# Bert vocabularies
bertBaseCased = "bert-base-cased-vocab.txt"
bertBaseUncased = "bert-base-uncased-vocab.txt"
bertLargeCased = "bert-large-cased-vocab.txt"
bertLargeUncased = "bert-large-uncased-vocab.txt"
# GPT-2 vocabularies
gpt2Vocab = "gpt2-vocab.json"
gpt2LargeVocab = "gpt2-large-vocab.json"
# Instantiate a BERT WordPiece tokenizer
WordPiece = BertWordPieceTokenizer(bertLargeUncased)
WordPieceEncoder = WordPiece.encode(sentence)
# Print the ids, tokens and offsets
print(WordPieceEncoder.ids)
print(WordPieceEncoder.tokens)
print(WordPieceEncoder.offsets)

Running the script prints the ids, tokens, and offsets. Each word is tokenized and gets a unique id and an offset:

  • id: a unique identifier for each token in the vocabulary
  • offset: the starting and ending character positions of the token in the original sentence
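The GPT-2 files downloaded earlier can be exercised the same way. Here's a hedged sketch with the byte-level BPE tokenizer, assuming gpt2-vocab.json and gpt2-merges.txt are in the working directory (the constructor's keyword names vary between library versions, so the paths are passed positionally):

from tokenizers import ByteLevelBPETokenizer

sentence = "We need small heroes so that big heroes can shine"

# Byte-level BPE needs both the vocabulary file and the merges file
byteLevelBPE = ByteLevelBPETokenizer("gpt2-vocab.json", "gpt2-merges.txt")
byteLevelBPEEncoder = byteLevelBPE.encode(sentence)

print(byteLevelBPEEncoder.ids)
print(byteLevelBPEEncoder.tokens)
print(byteLevelBPEEncoder.offsets)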

You can also train your own tokenizer, and even create your own from scratch; that's the level of versatility the package provides to researchers and engineers.
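As a rough sketch of that training workflow (the corpus file name is a placeholder, and the exact saving call differs between library versions), you can feed a byte-level BPE tokenizer one or more plain-text files:

from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer from scratch on your own corpus.
# "my_corpus.txt" is a placeholder for any plain-text file you have.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["my_corpus.txt"],
    vocab_size=30000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Depending on your version, tokenizer.save_model(".") or tokenizer.save(...)
# persists the learned vocabulary and merges for later reuse.
print(tokenizer.encode("We need small heroes so that big heroes can shine").tokens)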

Conclusion

Vive Hugging Face 🤗 and NLP!

In the last two years, I've been more focused on image processing and convolutional neural networks (CNNs). But I think 2020 will be the year of democratization of powerful NLP tools like Hugging Face's Tokenizers library and many more.

The Tokenizers library gives us a starting point to experiment with and explore new techniques in word embeddings and tokenization. Perhaps something I haven't had the chance to emphasize enough is how the Hugging Face team has managed to create something both simple and incredibly fast.

To illustrate this, Steven van de Graaf has done a great job of comparing performance metrics.
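You can also get a feel for that speed yourself with a quick, back-of-the-envelope timing loop (not a rigorous benchmark; numbers will vary by machine and library version):

import time

from tokenizers import BertWordPieceTokenizer

sentence = "We need small heroes so that big heroes can shine"
tokenizer = BertWordPieceTokenizer("bert-large-uncased-vocab.txt")

start = time.perf_counter()
for _ in range(10000):
    tokenizer.encode(sentence)
elapsed = time.perf_counter() - start

print("Encoded the sentence 10,000 times in {:.2f}s".format(elapsed))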

Now that you can easily use either existing tokenizers or create your own, you can start exploring transformers and build your own network—maybe even the next BERT or GPT-2.

You can also perform various tasks using massive networks implemented in the Transformers library, such as:

  • Text generation
  • Classification (topic labeling, sentiment analysis, etc.; a short example follows this list)
  • Predicting whether one sentence is a continuation of another
  • Question answering
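For instance, a sentiment classifier takes only a few lines with the Transformers pipeline API (the default model is downloaded on first use, so this needs the transformers package installed and an internet connection):

from transformers import pipeline

# Load a default pre-trained sentiment-analysis model
classifier = pipeline("sentiment-analysis")

result = classifier("We need small heroes so that big heroes can shine")
print(result)  # e.g. a list with a label ("POSITIVE"/"NEGATIVE") and a score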

Thank you for reading this article. If you have any questions, don’t hesitate to send me an email at omarmhaimdat@gmail.com.

