Overview

Tokenisation is the process of breaking raw text into discrete units called tokens before a language model processes it. Tokens are the atomic inputs to a model’s embedding layer: each token maps to a unique integer ID, and that ID selects one row of the embedding matrix. Tokens can correspond to whole words, sub-words, individual characters, or byte sequences, depending on the algorithm used. The vocabulary, the complete set of valid tokens, is fixed before the model is trained.
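The token-to-ID-to-embedding path can be sketched in a few lines. This is a toy illustration only: the vocabulary, the `<unk>` fallback, the embedding width, and the random (untrained) matrix are all hypothetical choices made for the example.

```python
import random

# Hypothetical toy vocabulary: token string -> integer ID.
vocab = {"<unk>": 0, "token": 1, "isation": 2, " is": 3, " fun": 4}

EMBED_DIM = 4  # illustrative embedding width
random.seed(0)

# One row per vocabulary entry; in a real model these rows are learned.
embedding_matrix = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(len(vocab))
]


def encode(tokens):
    """Map token strings to IDs, falling back to <unk> for unknown tokens."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]


def embed(token_ids):
    """Look up one embedding row per token ID."""
    return [embedding_matrix[i] for i in token_ids]


ids = encode(["token", "isation", " is", " fun", " today"])
vectors = embed(ids)
print(ids)  # [1, 2, 3, 4, 0]
```

Because the vocabulary is fixed, the out-of-vocabulary string " today" can only be represented by the fallback ID here. Byte-level schemes avoid needing such a fallback, since any text can be spelled out in bytes.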

Why not characters or words?

Common sub-word algorithms

Byte Pair Encoding (BPE)

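Start from a base vocabulary of individual bytes or characters, then repeatedly merge the most frequent adjacent pair into a new token until a target vocabulary size is reached. Used by GPT-2, GPT-3, GPT-4. GPT-2 has a vocabulary of 50,257 tokens: 256 byte-level base tokens, 50,000 learned merges, and one end-of-text token.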
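The merge loop can be sketched as follows. This is a minimal character-level illustration, not GPT-2's actual byte-level implementation; the toy corpus, the function names, and the tie-breaking rule (lexicographically smallest pair) are assumptions made for the example.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus_counts, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} mapping."""
    words = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        # Highest count wins; ties go to the lexicographically smallest pair.
        best = min(counts, key=lambda p: (-counts[p], p))
        merges.append(best)
        words = merge_pair(words, best)
    return merges


# Hypothetical toy corpus: word -> number of occurrences.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = train_bpe(corpus, 3)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

Each learned merge adds one token to the vocabulary, so the number of merges directly controls the final vocabulary size.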

WordPiece

Similar to BPE, but at each step it chooses the merge that most increases the likelihood of the training data rather than the pair with the highest raw frequency. Used by BERT.
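To see how a likelihood-style criterion can differ from raw frequency, the sketch below uses one commonly described formulation of the WordPiece score: count(pair) / (count(first) × count(second)). That formula is an assumption here (the text does not specify it), the toy corpus is hypothetical, and the `##` continuation prefix that BERT's vocabulary uses is omitted for brevity.

```python
from collections import Counter


def wordpiece_scores(words):
    """Score each adjacent pair as count(pair) / (count(a) * count(b))."""
    symbol_counts = Counter()
    pair_counts = Counter()
    for symbols, freq in words.items():
        for s in symbols:
            symbol_counts[s] += freq
        for pair in zip(symbols, symbols[1:]):
            pair_counts[pair] += freq
    scores = {
        (a, b): n / (symbol_counts[a] * symbol_counts[b])
        for (a, b), n in pair_counts.items()
    }
    return scores, pair_counts


# Hypothetical toy corpus: word -> number of occurrences.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(w): f for w, f in corpus.items()}

scores, pair_counts = wordpiece_scores(words)
best_by_score = max(scores, key=scores.get)
print(best_by_score)  # ('i', 'd')
```

On this corpus the score favours ('i', 'd'), a pair whose parts almost never occur apart, even though pairs such as ('e', 's') are three times as frequent. Normalising by the counts of the individual parts is what separates this criterion from plain frequency counting.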

SentencePiece / Unigram

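SentencePiece is a language-agnostic tokeniser library that works directly on raw text, without assuming whitespace-separated words; it implements both BPE and Unigram. The Unigram algorithm treats the tokenisation problem probabilistically and can produce multiple valid segmentations of the same text. SentencePiece is used in LLaMA (with BPE), T5 (with Unigram), and many multilingual models.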
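The idea that one string has several valid segmentations, each with its own probability, can be illustrated with a brute-force sketch. The vocabulary and its probabilities below are hypothetical, and real implementations use dynamic programming (Viterbi) rather than enumerating every split.

```python
import math

# Hypothetical unigram vocabulary with illustrative probabilities.
token_probs = {
    "un": 0.10, "related": 0.05, "relate": 0.04, "d": 0.02,
    "u": 0.01, "n": 0.01, "unrelated": 0.001,
}


def segmentations(text):
    """Yield every way to split `text` into in-vocabulary tokens."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in token_probs:
            for rest in segmentations(text[end:]):
                yield [piece] + rest


def log_prob(tokens):
    """Log-probability of a segmentation under the unigram model."""
    return sum(math.log(token_probs[t]) for t in tokens)


# Rank all candidate segmentations, most probable first.
candidates = sorted(segmentations("unrelated"), key=log_prob, reverse=True)
best = candidates[0]
print(best)  # ['un', 'related']
```

Here "unrelated" has five valid segmentations. A tokeniser can return the most probable one, or sample among them during training as a form of regularisation.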

Tokenisation in GPT-2

Positional encoding

Because the Transformer attention operation on its own is permutation-equivariant (reordering the input tokens merely reorders the outputs, so the model effectively sees a set, not a sequence), position must be injected separately. Two main approaches:

The final embedding = token embedding + positional embedding, giving the model both semantic meaning and sequence order.
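The sum can be sketched with a fixed sinusoidal encoding of the kind used in the original Transformer paper. Choosing that scheme is an assumption made for illustration (GPT-2, for instance, learns its position table instead), and the token vector below is hypothetical.

```python
import math


def sinusoidal_position(pos, dim):
    """Fixed sinusoidal encoding for one position (assumed scheme)."""
    vec = []
    for i in range(dim):
        # Dimensions 2k and 2k+1 share one frequency: sin on even, cos on odd.
        angle = pos / (10000 ** ((i - i % 2) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def add_positions(token_embeddings):
    """Return token embedding + positional embedding for each position."""
    dim = len(token_embeddings[0])
    return [
        [t + p for t, p in zip(tok, sinusoidal_position(pos, dim))]
        for pos, tok in enumerate(token_embeddings)
    ]


# The same (hypothetical) token vector appearing at positions 0 and 1.
same_token = [0.5, -0.2, 0.1, 0.9]
inputs = add_positions([same_token, same_token])
print(inputs[0] != inputs[1])  # True
```

The same token yields a different input vector at each position, which is how the model can tell repeated tokens apart by where they occur.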

Token count vs word count

Tokens are not words:

Resources