Tokenisation

Overview

Tokenisation is the process of breaking raw text into discrete units called tokens before a language model processes it. Tokens are the atomic inputs to a model’s embedding layer: each token in the vocabulary has a unique integer ID, and that ID selects a row of the embedding matrix. Tokens can correspond to whole words, sub-words, individual characters, or byte sequences depending on the algorithm used. The vocabulary (the complete set of valid tokens) is learned from a corpus and then fixed before the model itself is trained.
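
As a minimal sketch of the text-to-ID-to-vector pipeline described above, assuming a hypothetical four-entry vocabulary and plain whitespace splitting (real tokenisers use learned sub-word vocabularies), the names vocab, encode and embedding_matrix below are illustrative only:

```python
import random

# Hypothetical toy vocabulary, fixed up front: token string -> integer ID.
vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}

def encode(text):
    """Map whitespace-separated words to IDs, falling back to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

# Toy embedding matrix: one row of d_model floats per vocabulary entry.
# Random values stand in for weights that would normally be learned.
d_model = 4
rng = random.Random(0)
embedding_matrix = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in vocab]

token_ids = encode("The cat sat")
embedded = [embedding_matrix[i] for i in token_ids]  # lookup is plain row indexing
print(token_ids)  # [0, 1, 2]
```

The lookup step is just indexing: a token ID never carries meaning by itself; it only selects a row.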

Why not characters or words?

Character-level: very long sequences, loses morphological groupings
Word-level: huge vocabulary, cannot handle rare or misspelled words, no sharing between related forms
Sub-word (dominant approach): compact vocabulary, handles novel words by decomposing them, shares embeddings across morphological variants
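
The trade-off above can be sketched on a single word. This is an illustration under stated assumptions: both vocabularies are hypothetical, and the sub-word split uses greedy longest-match as a simple stand-in, not a trained BPE or Unigram model.

```python
text = "unbelievably"

# Character-level: every character is a token, so sequences get long.
char_tokens = list(text)

# Word-level: a hypothetical closed vocabulary; unseen words collapse to <unk>.
word_vocab = {"believe", "believable", "un"}
word_tokens = [w if w in word_vocab else "<unk>" for w in text.split()]

# Sub-word: greedy longest-match against a hypothetical sub-word vocabulary.
subword_vocab = {"un", "believ", "ably", "able"}

def greedy_subwords(word, vocab):
    tokens, start = [], 0
    while start < len(word):
        # Try the longest candidate first; fall back to a single character.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab or end == start + 1:
                tokens.append(word[start:end])
                start = end
                break
    return tokens

subword_tokens = greedy_subwords(text, subword_vocab)
print(len(char_tokens))  # 12
print(word_tokens)       # ['<unk>']
print(subword_tokens)    # ['un', 'believ', 'ably']
```

The word-level tokeniser loses the word entirely, the character-level one needs 12 positions, and the sub-word split keeps it to three reusable pieces.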

Common sub-word algorithms

Byte Pair Encoding (BPE)

Start with individual bytes/characters; repeatedly merge the most frequent adjacent pair until a target vocabulary size is reached. Used by GPT-2, GPT-3, GPT-4. GPT-2 has a vocabulary of 50,257 tokens.
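
The merge loop can be sketched as follows. This is a minimal character-level training sketch on a toy corpus, not GPT-2's byte-level implementation; the function names are illustrative, and the loop stops after a fixed number of merges, which is equivalent to a target vocabulary size (base symbols plus merges).

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn an ordered list of merges from a whitespace-split corpus."""
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # on ties, the first-seen pair wins
        merges.append(best)
        words = merge_pair(words, best)
    return merges

merges = train_bpe("low low low lower lowest", num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

At encoding time the learned merges are replayed in the same order on new text, which is why the merge list, not just the final vocabulary, has to be stored.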

WordPiece

Similar to BPE but merges based on likelihood increase rather than raw frequency. Used by BERT.
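
To illustrate the difference in merge criterion, here is a sketch of one commonly described formulation of the WordPiece score, count(ab) / (count(a) × count(b)). This is an assumption about the scoring rule, not a confirmed reproduction of BERT's training code, and the corpus is hypothetical.

```python
from collections import Counter

def pair_scores(words):
    """Score each adjacent pair as count(ab) / (count(a) * count(b))."""
    symbol_counts, pair_counts = Counter(), Counter()
    for symbols, freq in words.items():
        for s in symbols:
            symbol_counts[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    scores = {
        (a, b): n / (symbol_counts[a] * symbol_counts[b])
        for (a, b), n in pair_counts.items()
    }
    return scores, pair_counts

# Hypothetical toy corpus: word (as a tuple of symbols) -> frequency.
words = {("a", "b"): 10, ("a", "c"): 10, ("d", "b"): 10, ("x", "y"): 2}
scores, pair_counts = pair_scores(words)

most_frequent = max(pair_counts, key=pair_counts.get)  # what raw-frequency BPE would merge
best_scored = max(scores, key=scores.get)              # what the score above would merge
print(most_frequent, best_scored)  # ('a', 'b') ('x', 'y')
```

The score favours pairs whose parts rarely occur apart, so a rare but tightly bound pair can outrank a frequent pair made of common symbols.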

SentencePiece / Unigram

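SentencePiece is a language-agnostic tokeniser library that operates on raw text without pre-splitting on whitespace and implements both BPE and the Unigram language model. The Unigram algorithm treats tokenisation probabilistically: each token has a probability, so a string can have multiple valid segmentations, and the most likely one is normally chosen. SentencePiece is used in the original LLaMA (in BPE mode), T5 (Unigram), and many multilingual models.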
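
The idea of multiple valid segmentations can be sketched with a toy unigram model. The token probabilities below are hypothetical, and the brute-force enumeration is for illustration only; practical implementations use dynamic programming (Viterbi) instead.

```python
import math

# Hypothetical unigram probabilities for a toy vocabulary.
probs = {
    "h": 0.05, "u": 0.05, "g": 0.05, "s": 0.05,
    "hu": 0.1, "ug": 0.2, "gs": 0.1, "ugs": 0.1, "hug": 0.3,
}

def segmentations(word):
    """Yield every way to split `word` into in-vocabulary pieces."""
    if not word:
        yield []
        return
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in probs:
            for rest in segmentations(word[end:]):
                yield [piece] + rest

def log_prob(pieces):
    """Log-probability of a segmentation under the unigram assumption."""
    return sum(math.log(probs[p]) for p in pieces)

candidates = sorted(segmentations("hugs"), key=log_prob, reverse=True)
best = candidates[0]
print(len(candidates))  # 7
print(best)             # ['hug', 's']
```

Because every candidate has a score, a tokeniser of this kind can also sample from the lower-ranked segmentations rather than always taking the best one.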

Tokenisation in GPT-2

Vocabulary size: 50,257 tokens
Algorithm: BPE over byte sequences (handles any Unicode without unknown-token issues)
Common words (“Data”, “visualization”) often map to a single token each
Less common words are split: “empowers” → [“emp”, “owers”]
The token embedding matrix has shape (50,257, 768), holding ~39M parameters in GPT-2 small
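
The parameter count follows directly from the shape; a quick check of the arithmetic, using only the two figures stated above:

```python
vocab_size = 50_257  # GPT-2 vocabulary size
d_model = 768        # embedding width in GPT-2 small

embedding_params = vocab_size * d_model
print(f"{embedding_params:,}")  # 38,597,376
```

That is about 38.6M, which rounds to the ~39M quoted above.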

Positional encoding

Because self-attention has no built-in notion of order (it is permutation-equivariant: shuffling the input tokens just shuffles the outputs, so it effectively sees a set, not a sequence), position must be injected separately. Two main approaches:

Learned positional embeddings (GPT-2): a separate (max_seq_len, d_model) matrix is trained from scratch; position IDs are looked up just like token IDs and summed with token embeddings
Sinusoidal positional encoding (original Transformer): fixed sine and cosine functions of position; not learned. BERT, by contrast, uses learned positional embeddings, like GPT-2

The final embedding = token embedding + positional embedding, giving the model both semantic meaning and sequence order.
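
A minimal pure-Python sketch of the sum above, using the fixed sinusoidal table from the original Transformer paper. A learned table, as in GPT-2, would have the same (max_seq_len, d_model) shape but trained values. The tiny dimensions and constant token vectors are illustrative assumptions.

```python
import math

def sinusoidal_encoding(max_seq_len, d_model):
    """Fixed encoding: sin on even dimensions, cos on odd dimensions."""
    table = []
    for pos in range(max_seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model).
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(token_embeddings, position_table):
    """Final embedding = token embedding + positional embedding, element-wise."""
    return [
        [t + p for t, p in zip(tok, position_table[pos])]
        for pos, tok in enumerate(token_embeddings)
    ]

d_model = 4
positions = sinusoidal_encoding(max_seq_len=8, d_model=d_model)
token_embeddings = [[0.1] * d_model, [0.1] * d_model]  # the same token at positions 0 and 1
final = add_positions(token_embeddings, positions)
print(positions[0])          # [0.0, 1.0, 0.0, 1.0]
print(final[0] != final[1])  # True
```

The same token ends up with a different final vector at each position, which is what lets attention distinguish order.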

Token count vs word count

Tokens are not words:

English text averages roughly 1.3–1.5 tokens per word
Code and non-English text may tokenise less efficiently
Context-window limits (e.g. 128K tokens) are therefore shorter in word-equivalents than the raw number implies
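
A rough conversion under the ratios above, assuming that “128K” means 128,000 tokens and that the text is ordinary English prose (word_equivalent is an illustrative helper, not a library function):

```python
def word_equivalent(context_tokens, tokens_per_word):
    """Approximate number of words that fit in a token budget."""
    return context_tokens / tokens_per_word

context_tokens = 128_000  # assumption: "128K" taken as 128,000
for ratio in (1.3, 1.5):
    words = word_equivalent(context_tokens, ratio)
    print(f"{ratio} tokens/word -> ~{words:,.0f} words")
# 1.3 tokens/word -> ~98,462 words
# 1.5 tokens/word -> ~85,333 words
```

So a 128K-token window holds roughly 85K to 98K English words, and fewer for code or languages that tokenise less efficiently.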

Transformer architecture: tokenisation is the first preprocessing step before any Transformer layer
Embedding matrix: the lookup table that maps token IDs to dense vectors
Attention mechanism: operates on the embedded token sequences
Tokeniser implementation is an early milestone in from-scratch curricula

Resources

2026-06-23 ◦ Transformer Explainer (Polo Club, Georgia Tech) — interactive tokenisation step: shows how “Data visualization empowers users to” is split into GPT-2 BPE tokens with IDs, then embedded as 768-dimensional vectors
2026-06-23 ◦ How generative AI works (FT interactive) — visual explainer on how text becomes tokens; paywalled, to read