Overview

Word embeddings are dense numerical vector representations of tokens (words or sub-words) in a high-dimensional space. The core property is that tokens with similar meanings or usage patterns are placed close together in this space, while dissimilar tokens are farther apart. Embeddings are the bridge between discrete symbolic text and the continuous mathematics of neural networks.

Embedding matrix

In a Transformer, the embedding layer is a matrix of shape (vocab_size, d_model). Each row is the embedding vector for one token. Given a token ID, its embedding is retrieved by a simple lookup (indexing into the matrix). The entire matrix is a learned parameter updated during training.
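The lookup described above can be sketched in a few lines. This is a minimal illustration, not any library's implementation: the helper names (make_embedding_matrix, embed), the toy sizes, and the random initialisation are all assumptions made for the example. In a real model the matrix values are learned.

```python
import random

def make_embedding_matrix(vocab_size, d_model, seed=0):
    """Build a vocab_size x d_model matrix of small random floats.

    Stand-in for learned weights; the 0.02 spread is an arbitrary choice.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
            for _ in range(vocab_size)]

def embed(token_ids, matrix):
    """Return one matrix row per token ID. This is pure indexing, no arithmetic."""
    return [matrix[t] for t in token_ids]

# Toy sizes chosen for readability only.
vocab_size, d_model = 10, 4
E = make_embedding_matrix(vocab_size, d_model)

# Token 3 appears twice and maps to the same row both times.
vectors = embed([3, 7, 3], E)
```

Because the lookup ignores context, a repeated token ID always yields the same vector at this stage.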

GPT-2 (small) specifics:

Semantic geometry

The high-dimensional space encodes meaning geometrically:

This geometry emerges from training on next-token prediction: the model learns representations that make prediction tractable, and because tokens with similar meanings tend to appear in similar contexts, placing them close together helps prediction.
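One common way to measure "closeness" in an embedding space is cosine similarity. The notes do not say which metric is intended, so treat that choice as an assumption. The three vectors below are invented by hand for illustration; they are not taken from any trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1.0 means same direction, near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors (hypothetical, 3 dimensions for readability).
toy = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

sim_cat_dog = cosine_similarity(toy["cat"], toy["dog"])
sim_cat_car = cosine_similarity(toy["cat"], toy["car"])

# The two animal vectors point in nearly the same direction; "car" does not.
print(sim_cat_dog > sim_cat_car)
```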

Token embedding vs contextual embedding

Positional encoding and the final embedding

Because self-attention has no built-in notion of token order (without position information, reordering the input tokens simply reorders the outputs), a positional embedding vector is added to each token embedding before the first Transformer block:

final_embedding = token_embedding + positional_embedding

This combined vector carries both the semantic identity of the token and its position in the sequence.
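The sum above can be sketched as follows. The tables and the helper name final_embeddings are hypothetical and use tiny hand-picked values. The point is only that the same token ID produces different final vectors at different positions.

```python
def final_embeddings(token_ids, token_table, pos_table):
    """Add the position row for index i to the token row for token_ids[i], element-wise."""
    out = []
    for i, tok in enumerate(token_ids):
        tok_vec = token_table[tok]
        pos_vec = pos_table[i]
        out.append([t + p for t, p in zip(tok_vec, pos_vec)])
    return out

# Toy tables (d_model = 2); values are illustrative, not learned.
token_table = [
    [1.0, 0.0],  # token 0
    [0.0, 1.0],  # token 1
]
pos_table = [
    [0.25, 0.25],  # position 0
    [0.5, 0.5],    # position 1
    [0.75, 0.75],  # position 2
]

# Token 1 appears at positions 0 and 2.
final = final_embeddings([1, 0, 1], token_table, pos_table)
```

Here final[0] and final[2] come from the same token row but differ because different position rows were added.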

GPT-2 positional embeddings

GPT-2 trains a separate positional embedding matrix of shape (max_seq_len, 768), where max_seq_len is 1024. This matrix is a learned parameter, unlike the fixed sinusoidal scheme in the original Transformer paper, and the row for each position is summed element-wise with the token embedding at that position.
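To make the contrast concrete, the sketch below builds both kinds of position table at toy size. The sinusoidal function follows the formula from the original Transformer paper (sine on even dimensions, cosine on odd ones). The "learned" table is shown only at initialisation, and its random spread of 0.02 is an assumption rather than GPT-2's actual setting. Function names are hypothetical.

```python
import math
import random

def sinusoidal_position(pos, d_model):
    """Fixed encoding for one position: sin on even dims, cos on odd dims."""
    row = []
    for i in range(d_model):
        # Dimensions 2k and 2k+1 share the same frequency.
        angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
        row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return row

def init_learned_positions(max_seq_len, d_model, seed=0):
    """Learned table at initialisation: random values that training would then update."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
            for _ in range(max_seq_len)]

# Toy sizes; GPT-2 uses a (1024, 768) table.
max_seq_len, d_model = 8, 4
fixed = [sinusoidal_position(p, d_model) for p in range(max_seq_len)]
learned = init_learned_positions(max_seq_len, d_model)
```

The fixed table is fully determined by the formula and has no parameters, while the learned table adds max_seq_len × d_model trainable values.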

Embeddings beyond words

The embedding paradigm generalises:
