Attention mechanism

Overview

The attention mechanism is the core building block of the Transformer architecture, which was introduced in “Attention Is All You Need” (Vaswani et al., 2017). It lets each token in a sequence attend to every other token, with context-dependent weights that determine how much information to pull from each one. This replaces the fixed-window or recurrent processing of earlier architectures and lets a single layer relate any two positions directly, however far apart they are.

Query, Key, Value (Q/K/V)

For each token embedding, three vectors are computed by multiplying with learned weight matrices:

Query (Q): “what am I looking for?” — the current token’s question to the rest of the sequence
Key (K): “what do I contain?” — what each other token offers to be matched against
Value (V): “what do I contribute?” — the actual information extracted once a match is scored

Web-search analogy: Q is the search query you type; K is the page title in results; V is the page content you read once you click.
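The three projections can be sketched in a few lines of NumPy. This is a minimal illustration, not any model's actual code: the sizes are arbitrary, and the random matrices W_q, W_k, W_v are hypothetical stand-ins for trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumed, not taken from a real model).
n_tokens, d_model, d_k = 4, 8, 4

# Token embeddings: one row per token.
X = rng.normal(size=(n_tokens, d_model))

# Projection matrices; random here, learned in a real model.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q  # what each token is looking for
K = X @ W_k  # what each token offers for matching
V = X @ W_v  # what each token contributes once matched

print(Q.shape, K.shape, V.shape)  # (4, 4) (4, 4) (4, 4)
```

Every token gets its own row in Q, K and V, so the whole sequence is projected with three matrix multiplications.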

Attention scores are computed as:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) · V

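The scaling by sqrt(d_k) keeps the dot products from growing with dimension: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance d_k. Unscaled scores would therefore push softmax into saturation, where gradients become very small.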
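The formula and the effect of the scaling can be checked with a short sketch. It assumes random Q, K, V with unit-variance entries and d_k = 64; the helper names softmax and attention are illustrative, not a library API.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k = 5, 64  # assumed sizes for the demonstration
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_k))

out, weights = attention(Q, K, V)

# Same scores without the 1/sqrt(d_k) factor, for comparison.
unscaled = softmax(Q @ K.T, axis=-1)

print("largest weight per row, scaled:  ", weights.max(axis=-1).round(2))
print("largest weight per row, unscaled:", unscaled.max(axis=-1).round(2))
```

With the scaling removed, the largest weight in each row moves towards 1, which is the saturation the scaling is meant to avoid.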

Masked (causal) self-attention

In decoder-only models (GPT family), a causal mask is applied: entries strictly above the diagonal of the attention score matrix are set to −∞ before softmax, so each position can attend only to itself and earlier positions. This enforces the autoregressive property: the model cannot “peek” at future tokens during training.

Steps inside one attention head:

Compute QK^T dot-product matrix (token × token attention scores)
Scale by 1/sqrt(d_k)
Apply upper-triangle mask (set future positions to −∞)
Softmax over each row → probability weights summing to 1
Optional dropout on the attention weights (training only)
Multiply the weights by the V matrix → weighted sum of value vectors per token
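The steps for one head can be sketched as a single NumPy function. This is a minimal illustration under assumed shapes (Q, K, V each of shape n × d_k); the function name and the dropout_p parameter are hypothetical. Dropout, when enabled, acts on the attention weights before they multiply V.

```python
import numpy as np

def causal_attention_head(Q, K, V, dropout_p=0.0, rng=None):
    n, d_k = Q.shape

    scores = Q @ K.T                    # token x token dot products
    scores = scores / np.sqrt(d_k)      # scale by 1/sqrt(d_k)

    # True strictly above the diagonal, i.e. at future positions.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Row-wise softmax; exp(-inf) = 0, so future positions get zero weight.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Optional (inverted) dropout on the attention weights.
    if dropout_p > 0.0:
        if rng is None:
            rng = np.random.default_rng()
        keep = rng.random(weights.shape) >= dropout_p
        weights = weights * keep / (1.0 - dropout_p)

    return weights @ V, weights         # weighted sum of value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, weights = causal_attention_head(Q, K, V)
print(weights.round(2))  # lower-triangular; each row sums to 1
```

The first token can only attend to itself, so its output row equals its own value vector.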

Multi-head attention

Instead of one attention operation, the model runs h parallel heads, each with independent Q/K/V weight matrices:

GPT-2 (small): 12 heads; each head operates on a 64-dimensional slice of the 768-dimensional embedding
Different heads can specialise: one may capture subject–verb agreement, another long-range coreference, another syntactic structure
Head outputs are concatenated and linearly projected back to d_model

Multi-head attention allows the model to simultaneously attend to information from different representational subspaces at different positions.
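A minimal NumPy sketch of the split, attend, concatenate, project sequence, using the GPT-2 (small) sizes quoted above (768 dimensions, 12 heads, 64 per head). The weights are random placeholders, the function name is hypothetical, and the causal mask is left out for brevity. Projecting with one d_model × d_model matrix and slicing the result into 12 column blocks is equivalent to giving each head its own 768 × 64 matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    n, d_model = X.shape
    d_head = d_model // n_heads

    def split_heads(M):
        # (n, d_model) -> (n_heads, n, d_head)
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = (split_heads(X @ W) for W in (W_q, W_k, W_v))

    # One (n, n) score matrix per head, computed in a batch.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                       # (n_heads, n, d_head)

    # Concatenate head outputs, then project back to d_model.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 6, 768, 12
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v, W_o = (
    rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4)
)

out = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape)  # (6, 768)
```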

Cross-attention vs self-attention

Self-attention: Q, K, V all come from the same sequence (used in encoder and decoder self-attention layers)
Cross-attention: Q comes from the decoder, K and V come from the encoder — used in encoder-decoder Transformers (e.g. T5, BART) to condition generation on an encoded input

GPT-2 and other decoder-only models use only masked self-attention.
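The difference shows up in where the inputs come from and in the shape of the weight matrix. The sketch below is illustrative only: the sizes are arbitrary, the helper attend is hypothetical, and it reuses one set of random projection matrices for both cases, whereas real encoder-decoder models keep separate weights for each attention layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16
n_enc, n_dec = 7, 3  # assumed encoder and decoder sequence lengths

enc = rng.normal(size=(n_enc, d_model))  # encoder output states
dec = rng.normal(size=(n_dec, d_model))  # decoder states

# Shared random projections, purely for brevity.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))

def attend(queries_from, keys_values_from):
    Q = queries_from @ W_q
    K = keys_values_from @ W_k
    V = keys_values_from @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

self_out, self_w = attend(dec, dec)    # self-attention: Q, K, V from dec
cross_out, cross_w = attend(dec, enc)  # cross-attention: K, V from enc

print(self_w.shape)   # (3, 3)
print(cross_w.shape)  # (3, 7)
```

In both cases the output has one row per query token; only the set of positions being attended to changes.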

Computational complexity

Standard self-attention takes O(n²·d) time for sequence length n and dimension d, and materialises an n × n score matrix per head. This quadratic cost in n becomes a bottleneck for very long sequences and has motivated more efficient variants, both approximate and exact:

Sparse attention (Longformer, BigBird)
Linear attention approximations
FlashAttention (IO-aware exact attention via tiling)
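A rough back-of-the-envelope calculation shows why the n × n term dominates. The figures below are assumptions for illustration: 12 heads, 2 bytes per value (fp16), one layer, one sequence, and a score matrix that is fully materialised in memory.

```python
def score_matrix_bytes(n_tokens, n_heads=12, bytes_per_value=2):
    """Memory for the n x n attention scores of one layer, all heads."""
    return n_tokens * n_tokens * n_heads * bytes_per_value

for n in (1_024, 8_192, 65_536):
    mib = score_matrix_bytes(n) / 2**20
    print(f"n={n:>6}: {mib:,.0f} MiB")
```

Under these assumptions the scores take 24 MiB at 1,024 tokens, 1,536 MiB at 8,192 tokens, and 98,304 MiB at 65,536 tokens: each 8× increase in length costs 64× the memory. Tiling approaches such as FlashAttention avoid writing the full matrix to main GPU memory, while sparse and linear variants reduce the number of scores computed.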

Transformer architecture — the full architecture in which attention is embedded
— the input vectors that Q/K/V projections are applied to
— how text is split before embeddings and attention are applied
— mathematical derivation of attention from first principles

Resources

2026-06-23 ◦ Transformer Explainer (Polo Club, Georgia Tech) — interactive step-by-step Q/K/V walkthrough with live GPT-2; visualises 12-head splitting, masked attention matrix, and output concatenation
2026-06-23 ◦ How generative AI works (FT interactive) — visual treatment of attention in the context of generative AI; paywalled, to read