Overview

The attention mechanism is the core of the Transformer architecture, which was introduced in “Attention Is All You Need” (Vaswani et al., 2017). It allows each token in a sequence to attend to every other token, weighting how much information to pull from each depending on context. This replaces the fixed-window or recurrent processing of earlier architectures and lets a single layer capture long-range dependencies across the entire sequence.

Query, Key, Value (Q/K/V)

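For each token embedding, three vectors are computed by multiplying with learned weight matrices. Writing X for the matrix of token embeddings (one row per token) and W_Q, W_K, W_V for the three weight matrices:

Q = X · W_Q
K = X · W_K
V = X · W_V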

Web-search analogy: Q is the search query you type; K is the page title in results; V is the page content you read once you click.
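As a rough illustration of the projection step, the sketch below builds Q, K and V from a small matrix of token embeddings. The sizes, the random weights and the names X, W_q, W_k and W_v are assumptions made for the example; in a trained model the weight matrices are learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (not taken from any particular model).
n_tokens, d_model, d_k = 4, 8, 3

# One embedding per row.
X = rng.normal(size=(n_tokens, d_model))

# Random stand-ins for the learned projection matrices.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q  # queries: what each token is looking for
K = X @ W_k  # keys: what each token can be matched on
V = X @ W_v  # values: the content each token contributes

print(Q.shape, K.shape, V.shape)  # (4, 3) (4, 3) (4, 3)
```

Each row of Q, K and V belongs to one token, so row i of Q is the query vector for token i.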

Attention scores are computed as:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) · V

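The scaling by sqrt(d_k) keeps the scores in a stable range. If the components of a query and a key are roughly independent with zero mean and unit variance, their dot product has variance d_k, so unscaled scores grow with the dimension. Large scores push softmax into saturation, where almost all weight lands on one position and gradients become very small.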
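The formula and the effect of the scaling can be checked numerically. The following is a minimal NumPy sketch, not a reference implementation: the helper names softmax and attention, the sample sizes and the random inputs are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis first for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) · V for 2-D arrays (tokens × features)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
d_k = 64

# Dot products of random unit-variance vectors: spread grows like sqrt(d_k).
q = rng.normal(size=(10_000, d_k))
k = rng.normal(size=(10_000, d_k))
raw = np.sum(q * k, axis=1)
scaled = raw / np.sqrt(d_k)
print(raw.std(), scaled.std())  # roughly 8 (= sqrt(64)) and roughly 1

# A small attention call on random inputs.
Q, K, V = (rng.normal(size=(5, d_k)) for _ in range(3))
out, weights = attention(Q, K, V)
print(out.shape)  # (5, 64)
```

With d_k = 64 the unscaled dot products have a standard deviation near 8, and dividing by sqrt(d_k) brings it back to about 1.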

Masked (causal) self-attention

In decoder-only models (GPT family), a causal mask is applied: the entries above the main diagonal of the attention score matrix are set to −∞ before softmax, so each position can only attend to itself and earlier positions. This enforces the autoregressive property: the model cannot “peek” at future tokens during training.

Steps inside one attention head:

  1. Compute QK^T dot-product matrix (token × token attention scores)
  2. Scale by 1/sqrt(d_k)
  3. Apply upper-triangle mask (set future positions to −∞)
  4. Softmax over each row → probability weights summing to 1
  5. Optional dropout on the attention weights
  6. Multiply by V matrix → weighted sum of value vectors per token
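The steps above can be sketched for a single head in NumPy. This is a minimal illustration under stated assumptions: the function name causal_attention_head, the inverted-dropout scaling and the random inputs are chosen for the example. Because dropout acts on the attention weights, the sketch applies it before the weights multiply V.

```python
import numpy as np

def causal_attention_head(Q, K, V, dropout_p=0.0, rng=None):
    """One masked self-attention head for 2-D arrays (tokens × features)."""
    n, d_k = Q.shape

    # Dot-product scores between every pair of tokens, then scaling.
    scores = Q @ K.T
    scores = scores / np.sqrt(d_k)

    # Causal mask: True strictly above the diagonal marks future positions.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Row-wise softmax; exp(-inf) = 0, so future positions get zero weight.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Optional (inverted) dropout on the weights; an assumption of this sketch.
    if dropout_p > 0.0:
        rng = rng if rng is not None else np.random.default_rng()
        keep = rng.random(weights.shape) >= dropout_p
        weights = weights * keep / (1.0 - dropout_p)

    # Weighted sum of value vectors for each token.
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, weights = causal_attention_head(Q, K, V)
print(np.round(weights, 2))  # lower-triangular; each row sums to 1
```

The first token can only attend to itself, so its weight row is (1, 0, 0, 0) and its output equals its own value vector.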

Multi-head attention

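Instead of one attention operation, the model runs h parallel heads, each with independent Q/K/V weight matrices. With X the matrix of token embeddings, the head outputs are concatenated and passed through an output projection W_O:

head_i = Attention(X · W_Q_i, X · W_K_i, X · W_V_i)
MultiHead(X) = Concat(head_1, ..., head_h) · W_O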

Multi-head attention allows the model to simultaneously attend to information from different representational subspaces at different positions.
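A compact way to sketch multi-head self-attention is to pack the per-head projections into one matrix per role and split the result into heads. The code below assumes a head size of d_model / h, an output projection W_o, and random weights; the function and variable names are illustrative and not taken from any library.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, h):
    """Unmasked multi-head self-attention over X (tokens × d_model)."""
    n, d_model = X.shape
    d_head = d_model // h  # assumes d_model is divisible by h

    def split(M):
        # (n, d_model) -> (h, n, d_head); head i uses columns
        # i*d_head .. (i+1)*d_head of the packed projection.
        return M.reshape(n, h, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)

    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (h, n, n)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                   # (h, n, d_head)

    # Concatenate the heads along the feature axis, then project.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_self_attention(X, W_q, W_k, W_v, W_o, h)
print(out.shape, weights.shape)  # (5, 16) (4, 5, 5)
```

Each head produces its own n × n weight matrix, so different heads can weight the same pair of tokens differently.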

Cross-attention vs self-attention

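In self-attention, Q, K and V are all computed from the same sequence. In cross-attention, as used in the decoder of the original encoder-decoder Transformer, Q comes from the decoder while K and V come from the encoder output. GPT-2 and other decoder-only models use only (masked) self-attention.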
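The difference between the two variants comes down to where Q, K and V are computed from. The sketch below is illustrative only: the arrays decoder_states and encoder_states are hypothetical stand-ins, the same random weights are reused for both calls for brevity (real models keep separate weights per layer), and no causal mask is applied.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 8
decoder_states = rng.normal(size=(3, d))  # hypothetical: 3 target-side tokens
encoder_states = rng.normal(size=(5, d))  # hypothetical: 5 source-side tokens
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

# Self-attention: queries, keys and values all come from one sequence.
self_out, self_w = attention(
    decoder_states @ W_q, decoder_states @ W_k, decoder_states @ W_v
)

# Cross-attention: queries from one sequence, keys and values from another.
cross_out, cross_w = attention(
    decoder_states @ W_q, encoder_states @ W_k, encoder_states @ W_v
)

print(self_w.shape)   # (3, 3)
print(cross_w.shape)  # (3, 5)
```

The weight matrix is square for self-attention and rectangular for cross-attention, with one row per query token and one column per key token.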

Computational complexity

Standard self-attention takes O(n²·d) time in sequence length n and dimension d, and the score matrix alone has n² entries. This quadratic cost in n becomes a bottleneck for very long sequences and has motivated a range of approximate attention methods.
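To see where the n² comes from, a back-of-the-envelope count of multiply-adds for one head can help. The function attention_cost below is a hypothetical helper written for this illustration. It counts only the two matrix products and ignores softmax, projections and constant factors.

```python
def attention_cost(n, d):
    """Rough multiply-add count and score-matrix size for one attention head."""
    score_macs = n * n * d  # QK^T: n*n dot products, each of length d
    value_macs = n * n * d  # weights · V: n*d outputs, each a sum over n terms
    return {"macs": score_macs + value_macs, "score_entries": n * n}

for n in (1_024, 2_048, 4_096):
    cost = attention_cost(n, d=64)
    print(n, cost["macs"], cost["score_entries"])
```

Doubling n multiplies both counts by four, while doubling d only doubles the multiply-add count and leaves the score matrix size unchanged.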

Resources