Overview
The Transformer is a neural network architecture introduced in the 2017 paper “Attention Is All You Need” that has become the dominant architecture for deep learning across text, audio, image, and other modalities. Text-generative Transformers operate on the principle of next-token prediction: given an input prompt, the model outputs a probability distribution over the next token (a word or sub-word), and the next token is chosen from that distribution. The core innovation is the self-attention mechanism, which lets the model process an entire sequence in parallel and capture long-range dependencies more effectively than RNNs or LSTMs.
Three key components
Every text-generative Transformer has three major stages:
- Embedding: input text is broken into tokens, each converted to a numerical vector (an embedding) and augmented with positional encoding
- Transformer blocks: stacked layers, each containing a multi-head self-attention sub-layer and an MLP sub-layer; representations are refined with each layer
- Output probabilities: a linear projection and softmax layer convert the final embeddings into a probability distribution over the vocabulary for next-token sampling
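The embedding stage can be sketched in a few lines. This is a toy illustration only: the three-word vocabulary, the whitespace split, the table sizes and the names `token_table`, `position_table` and `embed` are all hypothetical, and the positional encoding is assumed to be a learned table that is added to the token vector (real models use sub-word tokenisers and much larger tables).

```python
import random

rng = random.Random(0)                      # seeded so the toy tables are repeatable

vocab = {"the": 0, "cat": 1, "sat": 2}      # hypothetical toy vocabulary
d_model = 4                                 # toy embedding width
max_len = 8                                 # toy maximum sequence length

# One row per token id, and one row per position (assumed learned tables).
token_table = [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in vocab]
position_table = [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(max_len)]

def embed(text):
    """Map text to one vector per token: token embedding + position embedding."""
    token_ids = [vocab[word] for word in text.split()]   # stand-in for a real tokeniser
    return [
        [t + p for t, p in zip(token_table[token_id], position_table[pos])]
        for pos, token_id in enumerate(token_ids)
    ]

vectors = embed("the cat sat the")
print(len(vectors), len(vectors[0]))        # number of tokens, embedding width
```

Because the position row differs, the two occurrences of "the" end up with different vectors, which is what lets later layers distinguish word order.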
Transformer block internals
Each block contains, in order:
- Layer normalisation (pre-attention)
- Multi-head self-attention
- Residual connection (un-normalised block input + attention output)
- Layer normalisation (pre-MLP)
- MLP (two linear layers with GELU activation, 4× expansion then compression)
- Residual connection (un-normalised MLP sub-layer input + MLP output)
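The ordering of the six steps above can be sketched as a small function. This is a structural sketch only: the sub-layers are passed in as plain callables, and the names `transformer_block`, `add` and the stand-in sub-layers in the usage lines are hypothetical, not taken from any library.

```python
from typing import Callable, List

Matrix = List[List[float]]            # one row per token, one column per feature
Layer = Callable[[Matrix], Matrix]

def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum, used for the residual connections."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def transformer_block(x: Matrix, ln1: Layer, attention: Layer,
                      ln2: Layer, mlp: Layer) -> Matrix:
    """Pre-norm block: normalise, apply the sub-layer, add back the input."""
    x = add(x, attention(ln1(x)))     # layer norm -> attention -> residual
    x = add(x, mlp(ln2(x)))           # layer norm -> MLP -> residual
    return x

# Usage with stand-in sub-layers, just to show the data flow.
identity = lambda m: m
double = lambda m: [[2.0 * v for v in row] for row in m]
print(transformer_block([[1.0, 2.0], [3.0, 4.0]], identity, double, identity, double))
# -> [[9.0, 18.0], [27.0, 36.0]]
```

Note that each residual adds the un-normalised input, so the normalisation only affects what the sub-layer sees, not the shortcut path.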
GPT-2 (small) stacks 12 such blocks and has 124 million parameters total.
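The 124 million figure can be checked with some arithmetic. The sketch below relies on several assumptions that this note does not state: a context length of 1024 with a learned position table, bias terms on every linear layer, a single fused query/key/value projection, a final layer norm after the last block, and an output projection that shares its weights with the token embedding (so it adds no parameters).

```python
# d_model, the layer count and the vocabulary size come from this note;
# everything else below is an assumption about GPT-2 small.
vocab_size = 50257
d_model = 768
n_layers = 12
context_len = 1024                      # assumption: size of the learned position table

def linear_params(n_in, n_out):
    """Weights plus biases of one linear layer."""
    return n_in * n_out + n_out

layer_norm_params = 2 * d_model         # learned scale and shift
attention_params = (
    linear_params(d_model, 3 * d_model) # fused query/key/value projection
    + linear_params(d_model, d_model)   # attention output projection
)
mlp_params = (
    linear_params(d_model, 4 * d_model) # expansion
    + linear_params(4 * d_model, d_model)  # compression
)
block_params = 2 * layer_norm_params + attention_params + mlp_params

embedding_params = vocab_size * d_model + context_len * d_model
total = embedding_params + n_layers * block_params + layer_norm_params  # + final layer norm

print(f"{block_params:,} per block, {total:,} total")
# -> 7,087,872 per block, 124,439,808 total
```

Under these assumptions roughly two thirds of each block's parameters sit in the MLP, and the token embedding table alone accounts for about 38.6 million.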
MLP sub-layer
The MLP processes each token independently (no cross-token communication):
- Linear layer expands from d_model (768 in GPT-2 small) to 4 × d_model (3072)
- GELU non-linearity
- Linear layer compresses back to d_model
While attention routes information between tokens, the MLP refines each token’s representation within the expanded space, encoding factual knowledge and higher-order patterns.
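A minimal sketch of the MLP sub-layer, applied to one token vector at a time. The sizes are toy values (d_model = 8 instead of 768), the weights are random rather than trained, and the helper names are hypothetical. The GELU here is the exact erf-based form; GPT-2 itself uses a tanh approximation, which is assumed to be close enough for illustration.

```python
import math
import random

def gelu(x):
    """Exact GELU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(vec, weight, bias):
    """vec @ weight + bias, with weight stored as rows of shape (n_in, n_out)."""
    return [
        sum(v * weight[i][j] for i, v in enumerate(vec)) + bias[j]
        for j in range(len(bias))
    ]

def mlp(token, w_in, b_in, w_out, b_out):
    """Expand to 4 x d_model, apply GELU, compress back to d_model."""
    hidden = [gelu(h) for h in linear(token, w_in, b_in)]
    return linear(hidden, w_out, b_out)

rng = random.Random(0)
d_model, d_hidden = 8, 4 * 8            # toy sizes; GPT-2 small uses 768 and 3072

def random_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

w_in, b_in = random_matrix(d_model, d_hidden), [0.0] * d_hidden
w_out, b_out = random_matrix(d_hidden, d_model), [0.0] * d_model

tokens = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(3)]
outputs = [mlp(t, w_in, b_in, w_out, b_out) for t in tokens]   # each token on its own
print(len(outputs), len(outputs[0]))    # 3 tokens, each back at width d_model
```

The list comprehension over `tokens` makes the independence explicit: no token's output depends on any other token's input.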
Auxiliary features
Residual connections
Shortcuts that add a layer’s input directly to its output, allowing gradients to flow through deep stacks without vanishing. First introduced by ResNet (2015); used twice per Transformer block in GPT-2.
Layer normalisation
Normalises each token’s activations across the feature dimension to mean ≈ 0 and variance ≈ 1, then applies a learned scale and shift. Applied before both the attention and MLP sub-layers (“pre-norm” placement). Stabilises training and improves convergence speed.
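A minimal sketch of layer normalisation for a single token vector. The learned scale and shift (called `gamma` and `beta` here) and the small `eps` constant are standard ingredients but are assumptions relative to this note, as is the function name.

```python
import math

def layer_norm(vec, gamma, beta, eps=1e-5):
    """Normalise one token's features to zero mean and unit variance,
    then apply a per-feature scale (gamma) and shift (beta)."""
    mean = sum(vec) / len(vec)
    variance = sum((v - mean) ** 2 for v in vec) / len(vec)
    inv_std = 1.0 / math.sqrt(variance + eps)   # eps guards against division by zero
    return [g * (v - mean) * inv_std + b for v, g, b in zip(vec, gamma, beta)]

features = [1.0, 2.0, 3.0, 4.0]
normalised = layer_norm(features, gamma=[1.0] * 4, beta=[0.0] * 4)
print(normalised)
```

With `gamma` at 1 and `beta` at 0 the output has mean 0 and variance very slightly below 1 (because of `eps`); training is free to move both parameters away from those defaults.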
Dropout
Randomly zeroes a fraction of activations during training to prevent overfitting; it is deactivated at inference, where the full network acts approximately as an ensemble of the sub-networks sampled during training.
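A minimal sketch of dropout, assuming the common "inverted" formulation in which the mask is applied to a vector of activations and the surviving values are rescaled by 1 / (1 - p) during training, so that nothing needs to change at inference. The function name and arguments are hypothetical.

```python
import random

def dropout(activations, p, training, rng):
    """Inverted dropout: zero each value with probability p and rescale the rest."""
    if not training or p == 0.0:
        return list(activations)            # inference: pass values through unchanged
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)                      # seeded only to make the demo repeatable
hidden = [1.0] * 10
print(dropout(hidden, p=0.1, training=True, rng=rng))    # some zeros, rest scaled up
print(dropout(hidden, p=0.1, training=False, rng=rng))   # unchanged
```

The rescaling keeps the expected value of each activation the same in both modes, which is why the layer can simply be switched off at inference.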
Output sampling
After the final Transformer block, a linear projection maps the final embedding to vocabulary size (50,257 tokens in GPT-2), producing logits. Softmax converts the logits to probabilities. The next token is then sampled from this distribution, which can be shaped by:
- Temperature: divides logits before softmax; <1 sharpens, >1 flattens the distribution
- Top-k: restricts sampling to the k highest-probability tokens
- Top-p (nucleus): restricts to the smallest set of tokens whose cumulative probability exceeds p
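The three controls above can be combined in one small sampler. This is a sketch under stated assumptions: the function names are hypothetical, the filters are applied in the order temperature, top-k, top-p with renormalisation after top-k, and the top-p cut-off keeps tokens until the cumulative probability reaches or exceeds p. Real libraries may differ in these details.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def keep_only(probs, indices):
    """Zero every entry not in `indices` and renormalise the rest."""
    total = sum(probs[i] for i in indices)
    kept = set(indices)
    return [p / total if i in kept else 0.0 for i, p in enumerate(probs)]

def filtered_probs(logits, temperature=1.0, top_k=None, top_p=None):
    probs = softmax([l / temperature for l in logits])    # temperature must be > 0
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]                             # k most probable tokens
        probs = keep_only(probs, order)
    if top_p is not None:
        nucleus, cumulative = [], 0.0
        for i in order:                                   # most probable first
            nucleus.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        probs = keep_only(probs, nucleus)
    return probs

def sample_token(logits, rng, **controls):
    probs = filtered_probs(logits, **controls)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.0, -1.0]                            # toy 4-token vocabulary
rng = random.Random(0)
print(filtered_probs(logits, temperature=0.5))            # sharper than temperature 1
print(filtered_probs(logits, top_k=2))                    # only two non-zero entries
print(sample_token(logits, rng, temperature=0.8, top_k=3, top_p=0.9))
```

Setting `top_k=1` reduces the sampler to greedy decoding, since only the single most probable token survives the filter.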
Scope beyond language
Transformers now power:
- Audio generation (speech and music models)
- Image recognition (Vision Transformers / ViT)
- Protein structure prediction (AlphaFold2)
- Game-playing agents
Related topics
- — the core self-attention operation inside each block
- — how raw text is split into tokens before embedding
- — the vector representations assigned to each token
- — curriculum covering Transformers from mathematical first principles
- LLM wiki — knowledge-base pattern built on top of Transformer-based models
Resources
- 2026-06-23 ◦ Transformer Explainer (Polo Club, Georgia Tech) — interactive browser-based walkthrough of GPT-2 covering tokenisation, embeddings, Q/K/V matrices, 12-head masked self-attention, MLP, layer norm, residual connections, and temperature/top-k/top-p sampling
- 2026-06-23 ◦ How generative AI works (FT interactive) — visual explainer on tokens, embeddings, and attention; paywalled, to read