Overview

The Transformer is a neural network architecture introduced in the 2017 paper “Attention Is All You Need” that has become the dominant architecture for deep learning across text, audio, image, and other modalities. Text-generative Transformers operate on the principle of next-token prediction: given an input prompt, the model predicts a probability distribution over the next token (a word or sub-word). The core innovation is the self-attention mechanism, which lets every token draw on every other token in the sequence and so captures long-range dependencies more effectively than recurrent architectures such as RNNs and LSTMs.

Three key components

Every text-generative Transformer has three major stages:

  1. Embedding: input text is broken into tokens, each converted to a numerical vector (an embedding) augmented with positional encoding
  2. Transformer blocks: stacked layers, each containing a multi-head self-attention sub-layer and an MLP sub-layer; representations deepen with each layer
  3. Output probabilities: a linear projection and softmax layer convert the final embeddings into a probability distribution over the vocabulary for next-token sampling
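
The embedding stage can be sketched in a few lines. This is a minimal illustration, not GPT-2's actual implementation: the toy vocabulary, the whitespace tokenizer, the 4-dimensional width and the random tables are all assumptions chosen for readability (a real model uses a sub-word tokenizer and learned tables).

```python
import random

random.seed(0)
D_MODEL = 4   # toy embedding width (assumption)
MAX_LEN = 8   # toy maximum sequence length (assumption)
VOCAB = {"data": 0, "visualization": 1, "empowers": 2, "users": 3}  # hypothetical vocabulary


def random_table(rows, cols):
    """Stand-in for a learned lookup table."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


token_table = random_table(len(VOCAB), D_MODEL)   # one row per token id
position_table = random_table(MAX_LEN, D_MODEL)   # one row per position


def embed(text):
    """Tokenize on whitespace, then add token vector and position vector."""
    token_ids = [VOCAB[word] for word in text.lower().split()]
    return [
        [t + p for t, p in zip(token_table[tid], position_table[pos])]
        for pos, tid in enumerate(token_ids)
    ]


vectors = embed("Data visualization empowers users")
print(len(vectors), len(vectors[0]))  # number of tokens, vector width
```

Because the position vector is added in, the same token at two different positions ends up with two different input vectors, which is how the model can tell word order apart.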

Transformer block internals

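Each block contains, in order: layer normalisation, multi-head self-attention and a residual connection, followed by layer normalisation, the MLP and a second residual connection.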

GPT-2 (small) stacks 12 such blocks and has 124 million parameters total.
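
The 124 million figure can be checked with some arithmetic. The block count and the 50,257-token vocabulary come from this text; the model width of 768, the context length of 1,024, the 4x MLP expansion, the bias terms and the shared input/output embedding matrix are assumptions about GPT-2 (small) that are not stated here.

```python
# From the text: 12 blocks, vocabulary of 50,257 tokens.
N_BLOCKS, VOCAB_SIZE = 12, 50_257
# Assumed, not stated in the text: width, context length, 4x MLP expansion.
D_MODEL, CONTEXT = 768, 1024
D_MLP = 4 * D_MODEL


def linear(n_in, n_out):
    """Parameter count of a dense layer with a bias vector."""
    return n_in * n_out + n_out


embeddings = VOCAB_SIZE * D_MODEL + CONTEXT * D_MODEL  # token + position tables
layer_norm = 2 * D_MODEL                               # scale and shift vectors
attention = linear(D_MODEL, 3 * D_MODEL) + linear(D_MODEL, D_MODEL)  # Q, K, V + output
mlp = linear(D_MODEL, D_MLP) + linear(D_MLP, D_MODEL)  # expand + project back
block = 2 * layer_norm + attention + mlp

# One extra layer norm after the last block; the output projection is assumed
# to reuse the token table, so it adds no parameters.
total = embeddings + N_BLOCKS * block + layer_norm
print(f"{total:,}")  # 124,439,808
```

Under these assumptions roughly two thirds of the parameters sit in the 12 blocks, and most of the rest in the token embedding table.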

MLP sub-layer

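The MLP processes each token independently (no cross-token communication). It projects each token’s vector into a wider hidden space, applies a non-linearity, and projects the result back to the original width.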

While attention routes information between tokens, the MLP refines each token’s representation within the expanded space, encoding factual knowledge and higher-order patterns.
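
A minimal sketch of such a per-token MLP follows. The toy sizes, the 4x expansion factor, the GELU activation (in its tanh approximation) and the random weights are assumptions for illustration; the helper names are hypothetical.

```python
import math
import random

random.seed(0)
D_MODEL, D_HIDDEN = 4, 16  # toy sizes; the 4x expansion is an assumption


def gelu(x):
    """tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def make_linear(n_in, n_out):
    """Random weights stand in for learned ones."""
    weights = [[random.gauss(0, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def apply_linear(layer, vec):
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


expand = make_linear(D_MODEL, D_HIDDEN)    # D_MODEL -> D_HIDDEN
project = make_linear(D_HIDDEN, D_MODEL)   # D_HIDDEN -> D_MODEL


def mlp_token(vec):
    hidden = [gelu(h) for h in apply_linear(expand, vec)]
    return apply_linear(project, hidden)


def mlp(sequence):
    # The same weights are applied to every token separately,
    # so no information moves between positions here.
    return [mlp_token(vec) for vec in sequence]


out = mlp([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
```

Changing the second token leaves the first token's output untouched, which is the "no cross-token communication" property in practice.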

Auxiliary features

Residual connections

Shortcuts that add a layer’s input directly to its output, allowing gradients to flow through deep stacks without vanishing. First introduced by ResNet (2015); used twice per Transformer block in GPT-2.
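
To see why the shortcut helps, the sketch below compares the gradient of the output with respect to the input through a deep stack with and without residual connections. The scalar "layer" `0.5 * tanh(x)`, the depth of 30 and the starting value are hypothetical choices made only to keep the chain rule easy to follow.

```python
import math

DEPTH = 30
WEIGHT = 0.5  # hypothetical per-layer weight, so that 0 <= f'(x) <= 0.5


def layer(x):
    return WEIGHT * math.tanh(x)


def layer_grad(x):
    return WEIGHT * (1.0 - math.tanh(x) ** 2)


def input_gradient(x, residual):
    """d(output)/d(input) through DEPTH stacked layers, via the chain rule."""
    grad = 1.0
    for _ in range(DEPTH):
        local = layer_grad(x)
        if residual:
            grad *= 1.0 + local  # y = x + f(x)  =>  dy/dx = 1 + f'(x)
            x = x + layer(x)
        else:
            grad *= local        # y = f(x)      =>  dy/dx = f'(x)
            x = layer(x)
    return grad


plain = input_gradient(1.0, residual=False)
skip = input_gradient(1.0, residual=True)
print(f"without residuals: {plain:.3e}, with residuals: {skip:.3e}")
```

Without the shortcut every factor is at most 0.5, so the product collapses towards zero; with it every factor is at least 1, so the gradient cannot vanish.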

Layer normalisation

Normalises activations across the feature dimension to keep mean ≈ 0 and variance ≈ 1 at each step. Applied before both the attention and MLP sub-layers (“pre-norm” placement). Stabilises training and improves convergence speed.
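
The normalisation and its pre-norm placement can be sketched as follows. The function names, the `eps` value and the identity defaults for the scale (`gamma`) and shift (`beta`) are assumptions for illustration.

```python
import math


def layer_norm(vec, gamma=None, beta=None, eps=1e-5):
    """Normalise one token's feature vector to mean 0 and variance 1,
    then apply a scale (gamma) and shift (beta), learned in a real model."""
    n = len(vec)
    mean = sum(vec) / n
    var = sum((v - mean) ** 2 for v in vec) / n
    gamma = [1.0] * n if gamma is None else gamma
    beta = [0.0] * n if beta is None else beta
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(vec, gamma, beta)]


def pre_norm_sublayer(vec, sublayer):
    """Pre-norm placement: normalise first, run the sub-layer,
    then add the untouched input back (the residual connection)."""
    return [x + y for x, y in zip(vec, sublayer(layer_norm(vec)))]


normed = layer_norm([2.0, 4.0, 6.0, 8.0])
```

In a block, `sublayer` would be the attention or MLP function; note that the residual path carries the un-normalised input.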

Dropout

Randomly zeroes a fraction of activations during training to prevent overfitting; deactivated at inference, which effectively ensembles the sub-networks sampled during training.
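
A minimal sketch of dropout on a single activation vector is shown below. It uses the common "inverted" variant, which rescales the surviving values during training so that nothing needs to change at inference; the function signature and the default rate of 0.1 are assumptions.

```python
import random


def dropout(vec, p=0.1, training=True, rng=random):
    """Inverted dropout: zero each value with probability p during training
    and scale the survivors by 1 / (1 - p); a no-op at inference."""
    if not training or p == 0.0:
        return list(vec)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in vec]


activations = [1.0] * 1000
train_out = dropout(activations, p=0.5, rng=random.Random(0))
infer_out = dropout(activations, p=0.5, training=False)
```

With `p=0.5` roughly half of the training-time values become 0 and the rest become 2.0, so the expected value of each entry stays at 1.0.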

Output sampling

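After the final Transformer block, a linear projection maps each token’s final representation to vocabulary size (50,257 tokens in GPT-2), producing logits. Softmax converts the logits to probabilities. The next token is then sampled from this distribution, commonly after temperature scaling or top-k / top-p filtering.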
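
The logits-to-token step can be sketched as below. Temperature scaling and top-k filtering are shown as two common sampling controls; the function names and the three-token example are assumptions, and a real vocabulary would have tens of thousands of entries.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Sample a token id, optionally restricted to the top_k highest logits."""
    if top_k is not None:
        cutoff = sorted(logits, reverse=True)[top_k - 1]
        # Logits below the cutoff get probability 0 (ties at the cutoff are kept).
        logits = [l if l >= cutoff else float("-inf") for l in logits]
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
greedy = sample_next_token(logits, top_k=1)  # with top_k=1 only the largest logit survives
```

Setting `top_k=1` reduces sampling to picking the most probable token, while a higher temperature flattens the distribution and makes less likely tokens more competitive.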

Scope beyond language

Transformers now power models well beyond language, including the audio and image modalities noted in the overview.

Resources