Language Modeling

A chronological account of how machines learned to model language.

Part I · The Statistical Era

The Counting Approach
Shannon entropy, n-gram models, perplexity, and Kneser-Ney smoothing

Distributed Representations
From co-occurrences to learned embeddings: Word2Vec, GloVe, and the geometry of meaning

Sequence Memory
RNNs, LSTMs, the vanishing gradient, and the first contextual representations

Part II · The Architecture Revolution

The Attention Mechanism
Bahdanau attention, seq2seq, and the soft database lookup that changed everything

Attention Is All You Need
Self-attention, multi-head attention, positional encoding, and the Transformer

Pre-training and the Two Paradigms
GPT vs. BERT, causal vs. masked language modeling, and the transfer learning recipe

Part III · Scale, Emergence, and Alignment

Emergence and In-Context Learning
Scaling laws, GPT-3, few-shot learning, and capabilities that appear discontinuously

Alignment and Instruction Following
RLHF, InstructGPT, Constitutional AI, and DPO: teaching models to be helpful

Efficiency, Long Context, and New Architectures
MoE, FlashAttention, Mamba, and the question of what comes after the Transformer

The Frontier
Chain-of-thought, multimodal integration, open models, and the open questions