Antonio Franca

Language Modeling

A chronological account of how machines learned to model language.

Part I · The Statistical Era
01
The Counting Approach
Shannon entropy, n-gram models, perplexity, and Kneser-Ney smoothing
1948 – 2000
02
Distributed Representations
From co-occurrences to learned embeddings: Word2Vec, GloVe, and the geometry of meaning
2003 – 2014
03
Sequence Memory
RNNs, LSTMs, the vanishing gradient, and the first contextual representations
2013 – 2017
Part II · The Architecture Revolution
04
The Attention Mechanism
Bahdanau attention, seq2seq, and the soft database lookup that changed everything
2015 – 2017
05
Attention Is All You Need
Self-attention, multi-head attention, positional encoding, and the Transformer
2017
06
Pre-training and the Two Paradigms
GPT vs. BERT, causal vs. masked language modeling, and the transfer learning recipe
2018 – 2020
Part III · Scale, Emergence, and Alignment
07
Emergence and In-Context Learning
Scaling laws, GPT-3, few-shot learning, and capabilities that appear discontinuously
2020 – 2021
08
Alignment and Instruction Following
RLHF, InstructGPT, Constitutional AI, and DPO: teaching models to be helpful
2021 – 2023
09
Efficiency, Long Context, and New Architectures
MoE, FlashAttention, Mamba, and the question of what comes after the Transformer
2022 – 2025
10
The Frontier
Chain-of-thought, multimodal integration, open models, and the open questions
2023 –