Antonio Franca
Language Modeling
A chronological account of how machines learned to model language.
Part I · The Statistical Era
01
The Counting Approach
Shannon entropy, n-gram models, perplexity, and Kneser-Ney smoothing
1948 – 2000
02
Distributed Representations
From co-occurrences to learned embeddings — Word2Vec, GloVe, and the geometry of meaning
2003 – 2014
03
Sequence Memory
RNNs, LSTMs, the vanishing gradient, and the first contextual representations
2013 – 2017
Part II · The Architecture Revolution
04
The Attention Mechanism
Bahdanau attention, seq2seq, and the soft database lookup that changed everything
2015 – 2017
05
Attention Is All You Need
Self-attention, multi-head attention, positional encoding, and the Transformer
2017
06
Pre-training and the Two Paradigms
GPT vs. BERT, causal vs. masked language modeling, and the transfer learning recipe
2018 – 2020
Part III · Scale, Emergence, and Alignment
07
Emergence and In-Context Learning
Scaling laws, GPT-3, few-shot learning, and capabilities that appear discontinuously
2020 – 2021
08
Alignment and Instruction Following
RLHF, InstructGPT, Constitutional AI, and DPO — teaching models to be helpful
2021 – 2023
09
Efficiency, Long Context, and New Architectures
MoE, FlashAttention, Mamba, and the question of what comes after the Transformer
2022 – 2025
10
The Frontier
Chain-of-thought, multimodal integration, open models, and the open questions
2023 – present