Antonio Franca
Reinforcement Learning
From Bellman equations to modern alignment — the historical arc of sequential decision making.
Part I · The Classical Foundation
01
The Bellman Principle
State, action, reward, return, and the Bellman equations as a recursive decomposition of value
1957 – 1972
02
Dynamic Programming
Policy evaluation, policy iteration, value iteration, and their exact-but-intractable cost
1957 – 1984
03
Temporal Difference Learning
TD(0), TD(λ), the bootstrapping idea, SARSA and Q-learning
1988 – 1994
04
Tabular Q-Learning and Convergence
The Q-learning update, proof of convergence, and the tabular regime's limits
1989 – 1995
Part II · The Deep Learning Fusion
05
Deep Q-Networks
Experience replay, target networks, the Atari result, and the function approximation pathology
2013 – 2015
06
The Policy Gradient Theorem
REINFORCE, score function estimators, baseline subtraction, variance reduction
1992 – 2000
07
Actor-Critic and Advantage Estimation
A3C/A2C, GAE, and why bootstrapped baselines win
2016 – 2018
Part III · Stable and Scalable Optimization
08
Natural Policy Gradients and TRPO
Fisher information, trust regions, conjugate gradients, and the monotonic improvement guarantee
2001 – 2015
09
Proximal Policy Optimization
The clipped surrogate, entropy bonus, and why PPO dominates practice
2017
10
Maximum Entropy RL and SAC
Soft Bellman equations, temperature tuning, SAC for continuous control
2017 – 2018
Part IV · RL for Language Models
11
Reward Modeling and RLHF
Bradley-Terry model, preference datasets, the InstructGPT recipe
2017 – 2022
12
PPO for Token Generation
Token-level MDP, KL penalty, reward hacking and the alignment tax
2022 – 2023
13
GRPO — Group Relative Policy Optimization
Deriving GRPO from first principles, group baselines, and reasoning incentives
2024 – 2025
14
DPO and Implicit Reward Methods
Bypassing the RL loop, DAPO, and the convergence between offline RL and preference learning
2023 – 2025