Antonio Franca

Reinforcement Learning

From Bellman equations to modern alignment — the historical arc of sequential decision making.

Part I · The Classical Foundation
01
The Bellman Principle
State, action, reward, return, and the Bellman equations as a recursive decomposition of value
1957 – 1972
02
Dynamic Programming
Policy evaluation, policy iteration, value iteration, and their exact-but-intractable cost
1957 – 1984
03
Temporal Difference Learning
TD(0), TD(λ), the bootstrapping idea, SARSA and Q-learning
1988 – 1994
04
Tabular Q-Learning and Convergence
The Q-learning update, proof of convergence, and the tabular regime's limits
1989 – 1995
Part II · The Deep Learning Fusion
05
Deep Q-Networks
Experience replay, target networks, the Atari result, and the function approximation pathology
2013 – 2015
06
The Policy Gradient Theorem
REINFORCE, score function estimators, baseline subtraction, variance reduction
1992 – 2000
07
Actor-Critic and Advantage Estimation
A3C/A2C, GAE, and why bootstrapped baselines win
2016 – 2018
Part III · Stable and Scalable Optimization
08
Natural Policy Gradients and TRPO
Fisher information, trust regions, conjugate gradients, and the monotone improvement guarantee
2001 – 2015
09
Proximal Policy Optimization
The clipped surrogate, entropy bonus, and why PPO dominates practice
2017
10
Maximum Entropy RL and SAC
Soft Bellman equations, temperature tuning, SAC for continuous control
2017 – 2018
Part IV · RL for Language Models
11
Reward Modeling and RLHF
Bradley-Terry model, preference datasets, the InstructGPT recipe
2017 – 2022
12
PPO for Token Generation
Token-level MDP, KL penalty, reward hacking and the alignment tax
2022 – 2023
13
GRPO: Group Relative Policy Optimization
Deriving GRPO from first principles, group baselines, and reasoning incentives
2024 – 2025
14
DPO and Implicit Reward Methods
Bypassing the RL loop, DAPO, and the convergence between offline RL and preference learning
2023 – 2025