Reinforcement Learning

From Bellman equations to modern alignment — the historical arc of sequential decision making.

Part I · The Classical Foundation

The Bellman Principle: State, action, reward, return, and the Bellman equations as a recursive decomposition of value

Dynamic Programming: Policy evaluation, policy iteration, value iteration, and their exact-but-intractable cost

Temporal Difference Learning: TD(0), TD(λ), the bootstrapping idea, SARSA and Q-learning

Tabular Q-Learning and Convergence: The Q-learning update, proof of convergence, and the tabular regime's limits

Part II · The Deep Learning Fusion

Deep Q-Networks: Experience replay, target networks, the Atari result, and the function approximation pathology

The Policy Gradient Theorem: REINFORCE, score function estimators, baseline subtraction, variance reduction

Actor-Critic and Advantage Estimation: A3C/A2C, GAE, and why bootstrapped baselines win

Part III · Stable and Scalable Optimization

Natural Policy Gradients and TRPO: Fisher information, trust regions, conjugate gradients, and the monotonic improvement guarantee

Proximal Policy Optimization: The clipped surrogate, entropy bonus, and why PPO dominates practice

Maximum Entropy RL and SAC: Soft Bellman equations, temperature tuning, SAC for continuous control

Part IV · RL for Language Models

Reward Modeling and RLHF: Bradley-Terry model, preference datasets, the InstructGPT recipe

PPO for Token Generation: Token-level MDP, KL penalty, reward hacking and the alignment tax

GRPO (Group Relative Policy Optimization): Deriving GRPO from first principles, group baselines, and reasoning incentives

DPO and Implicit Reward Methods: Bypassing the RL loop, DAPO, and the convergence between offline RL and preference learning