This post aims to provide an intuitive perspective on RL for LLMs. We aim to keep things simple: no math or implementation details. Refer to the appendix at the end for a more detailed and principled treatment.
There are many amazing things about the current paradigm, like the elegance of GRPO advantage calculations, or rubrics and LLMs-as-judges. We shall treat those some other day.
At a meta-level, it's perhaps useful to trace back how we got here and question why we do things the way we do. Especially for relatively new entrants like me who weren't around for "the first wave of RL", or who lack a principled framework for understanding SSL vs RL. Having no formal baggage helps :)
We may conclude on this note to tie things together:
You are tasked with learning a distribution, or a policy, that predicts the next token well enough. It is unlikely that this policy alone would explicitly elicit abilities that act in accordance with a more useful and relevant policy, e.g. getting an answer right or being aligned with human preferences. Let alone even more specific abilities, like "outputting a sentence in reverse, character-level".
The good thing is that a pre-trained base has extremely strong priors that we can bootstrap these specific abilities out of.
Self-supervised learning is fundamentally distribution learning. During pre-training, the model builds an approximation of the statistical patterns in its training distribution - call this target distribution T. When given input text, the model contextualizes where that input sits within its learned distribution and generates continuations that follow the natural statistical flow of that space. This is what naive next-token prediction accomplishes: it teaches the model to follow the probability gradients toward the most statistically likely completions. It so happens that this naive self-supervised learning stage imbues the model with many implicit "emergent" abilities. This is a property that emerges from the nature of the distribution being learnt. Perhaps that deserves a post of its own. It's an obvious insight but an elegant one.
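To make "following the probability gradients" concrete, here's a toy maximum-likelihood model of a tiny corpus - a bigram counter standing in for the pre-trained base. The corpus and function names are made up for illustration; the point is just that "generation" is nothing more than walking the learned statistics:

```python
from collections import Counter, defaultdict

# Tiny corpus standing in for the target distribution T.
corpus = "the cat sat on the mat . the cat ate the fish .".split()

# Maximum-likelihood bigram model: raw counts normalised into
# next-token probabilities.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_token_dist(prev):
    c = counts[prev]
    total = sum(c.values())
    return {tok: n / total for tok, n in c.items()}

def greedy_continuation(start, steps=4):
    # Repeatedly pick the statistically likeliest next token.
    out = [start]
    for _ in range(steps):
        dist = next_token_dist(out[-1])
        out.append(max(dist, key=dist.get))
    return out

print(next_token_dist("the"))      # "cat" gets half the mass
print(greedy_continuation("the"))  # a fluent-ish walk through the corpus
```

Nothing here "knows" anything about cats; it just follows the flow of the corpus statistics, which is the whole point of the analogy.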
The major gripe, then, is that increasing the likelihood of a target distribution (internet-scale text corpora) means little for developing "useful" distributions. A simple illustration: if the LLM is a map and you want to elicit a specific feature, say a helpful response to a simple factual query like "What is the capital of France?", there is little chance that query elicits a factual and helpful answer from a pre-trained base. The base will simply predict the tokens likeliest to continue that sentence, however it appears within the training data. It is post-training stages like instruction tuning or RLHF that nudge the distribution so the natural flow of information through the LLM leads to something more than just the likeliest next token.
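One way to picture "nudging the distribution": the KL-regularised RLHF objective has a known optimal form where the base distribution is reweighted by the exponentiated reward. Here is a toy sketch of that tilt on made-up numbers (the base probabilities and reward scores are invented for illustration - a base model really might continue a quiz question with another quiz question):

```python
import math

# Made-up base next-token distribution after "What is the capital of France?"
# A base model often continues with more quiz-like text rather than an answer.
base = {" What": 0.5, " Paris": 0.2, " The": 0.3}

# Toy preference scores: post-training rewards actually answering.
reward = {" What": -1.0, " Paris": 2.0, " The": 0.0}

def nudge(base, reward, beta=1.0):
    # KL-regularised tilt: p(y) is proportional to base(y) * exp(reward(y) / beta).
    w = {y: p * math.exp(reward[y] / beta) for y, p in base.items()}
    z = sum(w.values())
    return {y: v / z for y, v in w.items()}

print(nudge(base, reward))  # most of the mass moves onto " Paris"
```

The base distribution isn't replaced - every candidate keeps nonzero probability - it's just tilted toward what the preference model likes, which is exactly the "nudge" in the prose above.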
Learning to predict the next token on a grade-school math textbook does not guarantee a test-time ability to get questions at that level right. The good thing is that some latent abilities or representations emerge, and the question is how to develop generalised methods that elicit them well.
Post-training methods like RL don't replace this distributional knowledge; they redirect it. Instead of following the natural gradients of T toward high-probability regions, RL trains the model to follow gradients toward different regions that score higher on human preference models. This creates an interesting tension: the model's core capacity comes from learning T through massive pre-training compute, but its usefulness requires systematically steering away from T's natural flow patterns toward completions that better serve human goals.
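The redirection can be sketched with the simplest policy-gradient method, REINFORCE, on a two-token toy problem. Everything here is invented for illustration: the "base" logits prefer the merely likely token, the stand-in reward model prefers the helpful one, and gradient ascent on expected reward steers probability mass across (real RLHF adds a KL penalty to the base, as in the tilt above, plus baselines or group-relative advantages):

```python
import math
import random

random.seed(0)

tokens = ["likely", "helpful"]
logits = [2.0, 0.0]  # the base policy strongly prefers the merely-likely token

def softmax(ls):
    m = max(ls)
    exps = [math.exp(x - m) for x in ls]
    z = sum(exps)
    return [e / z for e in exps]

def reward(tok):
    # Stand-in preference model: rewards the helpful token.
    return 1.0 if tok == "helpful" else 0.0

lr = 0.5
for _ in range(200):
    probs = softmax(logits)
    i = random.choices(range(len(tokens)), weights=probs)[0]
    r = reward(tokens[i])
    # REINFORCE update: d log pi(i) / d logit_j = 1[i == j] - probs[j]
    for j in range(len(logits)):
        logits[j] += lr * r * ((1.0 if j == i else 0.0) - probs[j])

print(softmax(logits))  # mass has shifted toward "helpful"
```

Note the tension from the prose is visible even here: the update can only reinforce tokens the base policy already samples, so the pre-trained distribution is the raw material being steered, not replaced.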