Hiranmay Darshane

Some intuitions for RL post-training on self-supervised base models

August 2025
Meta: Quick thoughts off the top of my head, mostly just simple and straightforward intuitions. (Written and shared privately: August 2025. Released: February 2026.) The surrogate objective of the post is to convince people that RL for LLMs goes much broader than just CoT/reasoning, which was maybe a happy accident?

The post aims to provide an intuitive perspective on RL for LLMs. We keep things simple, with no maths or implementation details. One may refer to the appendix at the end for a more detailed and principled treatment.

There are so many amazing things about the current paradigm like the elegance of GRPO advantage calculations or Rubrics and LLMs-as-judges. We shall treat those some other day.

At a meta-level, it's perhaps useful to trace back how we got here and question why we do things the way we do. Especially for relatively new entrants like me who weren't around for "the first wave of RL", or who lack principled intuitions for SSL vs RL. Having no formal baggage helps :)

Some unordered perspectives

  1. Self-supervised learning worked amazingly well. The specific form of it that worked was auto-regressive next-token prediction.
  2. The model learns to predict the next token in the sequence. In doing so, it learns something more fundamental about the target distribution.
  3. These "fundamental learnings" manifest as latent features that need better methods like instruction tuning and RLHF to be elicited.
  4. In short, cross-entropy reduction was a good enough fundamental objective to optimise for, and guarantees some downstream (latent) abilities on more meaningful objectives ("Can it do task-xyz that matters to me in the real world?")
  5. Conventionally, RL works on exact objectives. "Does it get this answer right?", "Does it solve the puzzle?", "Can it navigate its way through this maze?". This is in obvious contrast to the supreme generality of self-supervised learning.
  6. We want reasonably finite search spaces for RL to work. If nothing else, we want useful priors.
  7. The way to reconcile these contrasts is this: You come with a vision of a garden with specific flowers and plants you aim to grow. SSL is akin to an undirected pollination + fertilisation stage. Internet-scale pre-training means you have many pollination vectors.
  8. This means that the flowers you want may grow spontaneously. But more importantly, the ground is fertile, and the previous pollination means there is an ecosystem in which to grow other plants.
  9. Internet-scale SSL is a good pollinator. It will spontaneously pollinate much of value. If not, it will at least bootstrap an ecosystem out of nothing.
  10. What does it mean for us in practice? It means that if you have a strong base model, and you have verifiable objectives, RL will work and your model can "hillclimb" to be competent w.r.t. that objective.
  11. This means once you have a strong pre-trained model with a very general data distribution, you can meaningfully augment the output distribution in whatever direction you want.
  12. Example: I can train a small Qwen on the task of outputting a sentence character-wise, in reverse.
  13. CoT-like reasoning is but the result of such objectives. We start shaping the output distribution of our reasonably competent base to do well on maths and similar verifiable STEM problems.
  14. The reward signal is verifiable. It will be backpropagated. The model will hillclimb the objective. And the plants that you want will grow.
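The reversal example above (item 12) is the simplest possible case of a verifiable objective: the reward can be computed exactly from the prompt alone, no judge or preference model needed. A minimal sketch of such a verifier (function names and sample completions are my own, purely illustrative):

```python
def reversal_reward(sentence: str, completion: str) -> float:
    """Binary verifiable reward: 1.0 iff the completion is the
    input sentence reversed character-by-character, else 0.0."""
    return 1.0 if completion.strip() == sentence[::-1] else 0.0

# Score a group of sampled completions for one prompt. A ranking
# like this is all a policy-gradient method needs to hillclimb.
completions = ["dlrow olleh", "hello world", "dlrowolleh"]
rewards = [reversal_reward("hello world", c) for c in completions]
# only the first completion is the exact reversal, so it alone earns 1.0
```

Anything with this shape — a cheap, deterministic check mapping a completion to a score — can serve as the objective the model hillclimbs.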

We may conclude on this note to tie things together:

You are tasked with learning a distribution, a policy that learns to predict the next token well enough. It is obviously unlikely that this policy alone would explicitly elicit abilities that act in accordance with a more useful and relevant policy, e.g. getting an answer right or being aligned with human preferences. Or even more specific abilities like "outputting a sentence in reverse, character-level".

The good thing is that a pre-trained base has extremely strong priors that we can bootstrap these specific abilities out of.


Appendix

Self-supervised learning is fundamentally distribution learning. During pre-training, the model builds an approximation of the statistical patterns in its training distribution - call this target distribution T. When given input text, the model contextualises where that input sits within its learned distribution and generates continuations that follow the natural statistical flow of that space. This is what naive next-token prediction accomplishes: it teaches the model to follow the probability gradients toward the most statistically likely completions. It so happens that this naive self-supervised learning stage imbues the model with many implicit "emergent" abilities. This is a property that emerges from the nature of the distribution being learnt. Perhaps that deserves a post of its own. It's an obvious insight but an elegant one.
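Concretely, "learning T" means minimising the average negative log-probability the model assigns to each actual next token. A toy sketch of that objective — the vocabulary, sequence, and probability table below are made up for illustration; a real model would be a transformer conditioned on the full context:

```python
import math

sequence = ["the", "cat", "sat"]

def model_probs(context):
    # Stand-in for p_theta(next | context): a hand-written lookup
    # table playing the role of the model's output distribution.
    table = {
        ("the",): {"the": 0.1, "cat": 0.6, "sat": 0.2, "mat": 0.1},
        ("the", "cat"): {"the": 0.05, "cat": 0.05, "sat": 0.8, "mat": 0.1},
    }
    return table[tuple(context)]

# Cross-entropy of the model against the actual next token,
# position by position, averaged over predicted positions.
loss = 0.0
for t in range(1, len(sequence)):
    loss += -math.log(model_probs(sequence[:t])[sequence[t]])
loss /= len(sequence) - 1
```

Driving this number down is the entire pre-training objective; everything else the model can do is a by-product of what minimising it over internet-scale text forces the weights to represent.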

The major gripe, then, is that increasing the statistical likelihood of a target distribution (internet-scale text corpora) means little for developing "useful" distributions. A simple illustration: the LLM is a map and you want to elicit a specific feature, say a helpful answer to a simple factual query like "What is the capital of France?". There is little chance this query elicits a factual and helpful answer from a pre-trained base. The pre-trained base will simply predict the tokens likeliest to continue that sentence, however it appears within the training data. It is post-training stages like instruction tuning or RLHF that nudge the distribution so the natural flow of information through the LLM leads to something more than just the likeliest next token.

Learning to predict the next token on a grade-school maths textbook does not guarantee a test-time ability to get questions at that level right. The good thing is that some latent abilities or representations do emerge, and the question becomes one of developing generalised methods that elicit them well.

(Post)-training methods like RL don't replace this distributional knowledge but they redirect it. Instead of following the natural gradients of T toward high-probability regions, RL trains the model to follow gradients toward different regions that score higher on human preference models. This creates an interesting tension: the model's core capacity comes from learning T through massive pre-training compute, but its usefulness requires systematically steering away from T's natural flow patterns toward completions that better serve human goals.
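One standard way to make "redirect, don't replace" literal is the KL-regularised RL objective, whose known closed-form optimum is the base distribution re-weighted by exp(reward / beta) and renormalised. The sketch below uses made-up completions and numbers; `tilt` and `beta` are my own names:

```python
import math

def tilt(base: dict, reward: dict, beta: float) -> dict:
    """Closed-form optimum of KL-regularised reward maximisation:
    exponentially tilt the base distribution by the reward."""
    weights = {y: p * math.exp(reward[y] / beta) for y, p in base.items()}
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}

# Illustrative base distribution over three kinds of completion.
base = {"helpful": 0.2, "rambling": 0.7, "refusal": 0.1}
reward = {"helpful": 1.0, "rambling": 0.0, "refusal": 0.0}
tilted = tilt(base, reward, beta=0.5)
# mass shifts toward "helpful", but never leaves the base's support
```

The tension in the text is visible here: the tilted policy can only amplify what the base already assigns probability to, which is exactly why the pre-trained distribution T remains the source of capability even as RL steers away from its natural flow.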