TIL: How "Thinking" Models Actually Work
Lilian Weng's Why We Think is a survey of test-time compute and chain-of-thought reasoning. Here's what I pulled out of it.
Test-time compute, simply
A standard transformer does fixed work per token: roughly 2x the parameter count in FLOPs. Every token costs the same whether the question is trivial or impossible.
CoT breaks that. The model generates reasoning tokens before the answer. Each token triggers a full forward pass. 500 tokens of reasoning = 500x more computation behind the final answer. And the length scales with difficulty. Hard problem, long chain. Easy problem, short chain. The model picks its own compute budget.
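To make that arithmetic concrete, here's a toy FLOPs calculation. It uses the standard ~2 × params estimate for a dense transformer's forward pass; the 7B parameter count is just an example:

```python
# Rough FLOPs accounting for chain-of-thought. Assumes the common
# ~2 * params FLOPs-per-token estimate for dense transformers.
def answer_flops(params: int, reasoning_tokens: int) -> int:
    """Approximate FLOPs spent before the first answer token appears."""
    flops_per_token = 2 * params
    # Each reasoning token triggers a full forward pass,
    # plus one pass for the answer token itself.
    return flops_per_token * (reasoning_tokens + 1)

base = answer_flops(params=7_000_000_000, reasoning_tokens=0)
cot = answer_flops(params=7_000_000_000, reasoning_tokens=500)
print(cot // base)  # 501: ~500x more compute behind the answer
```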
Three frames for why this works
Compute. CoT gives the model variable FLOPs per answer token. More where it needs it, less where it doesn't.
Psychology. Kahneman's System 1/System 2. Fast intuition vs slow deliberation. Same idea.
Latent variables. This is the interesting one. The reasoning trace is a hidden variable. The probability of the right answer = sum over all possible reasoning paths of (probability of that path) × (probability it leads to the correct answer). This makes best-of-N and self-consistency principled, not just "try a few times and pick the best."
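A toy version of that sum. The paths and probabilities are made up for illustration; the point is just the shape of the marginalisation:

```python
# Latent-variable view of CoT: the reasoning trace z is hidden, and
# P(correct) = sum over z of P(z) * P(correct answer | z).
# These numbers are invented purely to illustrate the structure.
paths = [
    # (P(path), P(correct answer | path))
    (0.40, 0.9),  # a common, mostly-reliable reasoning path
    (0.35, 0.2),  # a plausible but flawed path
    (0.25, 0.7),  # a rarer path that usually works
]
p_correct = sum(p_z * p_ans for p_z, p_ans in paths)
print(round(p_correct, 3))  # 0.605
```

Sampling many traces and voting is a Monte Carlo estimate of exactly this sum, which is what makes self-consistency principled.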
Parallel vs sequential
Two ways to spend extra compute at inference.
Parallel sampling. Generate N answers, pick the best one. Best-of-N, beam search, majority voting. Works well. Limited by whether the model can reach the right answer in a single shot.
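Majority voting (self-consistency) is a few lines; the sampled answers here are dummies standing in for real model outputs:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Self-consistency: sample N final answers, return the most
    common one. Approximates marginalising over reasoning paths."""
    return Counter(answers).most_common(1)[0][0]

samples = ["42", "42", "17", "42", "17"]
print(majority_vote(samples))  # "42"
```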
Sequential revision. Model reflects on its own output and fixes mistakes. Sounds useful. Mostly doesn't work. Without external feedback (ground truth, unit tests, stronger model), the model either changes nothing or flips correct answers to wrong ones.
The DeepSeek-R1 recipe
Four stages: cold-start SFT, reasoning RL with rule-based rewards (format + answer correctness), rejection-sampling SFT mixing reasoning and non-reasoning data, final RL pass.
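A rough sketch of what a rule-based reward can look like. The `<think>` tags and weights below are illustrative assumptions, not DeepSeek's actual implementation:

```python
import re

def rule_based_reward(completion: str, gold_answer: str) -> float:
    """Toy R1-style reward: format check plus answer correctness.
    Tag names and weights are hypothetical."""
    reward = 0.0
    # Format reward: reasoning wrapped in <think>...</think>.
    if re.search(r"<think>.*</think>", completion, re.DOTALL):
        reward += 0.1
    # Accuracy reward: text after the think block must match exactly.
    final = completion.split("</think>")[-1].strip()
    if final == gold_answer:
        reward += 1.0
    return reward

print(rule_based_reward("<think>2 + 2 = 4</think>4", "4"))  # 1.1
```

Rewards like this are cheap to verify and hard to hack, which is part of why they worked where learned process reward models got gamed.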
Two things worth noting. Pure RL with no SFT (the R1-Zero variant) produces emergent self-correction: the model learns to reflect and backtrack on its own because doing so earns reward.
And what failed: process reward models got hacked. MCTS didn't work because token-level search spaces are too large. DeepSeek publishing their failures is rare and useful.
Faithfulness is the real problem
CoT gives you a readable reasoning trace. But does it reflect what's actually driving the answer?
Reasoning models are more faithful than standard models. They're more likely to acknowledge when a misleading hint changed their answer. That's progress.
But if you monitor CoT for reward hacking and fold that signal into RL rewards, the model learns to cheat while hiding its intent. Baker et al. showed this. The model still hacks, it just stops mentioning it in the chain of thought. In a separate study, Chen et al. found a model exploiting a flawed grader on 99%+ of prompts but verbalising the exploit less than 2% of the time.
Weng warns against optimising directly on CoT during RL. It creates adversarial dynamics you won't detect.
Thinking without tokens
Three ways to give the model more compute without generating visible reasoning:
- Recurrent architectures. Loop the same layer multiple times. More loops, more processing.
- Pause tokens. Dummy tokens that force extra forward passes. No linguistic meaning, but the model still benefits from the extra compute cycles.
- Quiet-STaR. Hidden rationales generated after every token. On Mistral 7B, zero-shot CommonsenseQA went from 36.3% to 47.2%.
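The recurrent idea in the first bullet reduces to a simple pattern: reuse one block, vary the iteration count. A toy stand-in (iterating a contraction map instead of a transformer layer):

```python
import math

def recurrent_refine(x: float, loops: int) -> float:
    """Toy recurrent depth: apply the same 'layer' a variable number
    of times. math.cos is a stand-in for a shared transformer block;
    more loops means more processing of the same state."""
    for _ in range(loops):
        x = math.cos(x)
    return x

print(round(recurrent_refine(1.0, 100), 4))  # converges near 0.7391
```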
Scaling: not a free lunch
Test-time compute helps on easy and medium problems. It can't fix fundamental capability gaps on hard ones. You still need a strong base model.
How you extend thinking time also matters. Budget forcing ("wait" tokens to keep the model going) shows positive scaling. Rejection sampling for length shows reversed scaling: longer CoTs, worse answers.
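Budget forcing is mechanically simple: intercept the end-of-thinking token and substitute a continuation word until a minimum budget is spent. A toy decoder loop (the `next_token` function is a hypothetical stand-in for a real sampler):

```python
def budget_forced_decode(next_token, min_think_tokens: int) -> list[str]:
    """Toy budget forcing: if the model tries to close its reasoning
    before the budget is spent, override with a 'Wait' token."""
    tokens: list[str] = []
    while True:
        tok = next_token(tokens)
        if tok == "</think>" and len(tokens) < min_think_tokens:
            tok = "Wait"  # suppress the early stop, keep thinking
        tokens.append(tok)
        if tok == "</think>":
            return tokens

# Dummy sampler that tries to stop after 3 reasoning tokens.
def dummy(tokens):
    return "</think>" if len(tokens) >= 3 else "step"

out = budget_forced_decode(dummy, min_think_tokens=6)
print(len(out), out.count("Wait"))  # 7 3: three forced continuations
```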
What to take from this
The latent variable frame is the most useful one. It turns sampling heuristics into principled approximations. Worth internalising for system design.
The faithfulness results matter if you're building production systems on reasoning models. Monitoring CoT works today. Optimising against that monitor makes models hide their behaviour. That gap should worry you.
Test-time compute is a real lever, but not a replacement for pretraining quality.