On Durable Objects, Orleans, and prior art for the agentic web
Zak Knill wrote a sharp post this week arguing that LLMs are exposing a gap in our standard cloud-native architecture.
TIL: Ads in AI chatbots are not just a UX problem
TIL from a paper on ads in AI chatbots that putting adverts inside an AI assistant is not the same
Agentic Architecture
Welcome to Middle Loop Engineering
Where engineering rigour goes now that AI writes the code
GPUs
How fast does it serve? Throughput, latency, and picking the right GPU
Part 2 of 2 on inference engineering for AI engineers.
LLM Scaling
Fitting LLMs on Self-Hosted GPUs
How much VRAM does your LLM need, and which GPU should you actually rent? A free calculator covering DeepSeek, Llama, and Mixtral on H100, B200, and A100.
Claude Code
The Harness Is the Product
Where does product quality live in an LLM-based system? A leaked source and a detailed postmortem, both from Anthropic in the last four weeks, make the answer unusually concrete.
How "Thinking" Models Actually Work
Lilian Weng's "Why We Think" is a survey of test-time compute and chain-of-thought reasoning. Here's what I pulled out of it.
Agentic Architecture
Harness Engineering: The Outer System That Makes Agents Reliable
Building a good harness is what separates a good agentic implementation from a great one.
We’re Being Too Loose With the Term “World Model”
I think we are still too loose with the phrase “world model”.
TIL: Quantisation
Quantisation is really a precision-allocation problem.
Claude Code
Write Skills Like Workstations, Not Prompts
Claude Code skills work best when you treat them as workstations, not prompts: folders with scripts, gotchas, templates, and progressive disclosure that manage the agent's attention budget at runtime.
How the Claude Code team designs agent tools
(Part of my Today I Learned series. Short posts on things that made me think.)
When Claude Code shipped, the
Make Claude Code Review Its Own Plans
(Part of my TIL series)
If you've used Claude Code's plan mode, you've probably
eval driven development
Ship Prompts Like Software: Regression Testing for LLMs
Because "it seemed fine when I tested it" is not a deployment strategy.
Part 4 of 4: Evaluation-Driven Development for LLM Systems
evals
Four Ways to Grade an LLM (Without Going Broke)
Your evaluation technique should match the question you're asking, not your ambition.
eval driven development
Your Golden Dataset Is Worth More Than Your Prompts
Most teams spend weeks perfecting prompts and minutes on evaluation data. That's backwards.
Part 2 of 4: Evaluation-Driven Development for LLM Systems
evals
Build LLM Evals You Can Trust
If five correct responses are enough to ship an LLM feature, what are you actually measuring: quality, or luck?
Part 1 of 4: Evaluation-Driven Development for LLM Systems
Stripe's coding agents: the walls matter more than the model
(Part of my Today I Learned series)
Stripe merges over 1,300 AI-written pull requests every week, and almost every