Build LLM Evals You Can Trust
If five correct responses are enough to ship an LLM feature, what are you actually measuring: quality, or luck?
Part 1 of 4: Evaluation-Driven Development for LLM Systems
Reasoning LLMs are wanderers rather than systematic explorers
It turns out that when we ask reasoning-capable models such as the latest LLMs (GPT-5 family, Claude Opus and
The boring secret to building better AI agents
Andrew Ng pointed out something interesting: the single biggest factor in how fast teams build AI agents isn't