The boring secret to building better AI agents

Andrew Ng pointed out something interesting: the single biggest factor in how fast teams build AI agents isn't using the latest tools or techniques. It's having a disciplined process for measuring performance (evals) and figuring out why things break (error analysis).

He compares it to how musicians don't just play a piece start to finish over and over. They find the tricky parts and practice those specifically. Or how you don't chase nutrition trends but actually look at your bloodwork to see what's wrong. The idea is simple but easy to forget when you're caught up in trying every new AI technique that goes viral on social media.

The tricky part with AI agents is that there are so many more ways things can go wrong compared to traditional machine learning. If you're building something to process financial invoices automatically, the agent could get the due date wrong, misread the amount or the currency, mix up addresses, or make the wrong API call. The output space is huge. Ng's approach is to build a quick prototype first, manually look at where it stumbles, and then create specific tests for those problem areas. Sometimes these are objective metrics you can code up; sometimes you need another LLM to judge the outputs. It's more iterative and messy than traditional ML, but that's the point. You need to see where it actually fails in practice before you know what to measure.
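To make that concrete, here's a minimal sketch of what a per-field eval for the invoice example might look like. The names (`InvoiceResult`, `objective_checks`, `llm_judge`) are hypothetical, not from Ng or the article: structured fields like dates and amounts get exact or tolerance-based checks you can code directly, while a free-form field like an address gets routed to an LLM judge, stubbed out here with a crude string comparison.

```python
# Hypothetical sketch of a field-level eval for an invoice-processing agent.
# Names and structure are assumptions for illustration, not a real library.

from dataclasses import dataclass


@dataclass
class InvoiceResult:
    due_date: str          # ISO date, e.g. "2024-07-01"
    amount: float
    currency: str          # ISO code, e.g. "EUR"
    billing_address: str   # free-form text


def objective_checks(expected: InvoiceResult, actual: InvoiceResult) -> dict:
    """Objective metrics: exact or tolerance-based comparison per field."""
    return {
        "due_date": expected.due_date == actual.due_date,
        "amount": abs(expected.amount - actual.amount) < 0.01,
        "currency": expected.currency == actual.currency,
    }


def llm_judge(expected: str, actual: str) -> bool:
    """Placeholder for an LLM-as-judge call on free-form fields.

    In practice this would prompt a model to decide whether the two strings
    describe the same address; here it's stubbed with a crude comparison.
    """
    return expected.strip().lower() == actual.strip().lower()


def eval_case(expected: InvoiceResult, actual: InvoiceResult) -> dict:
    """Combine coded checks with an LLM judge for the fuzzy field."""
    results = objective_checks(expected, actual)
    results["billing_address"] = llm_judge(
        expected.billing_address, actual.billing_address
    )
    return results


if __name__ == "__main__":
    expected = InvoiceResult("2024-07-01", 1250.00, "EUR", "12 Rue de la Paix, Paris")
    actual = InvoiceResult("2024-07-01", 1250.00, "USD", "12 rue de la paix, paris")
    print(eval_case(expected, actual))
    # {'due_date': True, 'amount': True, 'currency': False, 'billing_address': True}
```

The point isn't the code itself, it's that each check maps to a failure you actually observed, so when the currency check starts failing you know exactly which part of the agent to look at.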

This resonates with me because it's the opposite of what feels productive in the moment. When something breaks, you want to jump in and fix it immediately. But Ng's argument is that slowing down to understand the root cause actually speeds you up in the long run. It's boring work compared to playing with new models or techniques, but it's what separates teams that make steady progress from ones that spin their wheels.
