Horizon Length - Moore's Law for AI Agents

This post from LessWrong critiques the idea of “horizon length” (a METR benchmark that scores tasks by how long they take skilled humans, then measures how long a task can be before an AI stops reliably completing it) as a kind of Moore’s law for agents. The author argues that using task duration as a proxy for difficulty is unreliable. Tasks differ in more than time cost: some demand conceptual leaps, domain novelty, or coping with messy data. Because of that, there’s no clean mapping between “time for a human” and “difficulty for an agent.” The benchmark is also biased because it only includes tasks that can be clearly specified and automatically checked, which naturally favours the kinds of problems current AI systems are already good at.
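To make the metric concrete, here is a minimal sketch of how a 50% time horizon can be estimated: fit a logistic curve of success probability against log human task duration and read off where it crosses 50%. The data and code below are illustrative assumptions, not METR’s actual implementation or numbers.

```python
# Minimal sketch (not METR's code): estimate an agent's "50% time horizon"
# by fitting a logistic curve of success probability vs. log human task time.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical results: (human_minutes_to_complete, agent_succeeded)
tasks = [
    (1, 1), (2, 1), (4, 1), (8, 1), (15, 1),
    (30, 1), (30, 0), (60, 1), (60, 0), (120, 0),
    (240, 0), (480, 0),
]
human_minutes = np.array([t[0] for t in tasks], dtype=float)
succeeded = np.array([t[1] for t in tasks])

# Work in log-time, since task durations span orders of magnitude.
X = np.log(human_minutes).reshape(-1, 1)
model = LogisticRegression().fit(X, succeeded)

# The 50% horizon is where the logistic crosses p = 0.5,
# i.e. where intercept + coef * log(t) = 0.
horizon_minutes = np.exp(-model.intercept_[0] / model.coef_[0][0])
print(f"Estimated 50% time horizon: {horizon_minutes:.0f} human-minutes")
```

The post’s objection is precisely that the single number this produces flattens everything that makes tasks hard besides how long they take a human.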

What I found most useful is the caution against overinterpreting neat metrics. It’s tempting to extrapolate from the horizon-length trend that AIs will soon handle tasks spanning hours or days, and from there to assume they’ll automate R&D or cause major disruption. The author’s point is that even if the trend holds within these benchmarks, it doesn’t necessarily reflect real-world capability. For anyone working in AI, this is a useful reminder to examine how well a proxy aligns with what actually matters, and to watch out for evaluation artefacts that give a false sense of progress.
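For a sense of what that extrapolation looks like mechanically, here is a toy calculation. The starting horizon and doubling time below are illustrative assumptions, and the whole exercise takes for granted the clean exponential trend that the post questions.

```python
# Illustrative extrapolation only: assumes the horizon grows as a clean
# exponential with a fixed doubling time, the very assumption the post questions.
import math

current_horizon_minutes = 60        # assumed starting horizon (1 hour)
doubling_time_months = 7            # assumed doubling time
target_minutes = 5 * 8 * 60         # a 40-hour (one work week) task

doublings_needed = math.log2(target_minutes / current_horizon_minutes)
months_needed = doublings_needed * doubling_time_months
print(f"Doublings needed: {doublings_needed:.1f}")
print(f"Months until a one-week horizon, under these assumptions: {months_needed:.0f}")
```

The arithmetic is trivial; the question the post raises is whether the benchmark trend it rests on says anything about real-world, messy, ill-specified work.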

Read the full post here 👉 https://www.lesswrong.com/posts/PzLSuaT6WGLQGJJJD/the-length-of-horizons
