TIL: The real bottleneck in AI coding isn't speed
Both Anthropic and OpenAI shipped "fast inference" this week, and their approaches reveal two very different bets. Anthropic serves the exact same Opus 4.6 model at 2.5x the speed for 6x the cost. The trick is straightforward: run smaller batch sizes on the GPUs, so your request isn't queued up waiting for other users' prompts to fill a batch.

OpenAI partnered with Cerebras to run a new, smaller model called Codex-Spark on a single wafer-scale chip with 44GB of on-chip SRAM. The chip is 57 times the size of an H100, and it streams over 1,000 tokens per second, roughly 15x faster than standard. But 44GB of SRAM only fits a model of around 20-40B parameters, so Spark is a distilled, less capable version of Codex. OpenAI had to build a worse model to make the hardware work. Cerebras confirmed the WSE-3 specs in its own announcement, and SambaNova has pointed out that the architecture's lack of off-chip memory forces exactly this kind of model-size constraint.
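To see where that 20-40B figure comes from, here's a back-of-the-envelope sketch (my arithmetic, not from the post): the weights alone have to fit in the 44GB of SRAM, and the capacity depends only on how many bytes each weight takes.

```python
# Back-of-the-envelope check (my arithmetic, not from the post): how many
# parameters fit in 44 GB of on-chip SRAM at common weight precisions,
# counting weights only and ignoring the KV cache and activations.
SRAM_GB = 44

bytes_per_param = {
    "fp16/bf16": 2.0,
    "int8/fp8": 1.0,
    "int4": 0.5,
}

for precision, nbytes in bytes_per_param.items():
    # GB of SRAM divided by bytes per weight gives billions of parameters.
    max_params_billions = SRAM_GB / nbytes
    print(f"{precision:>9}: ~{max_params_billions:.0f}B parameters")
```

At 16-bit weights that's roughly 22B parameters, and in practice some of that SRAM also has to hold the KV cache, which is why an estimate in the 20-40B range is plausible.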
So which approach wins? Probably neither, because speed might not matter as much as we assume. Sean Goedecke makes a sharp observation: the usefulness of AI agents is dominated by how few mistakes they make, not raw throughput. Buying 15x the speed at the cost of 20% more errors is a bad trade, because most of a developer's time is spent handling mistakes, not waiting. He notes that Cursor's hype dropped away around the same time they shipped their own fast-but-less-capable agent model. Speed is easy to sell. Accuracy is hard to build.
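To make the trade concrete, here's a toy calculation. All the numbers are my own illustrative assumptions, not Goedecke's: model developer time per task as waiting on generation plus expected fix-up time, then compare a 15x speedup against a 20% error reduction.

```python
# Toy model of the speed-vs-accuracy trade (all numbers are illustrative
# assumptions, not from the post): developer time per task is time spent
# waiting on generation plus expected time spent fixing agent mistakes.

def task_minutes(wait: float, error_rate: float, fix: float) -> float:
    """Expected developer minutes per task: waiting on tokens, plus the
    chance of an error times the human fix-up time it costs."""
    return wait + error_rate * fix

# Hypothetical baseline: 5 min waiting on tokens, 30% of tasks need a
# 90-minute human fix-up pass.
baseline = task_minutes(wait=5.0, error_rate=0.30, fix=90.0)
# 15x faster generation, but 20% more errors (0.30 -> 0.36).
fast = task_minutes(wait=5.0 / 15, error_rate=0.36, fix=90.0)
# Same speed, but 20% fewer errors (0.30 -> 0.24).
accurate = task_minutes(wait=5.0, error_rate=0.24, fix=90.0)

print(f"baseline:                     {baseline:.1f} min/task")  # 32.0
print(f"15x faster, 20% more errors:  {fast:.1f} min/task")      # 32.7
print(f"same speed, 20% fewer errors: {accurate:.1f} min/task")  # 26.6
```

Under these assumptions the 15x speedup barely moves the total, because waiting was never the dominant term, while the same 20% swing in accuracy shaves hours off a week of tasks. That's the shape of the argument, even if the exact numbers are made up.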
Source: Sean Goedecke, "Two different tricks for fast LLM inference"