How fast does it serve? Throughput, latency, and picking the right GPU
Part 2 of 2 on inference engineering for AI engineers.
Fitting LLMs on Self-Hosted GPUs
How much VRAM does your LLM need, and which GPU should you actually rent? A free calculator covering DeepSeek, Llama, and Mixtral on H100, B200, and A100.
How "Thinking" Models Actually Work
Lilian Weng's "Why We Think" is a survey of test-time compute and chain-of-thought reasoning. Here's what I pulled out of it.
Harness Engineering: The Outer System That Makes Agents Reliable
Building a good harness is what separates a good agentic implementation from a great one.
We’re Being Too Loose With the Term “World Model”
I think we are still too loose with the phrase “world model”.
Ship Prompts Like Software: Regression Testing for LLMs
Because "it seemed fine when I tested it" is not a deployment strategy.
Part 4 of 4: Evaluation-Driven Development for LLM Systems