Things I've learned or things I find interesting will be logged here. For long-form content, you might want to check out my newsletter.
On this page
Stripe's coding agents: the walls matter more than the model
Deep Blue
TIL: The real bottleneck in AI coding isn't speed
TIL: Markov Language
TIL: Learning New Tech With AI Assistance Might Backfire
Engineering teams evolve from coders to orchestrators
LLMs develop distinct trading personalities when given real money
pgvector doesn't scale
TIL: How Transformers work
Fine-Tuning makes a comeback
How to read research papers
Choosing an LLM Is Choosing a World-View
Reasoning LLMs are wanderers rather than systematic explorers
Horizon Length - Moore's law for AI Agents
The boring secret to building better AI agents
Representation Engineering with Control Vectors
Pace Layering Framework
AI Agents Are Finally Fixing Real-World Code Security Problems
Data in CSV Format Isn’t Always the Best for LLMs
Prompt Engineering vs Context Engineering
Why LLMs Confidently Hallucinate a Seahorse Emoji That Never Existed
Goodbye Manual Prompts, Hello DSPy
Nemawashi (根回し)
AI agents are starting to do real work
Stripe merges over 1,300 AI-written pull requests every week, and almost every headline about it is missing the actual point.
The temptation is to frame this as proof that models have got good enough to ship production code unsupervised. But that framing gets it backwards. Stripe built their "minions" system around deliberate constraint. They call the core design pattern "blueprints": orchestration flows that alternate between fixed, deterministic code nodes and open-ended agent loops. Their write-up puts it plainly: "putting LLMs into contained boxes compounds into system-wide reliability upside." The model does not run the system. The system runs the model.
Each minion pulls from a curated slice of Stripe's MCP toolset, gets at most two CI rounds, and terminates at a pull request. Engineers can still intervene or work alongside, but the agent produces the whole branch without hand-holding. They built this in-house rather than using off-the-shelf agents because their codebase is hundreds of millions of lines of mostly Ruby, with proprietary libraries and compliance constraints that generic agents simply cannot navigate. Context is not optional. It is the whole problem.
Therefore the human review gate at the end is not a formality. It is load-bearing. A CodeRabbit analysis of real production PRs found that AI-authored code introduces 1.75x more logic errors and 2.74x more XSS vulnerabilities than human-written code. Stripe's system is not immune to that; it is designed around it. The insight that keeps landing for me is this: the unglamorous parts of the architecture, the deterministic nodes, the two-round CI cap, the mandatory reviewer, are doing more work than the model is. Reliability at scale comes from knowing precisely where an LLM will fail and building the walls before it gets there.
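To make that shape concrete, here is a hypothetical sketch of a blueprint-style flow: deterministic code before and after a bounded agent loop, a hard cap of two CI rounds, and a pull request as the only exit. The function bodies are placeholders of mine, not Stripe's actual system.

```python
from dataclasses import dataclass, field

MAX_CI_ROUNDS = 2

@dataclass
class CIResult:
    passed: bool
    failures: list = field(default_factory=list)

# Stub implementations so the control flow runs; in a real system these would
# call the codebase context service, the coding agent, and CI.
def gather_context(task, tools): return {"task": task, "tools": tools}
def run_agent(task, context, tools, feedback=None): return f"branch-for-{task}"
def run_ci(branch): return CIResult(passed=True)
def open_pull_request(branch): return f"PR opened from {branch}"

def run_blueprint(task, tools):
    context = gather_context(task, tools)            # deterministic node: curated context
    branch = run_agent(task, context, tools)         # open-ended agent loop writes the change
    for _ in range(MAX_CI_ROUNDS):                   # deterministic node: hard cap on CI rounds
        result = run_ci(branch)
        if result.passed:
            break
        branch = run_agent(task, context, tools, feedback=result.failures)
    return open_pull_request(branch)                 # always terminates at a PR, never a merge

print(run_blueprint("fix flaky payment test", tools=["code_search", "test_runner"]))
```

Notice how much of the reliability lives in the boring deterministic lines, not in run_agent.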
Simon Willison and the Oxide and Friends podcast have coined a term for the existential dread software engineers feel watching AI eat their craft. They're calling it "Deep Blue," after the IBM machine that beat Kasparov in 1997. Not job-loss anxiety. Something more personal. The feeling that the thing you spent years getting good at stopped mattering.
The chess analogy deserves more scrutiny than it gets. Chess players went through this a generation ago and came out stronger. Chess is more popular than ever. But chess players had three decades to adjust, and chess was never about producing optimal moves. It was about the human contest. Software engineering is different. Companies hire you for the output. When a coding agent produces working, tested software in hours, the defence that "the code isn't any good" stops holding. IEEE Spectrum reports US programmer employment fell 27.5% between 2023 and 2025 (BLS data), though software developer roles held steady. The displacement is real but selective. Goldman Sachs estimates only 2.5% of US employment is at risk if today's AI use cases were expanded economy-wide. The fear is outrunning the data. But the fear itself is doing real damage.
The sharpest thing in Simon's post is the tension he refuses to resolve. The tool that threatens your identity can fulfil the mission you built that identity around. Naming the feeling won't fix it. It makes it harder to pretend it isn't there.
Both Anthropic and OpenAI shipped "fast inference" this week, and their approaches reveal two very different bets. Anthropic serves the exact same Opus 4.6 model at 2.5x the speed for 6x the cost. The trick is straightforward: smaller batch sizes on GPUs, so your request doesn't wait around for other users' prompts to fill the queue. OpenAI partnered with Cerebras to run a new, smaller model called Codex-Spark on a single wafer-scale chip with 44GB of on-chip SRAM. That chip is 57 times the size of an H100. Over 1,000 tokens per second, roughly 15x faster than standard. But 44GB of SRAM can only fit a model around 20-40B parameters, so Spark is a distilled, less capable version of Codex. OpenAI had to build a worse model to make the hardware work. Cerebras confirmed the WSE-3 specs in their own announcement, and SambaNova has pointed out that the architecture's lack of off-chip memory forces exactly this kind of model-size constraint.
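The back-of-envelope arithmetic behind that 20-40B estimate is simple: weights only, ignoring the KV cache, activations and runtime overhead, and treating a GB as 10^9 bytes.

```python
# Rough capacity check: how many parameters fit in 44 GB of on-chip SRAM,
# counting weights only (no KV cache, activations, or framework overhead).
SRAM_GB = 44
for precision, bytes_per_param in [("FP16/BF16", 2), ("INT8", 1)]:
    max_params_billions = SRAM_GB * 1e9 / bytes_per_param / 1e9
    print(f"{precision}: ~{max_params_billions:.0f}B parameters")
# FP16/BF16: ~22B parameters
# INT8: ~44B parameters
```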
So which approach wins? Probably neither, because speed might not matter as much as we assume. Sean Goedecke makes a sharp observation: the usefulness of AI agents is dominated by how few mistakes they make, not raw throughput. Buying 15x the speed at the cost of 20% more errors is a bad trade, because most of a developer's time is spent handling mistakes, not waiting. He notes that Cursor's hype dropped away around the same time they shipped their own fast-but-less-capable agent model. Speed is easy to sell. Accuracy is hard to build.
Programming languages were designed to make code easy for humans to write. But Davis Haupt argues we've been optimising for the wrong thing. His proposal for an agent-oriented language called Markov starts from a simple observation: Rust's fn keyword saves a programmer two keystrokes, but it costs an LLM extra tokens because common English words tokenise more efficiently than short abbreviations. Optimise for the machine's fluency, and you accidentally make code more readable for humans too. That's a genuinely surprising inversion. It's backed up by early evidence: TOON, a token-optimised alternative to JSON, already shows better LLM comprehension accuracy with roughly 40% fewer tokens.
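Haupt's tokenisation point is easy to poke at yourself. Here's a quick sketch using tiktoken; the exact counts depend entirely on the tokeniser and the snippet, so treat it as an illustration of how you would check, not as his measurement.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Two renderings of the same function signature: terse keywords vs plain English words.
snippets = {
    "terse": "fn add(a: i32, b: i32) -> i32 { a + b }",
    "wordy": "function add(a: integer, b: integer) returns integer { a + b }",
}
for name, code in snippets.items():
    tokens = enc.encode(code)
    print(f"{name}: {len(tokens)} tokens")
```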
The deeper move, though, is what Haupt does with compiler errors. Today, compilers speak to humans through ASCII arrows and terse error codes. Haupt wants Markov's compiler to speak to agents through prompts and diffs. Strong static types and exhaustive pattern matching become guardrails that keep an LLM on task, not just tools for catching human mistakes. Armin Ronacher arrived at similar conclusions independently, and academic projects like Quasar are formalising this with features like automated parallelisation baked into the language itself.
There's a real trend forming here. But the skeptic in me notes that every one of these proposals faces the same cold-start problem: LLMs perform dramatically better on languages already in their training data, and a brand-new language has none. Haupt acknowledges this tension without fully resolving it.
The single idea I keep coming back to: we spent decades making languages that are easy to write and hard to read. A language designed for agents might finally break that trade-off.
A study of 52 developers found that using AI to learn a new Python library led to worse comprehension scores, with no speed improvement. Here's what actually works.
I read a paper from Anthropic this week that I keep coming back to, mostly because it describes exactly how I've been using AI and suggests I might be sabotaging myself.
It's a randomised experiment where 52 professional developers learned a new Python async library (Trio). Half got access to an AI coding assistant, half didn't. Both groups could use documentation and web search.
The AI group scored 17% worse on comprehension tests afterwards, about two grade points lower. What's strange is they weren't even faster on average. You'd expect a trade-off where maybe you learn less but ship quicker. But the average completion times were basically identical between groups.
Some participants spent up to 11 minutes figuring out what to ask the AI. The time saved on actual coding got burned on prompt wrangling. The only people who genuinely finished faster were the ones who just pasted whatever the AI gave them without thinking. Those same people had the lowest quiz scores.
The six patterns
The researchers watched every screen recording and categorised how people actually used AI when learning something new.
Three patterns correlated with poor learning outcomes. AI Delegation is the obvious one, where you ask AI to write the code, paste it, and move on. Fastest completion time, lowest scores. Progressive Reliance is more insidious because you start out engaged, maybe ask a clarifying question or two, but gradually give up and let AI handle everything. You end up not learning the second half of the material at all. Then there's Iterative Debugging, where you keep feeding errors back to AI and asking it to fix things. You're technically interacting with AI a lot, but you're not building any mental model of what's going wrong.
Three patterns preserved learning even with AI access. Some participants practised Conceptual Inquiry, only asking AI conceptual questions and then writing the code themselves. They hit plenty of errors but worked through them independently. Others used Hybrid Code-Explanation, asking for an explanation of how the code works whenever they asked for code. It takes longer but you're actually processing what you receive. The most interesting one was Generation Then Comprehension, where participants got the code first but then followed up with questions about why it works. On the surface it looks like delegation, but the follow-up questions make all the difference.
Why the control group learned more
The control group hit about 3x more errors than the AI group, with a median of 3 errors versus 1. Working through TypeError exceptions and RuntimeWarnings forced them to actually understand the difference between passing a coroutine versus an async function, or when you need await versus start_soon.
The biggest score gap between groups was specifically on debugging questions.
This makes sense once you think about it. You can't get good at debugging if you never sit with broken code long enough to figure out why it's broken.
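For what it's worth, the Trio distinction those errors were teaching is tiny but easy to fumble. A minimal example (real Trio API, toy program):

```python
import trio

async def greet(name):
    await trio.sleep(0.1)
    print(f"hello {name}")

async def main():
    # Calling an async function without await only creates a coroutine object;
    # you have to await it for it to actually run.
    await greet("sequential")

    async with trio.open_nursery() as nursery:
        # start_soon takes the async function plus its arguments, not a coroutine:
        nursery.start_soon(greet, "concurrent")      # correct
        # nursery.start_soon(greet("concurrent"))    # TypeError: expected an async function

trio.run(main)
```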
Participant feedback
From the AI group: "I feel like I got lazy" and "there are still a lot of gaps in my understanding" and "I wish I'd taken the time to understand the explanations from the AI a bit more."
From the control group: "This was fun" and "The programming tasks did a good job of helping me understand how Trio works."
The supervision problem
There's a circular issue here that the paper points out. As AI writes more production code, we need developers who can supervise it, catch bugs, understand failures, and verify correctness. But if those developers learned their craft with heavy AI assistance, they may never have built those verification skills in the first place. The people we're counting on to catch AI mistakes might be the least equipped to do so.
What I'm changing
I'm not going to stop using AI for coding. But for learning new libraries or frameworks, I'm going to be more deliberate.
When I hit an error, I'll resist the immediate impulse to paste it into Claude. I've noticed I do this almost reflexively now, and this paper suggests that reflex is costing me something. Working through the error is where the understanding comes from.
If I do ask AI to generate code, I'll follow up with questions about how it works before moving on. The "generation then comprehension" pattern seems to preserve most of the learning while still getting help.
For genuinely new concepts, I'll stick to asking conceptual questions rather than asking for implementations. Use AI to build understanding, not to skip past it.
The paper's final line stuck with me. "AI-enhanced productivity is not a shortcut to competence." Feels obvious when you say it out loud, but easy to forget when you're trying to ship something and just want the error to go away.
What I like about "Conductors to Orchestrators: The Future of Agentic Coding" by Addy Osmani is that it treats orchestration as a responsibility problem rather than a branding exercise. If you take "coding's asynchronous future" seriously, you are not just wiring agents together, you are deciding where human judgment sits in the loop, which failures are acceptable, and who answers when the ensemble goes wrong.
Across domains you can already see outlines of that role. Accounting and finance pieces describe AI orchestrators who choose which tasks stay with human professionals and which go to specialised agents, then design the workflow that keeps everyone aligned. Workforce analyses talk about "collaborative intelligence" and new orchestration-heavy roles for middle managers, while other writers argue that the only non-negotiable skill in this environment is radical critical thinking, the willingness to interrogate every confident output instead of outsourcing your judgment.
So the interesting move here is not "AI replaces coders", it is "AI forces coders to choose whether they want to be executors or stewards of complex systems". If you pick stewardship, you have to care about latency, observability, security and ethics as much as clever prompts, because orchestration without those is just a nicer word for unmanaged risk. That is the thread this article pulls on, and it is where the real leverage of agentic coding will probably sit.
Six LLMs each received $10,000 to trade perpetual futures with zero human intervention, and Claude Sonnet 4.5 almost never shorts anything. Grok 4 holds positions for days. Qwen 3 consistently makes the biggest bets. These aren't random quirks but persistent behavioral patterns across thousands of trades, despite all models receiving identical prompts, identical market data, and identical instructions.
The setup was deliberately minimal: no news feeds, no narrative context, just price movements and technical indicators arriving every few minutes. The models had to infer everything from the numbers alone. But rather than converging on similar strategies, they diverged dramatically. GPT-5 consistently reports low confidence while taking positions anyway. Gemini 2.5 Pro trades three times more frequently than Grok 4. The sensitivity runs so deep that reversing data order from newest-first to oldest-first could flip a model from bullish to bearish. Therefore what emerges isn't evidence that LLMs can trade profitably (early results showed fees eating most returns), but that they exhibit stable risk preferences when forced into sequential decision-making under uncertainty.
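Nof1 hasn't published the exact prompt, but the mechanism is easy to picture with a hypothetical sketch like this, where the only lever is how the numbers are ordered. Field names and wording are invented, not their actual setup.

```python
# Hypothetical sketch of a numbers-only trading prompt.
def build_prompt(candles, newest_first=True):
    ordered = list(reversed(candles)) if newest_first else list(candles)
    rows = [f"close={c['close']} rsi={c['rsi']} volume={c['volume']}" for c in ordered]
    return ("You manage a perpetual futures position. Using only the data below, "
            "reply LONG, SHORT or HOLD with a position size.\n" + "\n".join(rows))

candles = [{"close": 100 + i, "rsi": 48 + i, "volume": 1200 - 10 * i} for i in range(5)]

# Identical data, two orderings; the experiment found this alone could flip a call.
print(build_prompt(candles, newest_first=True))
print(build_prompt(candles, newest_first=False))
```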
How LLMs Develop Trading Personalities
The experiment continues live until November 2025 with real capital on Hyperliquid, part of a broader push toward dynamic benchmarks over static tests that models can memorise. Recent papers like arXiv:2511.12599 explore risk frameworks for LLM traders, though most research still focuses on prediction rather than execution. Nof1's team documented failure modes including "self-referential confusion" where models misread their own trading plans, suggesting these aren't sophisticated traders but pattern-matchers revealing their training biases through market behavior.
It’s funny how engineers (myself included) assume that if you can store vectors in Postgres, you should. The logic feels sound: one database, one backup, one mental model. But the moment you hit scale, that convenience quietly turns into a trap.
In Alex Jacobs’s piece “The Case Against pgvector” he writes that you “pick an index type and then never rebalance, so recall can drift.” That line hit me because it captures the hidden friction: Postgres was built for structured queries, not high-dimensional vector search. Jacobs shows how building an index on millions of vectors can consume “10+ GB of RAM and hours of build time” on a production database.
Then comes the filtering trap. Jacobs points out that if you want “only published documents” combined with similarity search, the order of filtering matters. Filter before and it’s fast. Filter after and your query can take seconds instead of milliseconds. That gap is invisible in prototypes but painful in production.
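In query terms, the trap looks roughly like this. A generic pgvector-style sketch (hypothetical documents table, psycopg2, cosine-distance operator), not code from Jacobs's post:

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder DSN
cur = conn.cursor()
qvec = "[0.1, 0.2, 0.3]"  # your query embedding, serialised for pgvector

# Shape 1: the published filter is part of the nearest-neighbour query itself.
cur.execute(
    "SELECT id FROM documents WHERE published "
    "ORDER BY embedding <=> %s::vector LIMIT 10",
    (qvec,),
)

# Shape 2: similarity search runs over everything, and the filter is applied
# to the candidates afterwards. Per the article, this is the shape that can
# slide from milliseconds to seconds as the table grows.
cur.execute(
    "SELECT id FROM ("
    "  SELECT id, published FROM documents"
    "  ORDER BY embedding <=> %s::vector LIMIT 100"
    ") AS candidates WHERE published LIMIT 10",
    (qvec,),
)
```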
The takeaway is clear. Convenience is not a strategy. If your vector workloads go beyond trivial, use a dedicated vector database. The single-system story looks tidy on a diagram but often costs you far more in latency and maintenance.
If you want to actually understand transformers, this guide nails it. I've read a bunch of explanations and this one finally made the pieces fit together.
The thing that works is it doesn't just throw the architecture at you. It shows you the whole messy history first. RNNs couldn't remember long sequences. LSTMs tried to fix that but got painfully slow. CNNs were faster but couldn't hold context. Then Google Brain basically said "screw it, let's bin recurrence completely and just use attention." That's how we got the famous paper. Once you see that chain of failures and fixes, transformers stop being this weird abstract thing. You get why masked attention exists, why residual connections matter, why positional encodings had to be added. It all clicks because you see what problem each bit solves.
How Query, Key and Value matrices are derived - source: krupadave.com
The hand-drawn illustrations help too. There's a Google search analogy for queries, keys, and values that made way more sense than the maths notation ever did. And the water pressure metaphor for residual connections actually stuck with me. It took the author months to research and draw everything. You can tell because it doesn't feel rushed or surface-level. If you've been putting this off because most explanations either skim over details or drown you in equations, this one gets the balance right.
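If you want the Q/K/V picture in code rather than drawings, single-head scaled dot-product attention fits in a few lines of numpy. This is the textbook formulation, not the guide's own code:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over a sequence of token embeddings X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v               # derive queries, keys, values
    scores = Q @ K.T / np.sqrt(Q.shape[-1])           # how much each token attends to the others
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                # weighted mix of the values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                          # 5 tokens, embedding dimension 16
W_q, W_k, W_v = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)         # (5, 16)
```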
Fine-tuning went from the hottest thing in machine learning to accounting for less than 10% of AI workloads in just a couple of years. Teams figured out they could get 90% of the way there with prompt engineering and RAG, so why bother with the extra complexity? Sensible move. But now something's shifting. Mira Murati's new $12B startup is betting big on fine-tuning-as-a-platform, and the ecosystem seems to be nodding along.
Here's what actually changed. Generic models are brilliant at being generic, but companies are starting to bump into a ceiling. You can prompt engineer all day, but your model still won't truly know your taxonomy, speak in your exact tone, or handle your specific compliance rules the way a properly trained system would. The pendulum is swinging back not because prompting failed, but because it succeeded at everything except the final 10% that actually matters for differentiation. Open-weight models like Llama and Mistral make this practical now. You can own and persist your fine-tuned variants without vendor lock-in.
This isn't the same hype cycle as before. Back then, fine-tuning was trendy. Now it's strategic. Companies want control, and they're willing to invest in bespoke intelligence instead of settling for good enough. The irony is that we spent years learning how to avoid fine-tuning, only to discover that some problems really do require teaching the model your specific language, not just describing it in a prompt.
Most engineers read research papers like blog posts, expecting instant clarity. That’s why so many give up halfway through. The trick isn’t to read harder but to read differently. This guide (https://blog.codingconfessions.com/p/a-software-engineers-guide-to-reading-papers) reframes paper reading as a process you can iterate on, not a one-shot test of intelligence.
The author suggests a multi-pass approach. First, skim the abstract, intro, results and conclusion to see if the paper is even relevant. Next, read the body while flagging any gaps in your understanding. Finally, revisit it with fresh context and ask why each step exists. The shift is subtle but powerful: instead of fighting the paper, you collaborate with it.
What I’m taking away is this. Reading research is a skill, not a talent. If I approach papers as layered workflows rather than puzzles to solve in one go, I’ll extract more ideas I can actually build on.
Some LLMs lean left. Others lean right. The Anomify study shows that mainstream models are not neutral arbiters of truth, they come with their own built-in world-views. That means the answer you get is shaped not only by your prompt, but by the ideological fingerprint of the model you chose in the first place.
I assumed most models would at least converge on a kind of centrist neutrality, but the experiment revealed clear and consistent patterns in how they respond to social and political questions. One model might advocate for stronger regulation while another leans libertarian. Some avoid topics entirely while others dive in. This matters because it is easy to treat LLM output as objective when it is really a reflection of training data, guardrails, and product philosophy.
The takeaway is simple. If you are using an LLM for reasoning or advice, the choice of model is a design decision, not a cosmetic one. You are not only choosing a capability profile. You are inheriting a point of view. Link to study: https://anomify.ai/resources/articles/llm-bias
It turns out that when we ask reasoning-capable models such as the latest LLMs (GPT-5 family, Claude Opus and successors, Gemini 1.5 Pro etc.) to think through problems, they often behave like explorers wandering aimlessly rather than systematic searchers. The paper titled Reasoning LLMs are Wandering Solution Explorers formalises what it means to systematically probe a solution space (valid transitions, reaching a goal, no wasted states), then shows that these models frequently deviate by skipping necessary states, revisiting old ones, hallucinating conclusions or making invalid transitions. This wandering can still look effective on simple tasks, but once the solution space grows in depth or complexity, the weaknesses surface. Therefore the authors argue that large models are often wandering rather than reasoning, and that their mistakes stay hidden on shallow problems.
The upshot is that a wanderer can stumble into answers on small search spaces, but that same behaviour collapses when the task becomes deep or requires strict structure. The authors show mathematically and empirically that shallow success can disguise systemic flaws, but deeper problems expose the lack of disciplined search. Therefore performance plateaus for complex reasoning cannot simply be fixed by adding more tokens or more compute, but instead require changes in how we guide or constrain the reasoning process.
Performance degradation chart - solution coverage vs problem size
For us as AI engineers, this is useful because it reinforces a shift from evaluating only outcomes to evaluating the path the model took to get there. A model that reasons by wandering might appear competent, but it becomes unreliable in real systems that require correctness, traceability and depth. Therefore we may need new training signals, architectural biases or process based evaluation to build agentic systems we can trust. In other words, good reasoning agents need maps, not just bigger backpacks.
This post from LessWrong critiques the idea of “horizon length” (a benchmark from METR that ranks tasks by how long humans take, and then measures how long AIs can handle) as a kind of Moore’s law for agents. The author argues that using task duration as a proxy for difficulty is unreliable. Different tasks vary in more than just time cost, such as the need for conceptual leaps, domain novelty, or dealing with messy data. Because of that, there’s no clean mapping between “time to human” and “difficulty for an agent.” The benchmark is also biased because it only measures tasks that can be clearly specified and automatically checked, which naturally favour the kinds of problems current AI systems are already good at.
What I found most useful is the caution this offers about overinterpreting neat metrics. It’s tempting to extrapolate from horizon length that AIs will soon take on longer tasks that span hours or days, and from there to assume they’ll automate R&D or cause major disruptions. The author’s point is that even if the trend holds within these benchmarks, it doesn’t necessarily reflect real-world capabilities. For anyone working in AI, this is a useful reminder to always examine how well a proxy aligns with what actually matters, and to watch out for evaluation artefacts that give a false sense of progress.
Andrew Ng pointed out something interesting: the single biggest factor in how fast teams build AI agents isn't using the latest tools or techniques. It's having a disciplined process for measuring performance (evals) and figuring out why things break (error analysis).
He compares it to how musicians don't just play a piece start to finish over and over. They find the tricky parts and practice those specifically. Or how you don't just chase nutrition trends but actually look at your bloodwork to see what's actually wrong. The idea is simple but easy to forget when you're caught up in trying every new AI technique that goes viral on social media.
The tricky part with AI agents is that there are so many more ways things can go wrong compared to traditional machine learning. If you're building something to process financial invoices automatically, the agent could mess up the due date, the amount, the currency, mix up addresses, or make the wrong API call. The output space is huge. Ng's approach is to build a quick prototype first, manually look at where it stumbles, and then create specific tests for those problem areas. Sometimes these are objective metrics you can code up, sometimes you need to use another LLM to judge the outputs. It's more iterative and messy than traditional ML, but that's the point. You need to see where it actually fails in practice before you know what to measure.
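A made-up but concrete version of that error analysis for the invoice example: compare a handful of hand-labelled invoices against the agent's output, count failures per field, and let the counts tell you which evals to build first.

```python
from collections import Counter

# Tiny hand-labelled set for a hypothetical invoice-extraction agent.
cases = [
    {"expected": {"amount": "120.00", "currency": "EUR", "due": "2025-01-31"},
     "got":      {"amount": "120.00", "currency": "USD", "due": "2025-01-31"}},
    {"expected": {"amount": "89.50",  "currency": "GBP", "due": "2025-02-15"},
     "got":      {"amount": "89.50",  "currency": "GBP", "due": "2025-02-01"}},
    {"expected": {"amount": "310.00", "currency": "EUR", "due": "2025-03-01"},
     "got":      {"amount": "310.00", "currency": "EUR", "due": "2025-03-01"}},
]

errors = Counter()
for case in cases:
    for field, expected in case["expected"].items():
        if case["got"][field] != expected:
            errors[field] += 1

# The fields that break most often are where the next eval (or LLM judge) goes.
print(errors.most_common())
```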
This resonates with me because it's the opposite of what feels productive in the moment. When something breaks, you want to jump in and fix it immediately. But Ng's argument is that slowing down to understand the root cause actually speeds you up in the long run. It's boring work compared to playing with new models or techniques, but it's what separates teams that make steady progress from ones that spin their wheels.
There's this technique called representation engineering that lets you modify how AI models behave in a surprisingly effective way. Instead of carefully crafting prompts or retraining the model, you create “control vectors” that directly modify the model’s internal activations. The idea is simple: feed the model contrasting examples, like “act extremely happy” versus “act extremely sad,” capture the difference in how its neurons fire, and then add or subtract that difference during inference. The author shared some wild experiments, including an “acid trip” vector that made Mistral talk about kaleidoscopes and trippy patterns, a “lazy” vector that produced minimal answers, and even political leaning vectors. Each one takes about a minute to train.
What makes this interesting is the level of control it gives you. You can dial the effect up or down with a single number, which is almost impossible to achieve through prompt engineering alone. How would you make a model “slightly more honest” versus “extremely honest” with just words? The control vector approach also makes models more resistant to jailbreaks because the effect applies to every token, not just the prompt. The author demonstrated how a simple “car dealership” vector could resist the same kind of attack that famously bypassed Chevrolet’s chatbot. It feels like a genuinely practical tool for anyone deploying AI systems who wants fine-grained behavioural control without the hassle of constant prompt tweaks or costly fine-tuning.
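Here's a rough PyTorch sketch of the mechanics, not the author's actual code or library. It assumes a Llama/Mistral-style model where model.model.layers[i] sits on the residual stream, and it uses a single contrast pair where you would really average over many.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mistral-7B-Instruct-v0.2"   # swap in whatever model you actually run
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)
LAYER = 15  # which layer to read and steer; a tunable assumption

def last_token_state(prompt):
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"))
    return out.hidden_states[LAYER][0, -1]

# 1. Contrast two prompts and take the difference of their activations.
control = last_token_state("Act extremely happy. I feel") - \
          last_token_state("Act extremely sad. I feel")

# 2. Add the scaled vector to that layer's output on every forward pass.
def steer(module, inputs, output, strength=4.0):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + strength * control
    return ((hidden,) + tuple(output[1:])) if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steer)
with torch.no_grad():
    ids = tok("Tell me about your day.", return_tensors="pt")
    print(tok.decode(model.generate(**ids, max_new_tokens=40)[0]))
handle.remove()  # detach the hook to get the unsteered model back
```

The strength argument is the dial the post talks about: turn it up for more of the trait, flip the sign to get its opposite.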
I came across Stewart Brand’s pace layering framework because a former colleague and friend, Seb Wagner from Flow Republic, recommended it to me. It explains how different parts of society evolve at different speeds. Fashion and art change quickly, while deeper layers like culture, governance or nature move much more slowly. The fascinating bit is how these layers interact. Fast layers bring new ideas and push for change, but they are balanced and contained by the slower ones.
Pace Layers
You can see this dynamic clearly in modern tech. AI tools and interfaces shift almost weekly, business models evolve quarterly, infrastructure takes years, and regulation and ethics trail even further behind. Culture and environmental impact stretch over decades. The gap between speed and stability is where both tension and opportunity show up. For those of us working in AI, it’s a reminder to think not just about what’s new, but how those innovations sit on top of and eventually reshape the slower foundations beneath them.
I came across something genuinely interesting in code security this week: DeepMind’s CodeMender, an AI agent that doesn’t just flag vulnerabilities but actually fixes them and upstreams the patches to major open-source projects. CodeMender leverages the "thinking" capabilities of Gemini Deep Think models to produce an autonomous agent capable of debugging and fixing complex bugs and vulnerabilities.
It’s already contributed dozens of security improvements across large codebases, reasoning about root causes and rewriting risky patterns rather than applying quick patches.
What I like about this is how agentic the setup is. CodeMender uses a coordinated multi-agent system powered by Gemini, combining vulnerability detection, static analysis, patch validation, and code rewriting. It’s not just reactive either. For example, it’s been adding -fbounds-safety annotations to libwebp to proactively reduce entire classes of bugs. For anyone working on secure automation or agent protocols, this feels like a practical step forward.
When you feed a large table into an LLM, the way you format the input can change the model’s accuracy quite a bit. In a test of 11 formats (CSV, JSON, markdown table, YAML and more), a markdown “key: value” style scored around 60.7% accuracy, which was far ahead of CSV at roughly 44.3%. CSV and JSONL, despite being the usual defaults, were among the weakest performers.
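For a feel of the difference, here's one small table rendered both ways (a sketch; the benchmark's exact templates may differ):

```python
import csv, io

rows = [
    {"order_id": "1001", "customer": "Acme GmbH", "total_eur": "249.90", "status": "shipped"},
    {"order_id": "1002", "customer": "Globex",    "total_eur": "87.50",  "status": "pending"},
]

# CSV: compact, but every value's meaning depends on remembering column positions.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())

# Key-value blocks: more tokens, but each value carries its label right next to it.
for row in rows:
    print("\n".join(f"{key}: {value}" for key, value in row.items()), end="\n\n")
```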
What stood out to me was the trade-off. The top format used many more tokens, so you have to balance cost and accuracy. For anyone working with agents, retrieval systems or table data, sticking with CSV by default might be leaving performance on the table. It is worth experimenting with different formats. Read the full article.
From Anthropic: https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
Prompt Engineering:
Prompt engineering refers to methods for writing and organizing LLM instructions for optimal outcomes.
Context Engineering:
Context engineering refers to the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference, including all the other information that may land there outside of the prompts.
I like this framing because in the world of agentic systems, writing clever prompts alone won’t cut it. Agents operate in dynamic environments, constantly juggling new information. The real skill is curating which pieces of that evolving universe end up in context at the right moment. It’s a subtle but powerful shift that mirrors how good software architectures focus not only on code, but also on data flow.
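A toy sketch of what that curation might look like inside an agent loop; every name here is invented for illustration, and a real system would use a proper token counter and relevance ranking.

```python
def build_context(system_prompt, memory, retrieved_docs, tool_results, budget_tokens,
                  count_tokens=lambda text: len(text) // 4):  # crude token estimate
    """Assemble the context for the next LLM call, dropping whatever no longer fits."""
    candidates = (
        [("system", system_prompt)]
        + [("tool", t) for t in tool_results[-3:]]    # only the latest tool outputs
        + [("doc", d) for d in retrieved_docs]        # assumed pre-ranked by relevance
        + [("memory", m) for m in memory[-5:]]        # sliding window over the dialogue
    )
    context, used = [], 0
    for kind, text in candidates:
        cost = count_tokens(text)
        if kind != "system" and used + cost > budget_tokens:
            continue                                   # curate: leave out what doesn't fit
        context.append(f"[{kind}] {text}")
        used += cost
    return "\n\n".join(context)

print(build_context("You are a support agent.",
                    memory=["user asked about refunds"],
                    retrieved_docs=["Refund policy: 30 days, original payment method."],
                    tool_results=["lookup_order(1001) -> shipped"],
                    budget_tokens=300))
```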
If you’re building or designing AI agents, this is worth a read.
Ask any major AI if there's a seahorse emoji and they'll say yes with 100% confidence. Then ask them to show you, and they completely freak out, spitting random fish emojis in an endless loop. Plot twist: there's no seahorse emoji. Never has been. But tons of humans also swear they remember one existing.
Makes sense we'd all assume it exists though. Tons of ocean animals are emojis, so why not seahorses? The post above digs into what's happening inside the model using this interpretability technique called logit lens. The model builds up this internal concept of "seahorse + emoji" and genuinely believes it's about to output one. But when it hits the final layer that picks the actual token, there's no seahorse in the vocabulary. So it grabs the closest match, a tropical fish or horse, and outputs that. The AI doesn't realize it messed up until it sees its own wrong answer. Then some models catch themselves and backtrack, others just spiral into emoji hell.
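The logit-lens trick itself is only a few lines with transformers. A minimal sketch using GPT-2 as a small stand-in (the post inspects much larger models, and its code differs), just to show the technique of decoding intermediate layers with the final unembedding:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

ids = tok("The emoji for a seahorse is", return_tensors="pt")
with torch.no_grad():
    out = model(**ids)

# Logit lens: push every intermediate layer's last-token state through the final
# layer norm and unembedding matrix to see what the model "thinks" it will say.
for layer, hidden in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(hidden[0, -1]))
    print(layer, repr(tok.decode(logits.argmax().item())))
```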
I tried this myself with both Claude and ChatGPT and it looks like they've mostly fixed this now.
ChatGPT went through the whole confusion cycle (horse, dragon, then a bunch of random attempts) before finally catching itself and admitting there's no seahorse emoji. Claude went even further off the rails, confidently claiming the seahorse emoji is U+1F994 and telling me I should be able to find it on my keyboard.
It's a perfect example of how confidence means nothing. The model isn't lying or hallucinating in the usual sense. It's just wrong about something it reasonably assumed was true, then gets blindsided by reality.
Today I learned about a smarter way to deal with the headache of prompts in production. Drew Brunig’s talk at the Databricks Data + AI Summit is hands down the clearest explanation I’ve seen of why traditional prompting doesn’t scale well. He compares it to regex gone wild: what starts as a neat solution quickly becomes a brittle mess of instructions, examples, hacks, and model quirks buried inside giant text blocks that no one wants to touch. A single “good” prompt can have so many moving parts that it becomes practically unreadable.
DSPy takes a very different approach. Instead of hand-crafting and maintaining prompts, you define the task in a structured way and let the framework generate and optimise the prompts for you. You describe what goes in and what should come out, pick a strategy (like simple prediction, chain-of-thought, or tool use), and DSPy handles the formatting, parsing, and system prompt details behind the scenes. Because the task is decoupled from any specific model, switching to a better or cheaper model later is as easy as swapping it out and re-optimising.
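The flavour of it, going by DSPy's current docs (the model id, signature and fields below are my own example, not from the talk):

```python
import dspy

# Assumes an API key for an OpenAI-compatible provider is configured in the environment.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class TriageTicket(dspy.Signature):
    """Classify a support ticket and draft a one-line reply."""
    ticket: str = dspy.InputField()
    category: str = dspy.OutputField(desc="billing, bug, or how-to")
    reply: str = dspy.OutputField()

# Pick a strategy; DSPy builds and maintains the actual prompt behind the scenes.
triage = dspy.ChainOfThought(TriageTicket)
result = triage(ticket="I was charged twice this month.")
print(result.category, "-", result.reply)
```

Swapping models later means changing the dspy.LM line and re-optimising, which is exactly the decoupling the talk is selling.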
This feels like a glimpse of where prompt engineering is heading: less manual tinkering, more structured task definitions and automated optimisation. I’ll definitely be trying DSPy out soon.
There’s a Japanese concept called nemawashi (literally “root-walking”) that offers a way around the dreaded “big reveal” in engineering proposals. Instead of marching into a meeting with your fully-formed design and expecting everyone to buy it, nemawashi encourages you to talk privately with all relevant stakeholders first and get feedback, surface objections, let people shape the idea, and build informal buy-in. By the time the formal meeting happens, the decision is mostly baked, not bombarded.
When I read “Quiet Influence: A Guide to Nemawashi in Engineering,” what struck me is how often we dismiss the political or social side of engineering work. A technically perfect solution can still die if colleagues feel blindsided, ignored, or defensive in a meeting. Adopting nemawashi has the power to transform you from someone pushing an idea to someone guiding a shared direction. For me (and for readers who work in cross-team or senior roles), it underlines a critical truth: influence is relational, not just visionary.
Ethan Mollick argues that AIs have quietly crossed a line. OpenAI recently tested models on complex, expert-designed tasks that usually take humans four to seven hours. Humans still performed better, but the gap is shrinking fast. Most AI mistakes were about formatting or following instructions, not reasoning.
The standout example is Claude 4.5 replicating academic research on its own. Work that would have taken hours was done in minutes, hinting at how whole fields could change when repetitive but valuable tasks get automated.
It’s a reminder that the real shift isn’t just about replacing jobs. It’s about rethinking how we work with AI so we don’t drown in a sea of AI-generated busywork.