Things I've learned or things I find interesting will be logged here. For long-form content, you might want to check out my newsletter.
On this page
Stripe's coding agents: the walls matter more than the model
Deep Blue
TIL: The real bottleneck in AI coding isn't speed
TIL: Markov Language
TIL: Learning New Tech With AI Assistance Might Backfire
Engineering teams evolve from coders to orchestrators
LLMs develop distinct trading personalities when given real money
pgvector doesn't scale
TIL: How Transformers work
Fine-Tuning makes a comeback
How to read research papers
Choosing an LLM Is Choosing a World-View
Reasoning LLMs are wanderers rather than systematic explorers
Horizon Length - Moore's law for AI Agents
The boring secret to building better AI agents
Representation Engineering with Control Vectors
Pace Layering Framework
AI Agents Are Finally Fixing Real-World Code Security Problems
Data in CSV Format Isn’t Always the Best for LLMs
Prompt Engineering vs Context Engineering
Why LLMs Confidently Hallucinate a Seahorse Emoji That Never Existed
Goodbye Manual Prompts, Hello DSPy
Nemawashi (根回し)
AI agents are starting to do real work
Stripe merges over 1,300 AI-written pull requests every week, and almost every headline about it is missing the actual point.
The temptation is to frame this as proof that models have got good enough to ship production code unsupervised. But that framing gets it backwards. Stripe built their "minions" system around deliberate constraint. They call the core design pattern "blueprints": orchestration flows that alternate between fixed, deterministic code nodes and open-ended agent loops. Their write-up puts it plainly: "putting LLMs into contained boxes compounds into system-wide reliability upside." The model does not run the system. The system runs the model.
Each minion pulls from a curated slice of Stripe's MCP toolset, gets at most two CI rounds, and terminates at a pull request. Engineers can still intervene or work alongside, but the agent produces the whole branch without hand-holding. They built this in-house rather than using off-the-shelf agents because their codebase is hundreds of millions of lines of mostly Ruby, with proprietary libraries and compliance constraints that generic agents simply cannot navigate. Context is not optional. It is the whole problem.
Therefore the human review gate at the end is not a formality. It is load-bearing. A CodeRabbit analysis of real production PRs found that AI-authored code introduces 1.75x more logic errors and 2.74x more XSS vulnerabilities than human-written code. Stripe's system is not immune to that; it is designed around it. The insight that keeps landing for me is this: the unglamorous parts of the architecture, the deterministic nodes, the two-round CI cap, the mandatory reviewer, are doing more work than the model is. Reliability at scale comes from knowing precisely where an LLM will fail and building the walls before it gets there.
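To make that shape concrete, here is a hypothetical sketch of a blueprint-style flow: deterministic code before and after a bounded agent loop, a hard cap of two CI rounds, and a pull request as the only exit. The function bodies are placeholders of mine, not Stripe's actual system.

```python
from dataclasses import dataclass, field

MAX_CI_ROUNDS = 2

@dataclass
class CIResult:
    passed: bool
    failures: list = field(default_factory=list)

# Stub implementations so the control flow runs; in a real system these would
# call the codebase context service, the coding agent, and CI.
def gather_context(task, tools): return {"task": task, "tools": tools}
def run_agent(task, context, tools, feedback=None): return f"branch-for-{task}"
def run_ci(branch): return CIResult(passed=True)
def open_pull_request(branch): return f"PR opened from {branch}"

def run_blueprint(task, tools):
    context = gather_context(task, tools)            # deterministic node: curated context
    branch = run_agent(task, context, tools)         # open-ended agent loop writes the change
    for _ in range(MAX_CI_ROUNDS):                   # deterministic node: hard cap on CI rounds
        result = run_ci(branch)
        if result.passed:
            break
        branch = run_agent(task, context, tools, feedback=result.failures)
    return open_pull_request(branch)                 # always terminates at a PR, never a merge

print(run_blueprint("fix flaky payment test", tools=["code_search", "test_runner"]))
```

Notice how much of the reliability lives in the boring deterministic lines, not in run_agent.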
Simon Willison and the Oxide and Friends podcast have coined a term for the existential dread software engineers feel watching AI eat their craft. They're calling it "Deep Blue," after the IBM machine that beat Kasparov in 1997. Not job-loss anxiety. Something more personal. The feeling that the thing you spent years getting good at stopped mattering.
The chess analogy deserves more scrutiny than it gets. Chess players went through this a generation ago and came out stronger. Chess is more popular than ever. But chess players had three decades to adjust, and chess was never about producing optimal moves. It was about the human contest. Software engineering is different. Companies hire you for the output. When a coding agent produces working, tested software in hours, the defence that "the code isn't any good" stops holding. IEEE Spectrum reports US programmer employment fell 27.5% between 2023 and 2025 (BLS data), though software developer roles held steady. The displacement is real but selective. Goldman Sachs estimates only 2.5% of US employment is at risk if today's AI use cases were expanded economy-wide. The fear is outrunning the data. But the fear itself is doing real damage.
The sharpest thing in Simon's post is the tension he refuses to resolve. The tool that threatens your identity can fulfil the mission you built that identity around. Naming the feeling won't fix it. It makes it harder to pretend it isn't there.
Both Anthropic and OpenAI shipped "fast inference" this week, and their approaches reveal two very different bets. Anthropic serves the exact same Opus 4.6 model at 2.5x the speed for 6x the cost. The trick is straightforward: smaller batch sizes on GPUs, so your request doesn't wait around for other users' prompts to fill the queue. OpenAI partnered with Cerebras to run a new, smaller model called Codex-Spark on a single wafer-scale chip with 44GB of on-chip SRAM. That chip is 57 times the size of an H100. Over 1,000 tokens per second, roughly 15x faster than standard. But 44GB of SRAM can only fit a model around 20-40B parameters, so Spark is a distilled, less capable version of Codex. OpenAI had to build a worse model to make the hardware work. Cerebras confirmed the WSE-3 specs in their own announcement, and SambaNova has pointed out that the architecture's lack of off-chip memory forces exactly this kind of model-size constraint.
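The back-of-envelope arithmetic behind that 20-40B estimate is simple: weights only, ignoring the KV cache, activations and runtime overhead, and treating a GB as 10^9 bytes.

```python
# Rough capacity check: how many parameters fit in 44 GB of on-chip SRAM,
# counting weights only (no KV cache, activations, or framework overhead).
SRAM_GB = 44
for precision, bytes_per_param in [("FP16/BF16", 2), ("INT8", 1)]:
    max_params_billions = SRAM_GB * 1e9 / bytes_per_param / 1e9
    print(f"{precision}: ~{max_params_billions:.0f}B parameters")
# FP16/BF16: ~22B parameters
# INT8: ~44B parameters
```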
So which approach wins? Probably neither, because speed might not matter as much as we assume. Sean Goedecke makes a sharp observation: the usefulness of AI agents is dominated by how few mistakes they make, not raw throughput. Buying 15x the speed at the cost of 20% more errors is a bad trade, because most of a developer's time is spent handling mistakes, not waiting. He notes that Cursor's hype dropped away around the same time they shipped their own fast-but-less-capable agent model. Speed is easy to sell. Accuracy is hard to build.
Programming languages were designed to make code easy for humans to write. But Davis Haupt argues we've been optimising for the wrong thing. His proposal for an agent-oriented language called Markov starts from a simple observation: Rust's fn keyword saves a programmer two keystrokes, but it costs an LLM extra tokens because common English words tokenise more efficiently than short abbreviations. Optimise for the machine's fluency, and you accidentally make code more readable for humans too. That's a genuinely surprising inversion. It's backed up by early evidence: TOON, a token-optimised alternative to JSON, already shows better LLM comprehension accuracy with roughly 40% fewer tokens.
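Haupt's tokenisation point is easy to poke at yourself. Here's a quick sketch using tiktoken; the exact counts depend entirely on the tokeniser and the snippet, so treat it as an illustration of how you would check, not as his measurement.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Two renderings of the same function signature: terse keywords vs plain English words.
snippets = {
    "terse": "fn add(a: i32, b: i32) -> i32 { a + b }",
    "wordy": "function add(a: integer, b: integer) returns integer { a + b }",
}
for name, code in snippets.items():
    tokens = enc.encode(code)
    print(f"{name}: {len(tokens)} tokens")
```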
The deeper move, though, is what Haupt does with compiler errors. Today, compilers speak to humans through ASCII arrows and terse error codes. Haupt wants Markov's compiler to speak to agents through prompts and diffs. Strong static types and exhaustive pattern matching become guardrails that keep an LLM on task, not just tools for catching human mistakes. Armin Ronacher arrived at similar conclusions independently, and academic projects like Quasar are formalising this with features like automated parallelisation baked into the language itself.
There's a real trend forming here. But the skeptic in me notes that every one of these proposals faces the same cold-start problem: LLMs perform dramatically better on languages already in their training data, and a brand-new language has none. Haupt acknowledges this tension without fully resolving it.
The single idea I keep coming back to: we spent decades making languages that are easy to write and hard to read. A language designed for agents might finally break that trade-off.
A study of 52 developers found that using AI to learn a new Python library led to worse comprehension scores, with no speed improvement. Here's what actually works.
I read a paper from Anthropic this week that I keep coming back to, mostly because it describes exactly how I've been using AI and suggests I might be sabotaging myself.
It's a randomised experiment where 52 professional developers learned a new Python async library (Trio). Half got access to an AI coding assistant, half didn't. Both groups could use documentation and web search.
The AI group scored 17% worse on comprehension tests afterwards, about two grade points lower. What's strange is they weren't even faster on average. You'd expect a trade-off where maybe you learn less but ship quicker. But the average completion times were basically identical between groups.
Some participants spent up to 11 minutes figuring out what to ask the AI. The time saved on actual coding got burned on prompt wrangling. The only people who genuinely finished faster were the ones who just pasted whatever the AI gave them without thinking. Those same people had the lowest quiz scores.
The six patterns
The researchers watched every screen recording and categorised how people actually used AI when learning something new.
Three patterns correlated with poor learning outcomes. AI Delegation is the obvious one, where you ask AI to write the code, paste it, and move on. Fastest completion time, lowest scores. Progressive Reliance is more insidious because you start out engaged, maybe ask a clarifying question or two, but gradually give up and let AI handle everything. You end up not learning the second half of the material at all. Then there's Iterative Debugging, where you keep feeding errors back to AI and asking it to fix things. You're technically interacting with AI a lot, but you're not building any mental model of what's going wrong.
Three patterns preserved learning even with AI access. Some participants practised Conceptual Inquiry, only asking AI conceptual questions and then writing the code themselves. They hit plenty of errors but worked through them independently. Others used Hybrid Code-Explanation, asking for an explanation of how the code works whenever they asked for code. It takes longer but you're actually processing what you receive. The most interesting one was Generation Then Comprehension, where participants got the code first but then followed up with questions about why it works. On the surface it looks like delegation, but the follow-up questions make all the difference.
Why the control group learned more
The control group hit about 3x more errors than the AI group, with a median of 3 errors versus 1. Working through TypeError exceptions and RuntimeWarnings forced them to actually understand the difference between passing a coroutine versus an async function, or when you need await versus start_soon.
The biggest score gap between groups was specifically on debugging questions.
This makes sense once you think about it. You can't get good at debugging if you never sit with broken code long enough to figure out why it's broken.
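For what it's worth, the Trio distinction those errors were teaching is tiny but easy to fumble. A minimal example (real Trio API, toy program):

```python
import trio

async def greet(name):
    await trio.sleep(0.1)
    print(f"hello {name}")

async def main():
    # Calling an async function without await only creates a coroutine object;
    # you have to await it for it to actually run.
    await greet("sequential")

    async with trio.open_nursery() as nursery:
        # start_soon takes the async function plus its arguments, not a coroutine:
        nursery.start_soon(greet, "concurrent")      # correct
        # nursery.start_soon(greet("concurrent"))    # TypeError: expected an async function

trio.run(main)
```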
Participant feedback
From the AI group: "I feel like I got lazy" and "there are still a lot of gaps in my understanding" and "I wish I'd taken the time to understand the explanations from the AI a bit more."
From the control group: "This was fun" and "The programming tasks did a good job of helping me understand how Trio works."
The supervision problem
There's a circular issue here that the paper points out. As AI writes more production code, we need developers who can supervise it, catch bugs, understand failures, and verify correctness. But if those developers learned their craft with heavy AI assistance, they may never have built those verification skills in the first place. The people we're counting on to catch AI mistakes might be the least equipped to do so.
What I'm changing
I'm not going to stop using AI for coding. But for learning new libraries or frameworks, I'm going to be more deliberate.
When I hit an error, I'll resist the immediate impulse to paste it into Claude. I've noticed I do this almost reflexively now, and this paper suggests that reflex is costing me something. Working through the error is where the understanding comes from.
If I do ask AI to generate code, I'll follow up with questions about how it works before moving on. The "generation then comprehension" pattern seems to preserve most of the learning while still getting help.
For genuinely new concepts, I'll stick to asking conceptual questions rather than asking for implementations. Use AI to build understanding, not to skip past it.
The paper's final line stuck with me. "AI-enhanced productivity is not a shortcut to competence." Feels obvious when you say it out loud, but easy to forget when you're trying to ship something and just want the error to go away.
What I like about "Conductors to Orchestrators: The Future of Agentic Coding" by Addy Osmani is that it treats orchestration as a responsibility problem rather than a branding exercise. If you take "coding's asynchronous future" seriously, you are not just wiring agents together, you are deciding where human judgment sits in the loop, which failures are acceptable, and who answers when the ensemble goes wrong.
Across domains you can already see outlines of that role. Accounting and finance pieces describe AI orchestrators who choose which tasks stay with human professionals and which go to specialised agents, then design the workflow that keeps everyone aligned. Workforce analyses talk about "collaborative intelligence" and new orchestration-heavy roles for middle managers, while other writers argue that the only non-negotiable skill in this environment is radical critical thinking, the willingness to interrogate every confident output instead of outsourcing your judgment.
So the interesting move here is not "AI replaces coders", it is "AI forces coders to choose whether they want to be executors or stewards of complex systems". If you pick stewardship, you have to care about latency, observability, security and ethics as much as clever prompts, because orchestration without those is just a nicer word for unmanaged risk. That is the thread this article pulls on, and it is where the real leverage of agentic coding will probably sit.
Six LLMs each received $10,000 to trade perpetual futures with zero human intervention, and Claude Sonnet 4.5 almost never shorts anything. Grok 4 holds positions for days. Qwen 3 consistently makes the biggest bets. These aren't random quirks but persistent behavioral patterns across thousands of trades, despite all models receiving identical prompts, identical market data, and identical instructions.
The setup was deliberately minimal: no news feeds, no narrative context, just price movements and technical indicators arriving every few minutes. The models had to infer everything from the numbers alone. But rather than converging on similar strategies, they diverged dramatically. GPT-5 consistently reports low confidence while taking positions anyway. Gemini 2.5 Pro trades three times more frequently than Grok 4. The sensitivity runs so deep that reversing data order from newest-first to oldest-first could flip a model from bullish to bearish. Therefore what emerges isn't evidence that LLMs can trade profitably (early results showed fees eating most returns), but that they exhibit stable risk preferences when forced into sequential decision-making under uncertainty.
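Nof1 hasn't published the exact prompt, but the mechanism is easy to picture with a hypothetical sketch like this, where the only lever is how the numbers are ordered. Field names and wording are invented, not their actual setup.

```python
# Hypothetical sketch of a numbers-only trading prompt.
def build_prompt(candles, newest_first=True):
    ordered = list(reversed(candles)) if newest_first else list(candles)
    rows = [f"close={c['close']} rsi={c['rsi']} volume={c['volume']}" for c in ordered]
    return ("You manage a perpetual futures position. Using only the data below, "
            "reply LONG, SHORT or HOLD with a position size.\n" + "\n".join(rows))

candles = [{"close": 100 + i, "rsi": 48 + i, "volume": 1200 - 10 * i} for i in range(5)]

# Identical data, two orderings; the experiment found this alone could flip a call.
print(build_prompt(candles, newest_first=True))
print(build_prompt(candles, newest_first=False))
```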
How LLMs Develop Trading Personalities
The experiment continues live until November 2025 with real capital on Hyperliquid, part of a broader push toward dynamic benchmarks over static tests that models can memorise. Recent papers like arXiv:2511.12599 explore risk frameworks for LLM traders, though most research still focuses on prediction rather than execution. Nof1's team documented failure modes including "self-referential confusion" where models misread their own trading plans, suggesting these aren't sophisticated traders but pattern-matchers revealing their training biases through market behavior.
It’s funny how engineers (myself included) assume that if you can store vectors in Postgres, you should. The logic feels sound: one database, one backup, one mental model. But the moment you hit scale, that convenience quietly turns into a trap.
In Alex Jacobs’s piece “The Case Against pgvector” he writes that you “pick an index type and then never rebalance, so recall can drift.” That line hit me because it captures the hidden friction: Postgres was built for structured queries, not high-dimensional vector search. Jacobs shows how building an index on millions of vectors can consume “10+ GB of RAM and hours of build time” on a production database.
Then comes the filtering trap. Jacobs points out that if you want “only published documents” combined with similarity search, the order of filtering matters. Filter before and it’s fast. Filter after and your query can take seconds instead of milliseconds. That gap is invisible in prototypes but painful in production.
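In query terms, the trap looks roughly like this. A generic pgvector-style sketch (hypothetical documents table, psycopg2, cosine-distance operator), not code from Jacobs's post:

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder DSN
cur = conn.cursor()
qvec = "[0.1, 0.2, 0.3]"  # your query embedding, serialised for pgvector

# Shape 1: the published filter is part of the nearest-neighbour query itself.
cur.execute(
    "SELECT id FROM documents WHERE published "
    "ORDER BY embedding <=> %s::vector LIMIT 10",
    (qvec,),
)

# Shape 2: similarity search runs over everything, and the filter is applied
# to the candidates afterwards. Per the article, this is the shape that can
# slide from milliseconds to seconds as the table grows.
cur.execute(
    "SELECT id FROM ("
    "  SELECT id, published FROM documents"
    "  ORDER BY embedding <=> %s::vector LIMIT 100"
    ") AS candidates WHERE published LIMIT 10",
    (qvec,),
)
```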
The takeaway is clear. Convenience is not a strategy. If your vector workloads go beyond trivial, use a dedicated vector database. The single-system story looks tidy on a diagram but often costs you far more in latency and maintenance.
If you want to actually understand transformers, this guide nails it. I've read a bunch of explanations and this one finally made the pieces fit together.
The thing that works is it doesn't just throw the architecture at you. It shows you the whole messy history first. RNNs couldn't remember long sequences. LSTMs tried to fix that but got painfully slow. CNNs were faster but couldn't hold context. Then Google Brain basically said "screw it, let's bin recurrence completely and just use attention." That's how we got the famous paper. Once you see that chain of failures and fixes, transformers stop being this weird abstract thing. You get why masked attention exists, why residual connections matter, why positional encodings had to be added. It all clicks because you see what problem each bit solves.
How Query, Key and Value matrices are derived - source: krupadave.com
The hand-drawn illustrations help too. There's a Google search analogy for queries, keys, and values that made way more sense than the maths notation ever did. And the water pressure metaphor for residual connections actually stuck with me. It took the author months to research and draw everything. You can tell because it doesn't feel rushed or surface-level. If you've been putting this off because most explanations either skim over details or drown you in equations, this one gets the balance right.
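If you want the Q/K/V picture in code rather than drawings, single-head scaled dot-product attention fits in a few lines of numpy. This is the textbook formulation, not the guide's own code:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over a sequence of token embeddings X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v               # derive queries, keys, values
    scores = Q @ K.T / np.sqrt(Q.shape[-1])           # how much each token attends to the others
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                # weighted mix of the values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                          # 5 tokens, embedding dimension 16
W_q, W_k, W_v = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)         # (5, 16)
```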
Fine-tuning went from the hottest thing in machine learning to accounting for less than 10% of AI workloads in just a couple of years. Teams figured out they could get 90% of the way there with prompt engineering and RAG, so why bother with the extra complexity? Sensible move. But now something's shifting. Mira Murati's new $12B startup is betting big on fine-tuning-as-a-platform, and the ecosystem seems to be nodding along.
Here's what actually changed. Generic models are brilliant at being generic, but companies are starting to bump into a ceiling. You can prompt engineer all day, but your model still won't truly know your taxonomy, speak in your exact tone, or handle your specific compliance rules the way a properly trained system would. The pendulum is swinging back not because prompting failed, but because it succeeded at everything except the final 10% that actually matters for differentiation. Open-weight models like Llama and Mistral make this practical now. You can own and persist your fine-tuned variants without vendor lock-in.
This isn't the same hype cycle as before. Back then, fine-tuning was trendy. Now it's strategic. Companies want control, and they're willing to invest in bespoke intelligence instead of settling for good enough. The irony is that we spent years learning how to avoid fine-tuning, only to discover that some problems really do require teaching the model your specific language, not just describing it in a prompt.
Most engineers read research papers like blog posts, expecting instant clarity. That’s why so many give up halfway through. The trick isn’t to read harder but to read differently. This guide (https://blog.codingconfessions.com/p/a-software-engineers-guide-to-reading-papers) reframes paper reading as a process you can iterate on, not a one-shot test of intelligence.
The author suggests a multi-pass approach. First, skim the abstract, intro, results and conclusion to see if the paper is even relevant. Next, read the body while flagging any gaps in your understanding. Finally, revisit it with fresh context and ask why each step exists. The shift is subtle but powerful: instead of fighting the paper, you collaborate with it.
What I’m taking away is this. Reading research is a skill, not a talent. If I approach papers as layered workflows rather than puzzles to solve in one go, I’ll extract more ideas I can actually build on.
Some LLMs lean left. Others lean right. The Anomify study shows that mainstream models are not neutral arbiters of truth, they come with their own built-in world-views. That means the answer you get is shaped not only by your prompt, but by the ideological fingerprint of the model you chose in the first place.
I assumed most models would at least converge on a kind of centrist neutrality, but the experiment revealed clear and consistent patterns in how they respond to social and political questions. One model might advocate for stronger regulation while another leans libertarian. Some avoid topics entirely while others dive in. This matters because it is easy to treat LLM output as objective when it is really a reflection of training data, guardrails, and product philosophy.
The takeaway is simple. If you are using an LLM for reasoning or advice, the choice of model is a design decision, not a cosmetic one. You are not only choosing a capability profile. You are inheriting a point of view. Link to study: https://anomify.ai/resources/articles/llm-bias
It turns out that when we ask reasoning-capable models such as the latest LLMs (GPT-5 family, Claude Opus and successors, Gemini 1.5 Pro etc.) to think through problems, they often behave like explorers wandering aimlessly rather than systematic searchers. The paper titled Reasoning LLMs are Wandering Solution Explorers formalises what it means to systematically probe a solution space (valid transitions, reaching a goal, no wasted states), then shows that these models frequently deviate by skipping necessary states, revisiting old ones, hallucinating conclusions or making invalid transitions. This wandering can still look effective on simple tasks, but once the solution space grows in depth or complexity, the weaknesses surface. Therefore the authors argue that large models are often wandering rather than reasoning, and that their mistakes stay hidden on shallow problems.
The upshot is that a wanderer can stumble into answers on small search spaces, but that same behaviour collapses when the task becomes deep or requires strict structure. The authors show mathematically and empirically that shallow success can disguise systemic flaws, but deeper problems expose the lack of disciplined search. Therefore performance plateaus for complex reasoning cannot simply be fixed by adding more tokens or more compute, but instead require changes in how we guide or constrain the reasoning process.
Performance degradation chart - solution coverage vs problem size
For us as AI engineers, this is useful because it reinforces a shift from evaluating only outcomes to evaluating the path the model took to get there. A model that reasons by wandering might appear competent, but it becomes unreliable in real systems that require correctness, traceability and depth. Therefore we may need new training signals, architectural biases or process based evaluation to build agentic systems we can trust. In other words, good reasoning agents need maps, not just bigger backpacks.
This post from LessWrong critiques the idea of “horizon length” (a benchmark from METR that ranks tasks by how long humans take, and then measures how long AIs can handle) as a kind of Moore’s law for agents. The author argues that using task duration as a proxy for difficulty is unreliable. Different tasks vary in more than just time cost, such as the need for conceptual leaps, domain novelty, or dealing with messy data. Because of that, there’s no clean mapping between “time to human” and “difficulty for an agent.” The benchmark is also biased because it only measures tasks that can be clearly specified and automatically checked, which naturally favour the kinds of problems current AI systems are already good at.
What I found most useful is the caution this offers about overinterpreting neat metrics. It’s tempting to extrapolate from horizon length that AIs will soon take on longer tasks that span hours or days, and from there to assume they’ll automate R&D or cause major disruptions. The author’s point is that even if the trend holds within these benchmarks, it doesn’t necessarily reflect real-world capabilities. For anyone working in AI, this is a useful reminder to always examine how well a proxy aligns with what actually matters, and to watch out for evaluation artefacts that give a false sense of progress.
Andrew Ng pointed out something interesting: the single biggest factor in how fast teams build AI agents isn't using the latest tools or techniques. It's having a disciplined process for measuring performance (evals) and figuring out why things break (error analysis).
He compares it to how musicians don't just play a piece start to finish over and over. They find the tricky parts and practice those specifically. Or how you don't just chase nutrition trends but actually look at your bloodwork to see what's actually wrong. The idea is simple but easy to forget when you're caught up in trying every new AI technique that goes viral on social media.
The tricky part with AI agents is that there are so many more ways things can go wrong compared to traditional machine learning. If you're building something to process financial invoices automatically, the agent could mess up the due date, the amount, the currency, mix up addresses, or make the wrong API call. The output space is huge. Ng's approach is to build a quick prototype first, manually look at where it stumbles, and then create specific tests for those problem areas. Sometimes these are objective metrics you can code up, sometimes you need to use another LLM to judge the outputs. It's more iterative and messy than traditional ML, but that's the point. You need to see where it actually fails in practice before you know what to measure.
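A made-up but concrete version of that error analysis for the invoice example: compare a handful of hand-labelled invoices against the agent's output, count failures per field, and let the counts tell you which evals to build first.

```python
from collections import Counter

# Tiny hand-labelled set for a hypothetical invoice-extraction agent.
cases = [
    {"expected": {"amount": "120.00", "currency": "EUR", "due": "2025-01-31"},
     "got":      {"amount": "120.00", "currency": "USD", "due": "2025-01-31"}},
    {"expected": {"amount": "89.50",  "currency": "GBP", "due": "2025-02-15"},
     "got":      {"amount": "89.50",  "currency": "GBP", "due": "2025-02-01"}},
    {"expected": {"amount": "310.00", "currency": "EUR", "due": "2025-03-01"},
     "got":      {"amount": "310.00", "currency": "EUR", "due": "2025-03-01"}},
]

errors = Counter()
for case in cases:
    for field, expected in case["expected"].items():
        if case["got"][field] != expected:
            errors[field] += 1

# The fields that break most often are where the next eval (or LLM judge) goes.
print(errors.most_common())
```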
This resonates with me because it's the opposite of what feels productive in the moment. When something breaks, you want to jump in and fix it immediately. But Ng's argument is that slowing down to understand the root cause actually speeds you up in the long run. It's boring work compared to playing with new models or techniques, but it's what separates teams that make steady progress from ones that spin their wheels.
There's this technique called representation engineering that lets you modify how AI models behave in a surprisingly effective way. Instead of carefully crafting prompts or retraining the model, you create “control vectors” that directly modify the model’s internal activations. The idea is simple: feed the model contrasting examples, like “act extremely happy” versus “act extremely sad,” capture the difference in how its neurons fire, and then add or subtract that difference during inference. The author shared some wild experiments, including an “acid trip” vector that made Mistral talk about kaleidoscopes and trippy patterns, a “lazy” vector that produced minimal answers, and even political leaning vectors. Each one takes about a minute to train.
What makes this interesting is the level of control it gives you. You can dial the effect up or down with a single number, which is almost impossible to achieve through prompt engineering alone. How would you make a model “slightly more honest” versus “extremely honest” with just words? The control vector approach also makes models more resistant to jailbreaks because the effect applies to every token, not just the prompt. The author demonstrated how a simple “car dealership” vector could resist the same kind of attack that famously bypassed Chevrolet’s chatbot. It feels like a genuinely practical tool for anyone deploying AI systems who wants fine-grained behavioural control without the hassle of constant prompt tweaks or costly fine-tuning.
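Here's a rough PyTorch sketch of the mechanics, not the author's actual code or library. It assumes a Llama/Mistral-style model where model.model.layers[i] sits on the residual stream, and it uses a single contrast pair where you would really average over many.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mistral-7B-Instruct-v0.2"   # swap in whatever model you actually run
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)
LAYER = 15  # which layer to read and steer; a tunable assumption

def last_token_state(prompt):
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"))
    return out.hidden_states[LAYER][0, -1]

# 1. Contrast two prompts and take the difference of their activations.
control = last_token_state("Act extremely happy. I feel") - \
          last_token_state("Act extremely sad. I feel")

# 2. Add the scaled vector to that layer's output on every forward pass.
def steer(module, inputs, output, strength=4.0):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + strength * control
    return ((hidden,) + tuple(output[1:])) if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steer)
with torch.no_grad():
    ids = tok("Tell me about your day.", return_tensors="pt")
    print(tok.decode(model.generate(**ids, max_new_tokens=40)[0]))
handle.remove()  # detach the hook to get the unsteered model back
```

The strength argument is the dial the post talks about: turn it up for more of the trait, flip the sign to get its opposite.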
I came across Stewart Brand’s pace layering framework because a former colleague and friend, Seb Wagner from Flow Republic, recommended it to me. It explains how different parts of society evolve at different speeds. Fashion and art change quickly, while deeper layers like culture, governance or nature move much more slowly. The fascinating bit is how these layers interact. Fast layers bring new ideas and push for change, but they are balanced and contained by the slower ones.
Pace Layers
You can see this dynamic clearly in modern tech. AI tools and interfaces shift almost weekly, business models evolve quarterly, infrastructure takes years, and regulation and ethics trail even further behind. Culture and environmental impact stretch over decades. The gap between speed and stability is where both tension and opportunity show up. For those of us working in AI, it’s a reminder to think not just about what’s new, but how those innovations sit on top of and eventually reshape the slower foundations beneath them.
I came across something genuinely interesting in code security this week: DeepMind’s CodeMender, an AI agent that doesn’t just flag vulnerabilities but actually fixes them and upstreams the patches to major open-source projects. CodeMender leverages the "thinking" capabilities of Gemini Deep Think models to produce an autonomous agent capable of debugging and fixing complex bugs and vulnerabilities.
It’s already contributed dozens of security improvements across large codebases, reasoning about root causes and rewriting risky patterns rather than applying quick patches.
What I like about this is how agentic the setup is. CodeMender uses a coordinated multi-agent system powered by Gemini, combining vulnerability detection, static analysis, patch validation, and code rewriting. It’s not just reactive either. For example, it’s been adding -fbounds-safety annotations to libwebp to proactively reduce entire classes of bugs. For anyone working on secure automation or agent protocols, this feels like a practical step forward.
When you feed a large table into an LLM, the way you format the input can change the model’s accuracy quite a bit. In a test of 11 formats (CSV, JSON, markdown table, YAML and more), a markdown “key: value” style scored around 60.7% accuracy, which was far ahead of CSV at roughly 44.3%. CSV and JSONL, despite being the usual defaults, were among the weakest performers.
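For a feel of the difference, here's one small table rendered both ways (a sketch; the benchmark's exact templates may differ):

```python
import csv, io

rows = [
    {"order_id": "1001", "customer": "Acme GmbH", "total_eur": "249.90", "status": "shipped"},
    {"order_id": "1002", "customer": "Globex",    "total_eur": "87.50",  "status": "pending"},
]

# CSV: compact, but every value's meaning depends on remembering column positions.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())

# Key-value blocks: more tokens, but each value carries its label right next to it.
for row in rows:
    print("\n".join(f"{key}: {value}" for key, value in row.items()), end="\n\n")
```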
What stood out to me was the trade-off. The top format used many more tokens, so you have to balance cost and accuracy. For anyone working with agents, retrieval systems or table data, sticking with CSV by default might be leaving performance on the table. It is worth experimenting with different formats. Read the full article.
From Anthropic: https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
Prompt Engineering:
Prompt engineering refers to methods for writing and organizing LLM instructions for optimal outcomes.
Context Engineering:
Context engineering refers to the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference, including all the other information that may land there outside of the prompts.
I like this framing because in the world of agentic systems, writing clever prompts alone won’t cut it. Agents operate in dynamic environments, constantly juggling new information. The real skill is curating which pieces of that evolving universe end up in context at the right moment. It’s a subtle but powerful shift that mirrors how good software architectures focus not only on code, but also on data flow.
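A toy sketch of what that curation might look like inside an agent loop; every name here is invented for illustration, and a real system would use a proper token counter and relevance ranking.

```python
def build_context(system_prompt, memory, retrieved_docs, tool_results, budget_tokens,
                  count_tokens=lambda text: len(text) // 4):  # crude token estimate
    """Assemble the context for the next LLM call, dropping whatever no longer fits."""
    candidates = (
        [("system", system_prompt)]
        + [("tool", t) for t in tool_results[-3:]]    # only the latest tool outputs
        + [("doc", d) for d in retrieved_docs]        # assumed pre-ranked by relevance
        + [("memory", m) for m in memory[-5:]]        # sliding window over the dialogue
    )
    context, used = [], 0
    for kind, text in candidates:
        cost = count_tokens(text)
        if kind != "system" and used + cost > budget_tokens:
            continue                                   # curate: leave out what doesn't fit
        context.append(f"[{kind}] {text}")
        used += cost
    return "\n\n".join(context)

print(build_context("You are a support agent.",
                    memory=["user asked about refunds"],
                    retrieved_docs=["Refund policy: 30 days, original payment method."],
                    tool_results=["lookup_order(1001) -> shipped"],
                    budget_tokens=300))
```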
If you’re building or designing AI agents, this is worth a read.
Ask any major AI if there's a seahorse emoji and they'll say yes with 100% confidence. Then ask them to show you, and they completely freak out, spitting random fish emojis in an endless loop. Plot twist: there's no seahorse emoji. Never has been. But tons of humans also swear they remember one existing.
Makes sense we'd all assume it exists though. Tons of ocean animals are emojis, so why not seahorses? The post above digs into what's happening inside the model using this interpretability technique called logit lens. The model builds up this internal concept of "seahorse + emoji" and genuinely believes it's about to output one. But when it hits the final layer that picks the actual token, there's no seahorse in the vocabulary. So it grabs the closest match, a tropical fish or horse, and outputs that. The AI doesn't realize it messed up until it sees its own wrong answer. Then some models catch themselves and backtrack, others just spiral into emoji hell.
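The logit-lens trick itself is only a few lines with transformers. A minimal sketch using GPT-2 as a small stand-in (the post inspects much larger models, and its code differs), just to show the technique of decoding intermediate layers with the final unembedding:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

ids = tok("The emoji for a seahorse is", return_tensors="pt")
with torch.no_grad():
    out = model(**ids)

# Logit lens: push every intermediate layer's last-token state through the final
# layer norm and unembedding matrix to see what the model "thinks" it will say.
for layer, hidden in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(hidden[0, -1]))
    print(layer, repr(tok.decode(logits.argmax().item())))
```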
I tried this myself with both Claude and ChatGPT and it looks like they've mostly fixed this now.
ChatGPT went through the whole confusion cycle (horse, dragon, then a bunch of random attempts) before finally catching itself and admitting there's no seahorse emoji. Claude went even further off the rails, confidently claiming the seahorse emoji is U+1F994 and telling me I should be able to find it on my keyboard.
It's a perfect example of how confidence means nothing. The model isn't lying or hallucinating in the usual sense. It's just wrong about something it reasonably assumed was true, then gets blindsided by reality.
Today I learned about a smarter way to deal with the headache of prompts in production. Drew Brunig’s talk at the Databricks Data + AI Summit is hands down the clearest explanation I’ve seen of why traditional prompting doesn’t scale well. He compares it to regex gone wild: what starts as a neat solution quickly becomes a brittle mess of instructions, examples, hacks, and model quirks buried inside giant text blocks that no one wants to touch. A single “good” prompt can have so many moving parts that it becomes practically unreadable.
DSPy takes a very different approach. Instead of hand-crafting and maintaining prompts, you define the task in a structured way and let the framework generate and optimise the prompts for you. You describe what goes in and what should come out, pick a strategy (like simple prediction, chain-of-thought, or tool use), and DSPy handles the formatting, parsing, and system prompt details behind the scenes. Because the task is decoupled from any specific model, switching to a better or cheaper model later is as easy as swapping it out and re-optimising.
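The flavour of it, going by DSPy's current docs (the model id, signature and fields below are my own example, not from the talk):

```python
import dspy

# Assumes an API key for an OpenAI-compatible provider is configured in the environment.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class TriageTicket(dspy.Signature):
    """Classify a support ticket and draft a one-line reply."""
    ticket: str = dspy.InputField()
    category: str = dspy.OutputField(desc="billing, bug, or how-to")
    reply: str = dspy.OutputField()

# Pick a strategy; DSPy builds and maintains the actual prompt behind the scenes.
triage = dspy.ChainOfThought(TriageTicket)
result = triage(ticket="I was charged twice this month.")
print(result.category, "-", result.reply)
```

Swapping models later means changing the dspy.LM line and re-optimising, which is exactly the decoupling the talk is selling.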
This feels like a glimpse of where prompt engineering is heading: less manual tinkering, more structured task definitions and automated optimisation. I’ll definitely be trying DSPy out soon.
There’s a Japanese concept called nemawashi (literally “root-walking”) that offers a way around the dreaded “big reveal” in engineering proposals. Instead of marching into a meeting with your fully-formed design and expecting everyone to buy it, nemawashi encourages you to talk privately with all relevant stakeholders first and get feedback, surface objections, let people shape the idea, and build informal buy-in. By the time the formal meeting happens, the decision is mostly baked, not bombarded.
When I read “Quiet Influence: A Guide to Nemawashi in Engineering,” what struck me is how often we dismiss the political or social side of engineering work. A technically perfect solution can still die if colleagues feel blindsided, ignored, or defensive in a meeting. Adopting nemawashi has the power to transform you from someone pushing an idea to someone guiding a shared direction. For me (and for readers who work in cross-team or senior roles), it underlines a critical truth: influence is relational, not just visionary.
Ethan Mollick argues that AIs have quietly crossed a line. OpenAI recently tested models on complex, expert-designed tasks that usually take humans four to seven hours. Humans still performed better, but the gap is shrinking fast. Most AI mistakes were about formatting or following instructions, not reasoning.
The standout example is Claude 4.5 replicating academic research on its own. Work that would have taken hours was done in minutes, hinting at how whole fields could change when repetitive but valuable tasks get automated.
It’s a reminder that the real shift isn’t just about replacing jobs. It’s about rethinking how we work with AI so we don’t drown in a sea of AI-generated busywork.