TIL: Quantisation

Spent some time properly working through quantisation this week.

I liked this piece (from ngrok) because it does not stop at “make the model smaller”. It gets into the actual mechanics: lower-bit representations, scale factors, dequantisation, and the trade-off between compression and error.
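To make the scale-factor round trip concrete, here is a minimal sketch of symmetric int8 quantisation and dequantisation in NumPy. This is my own toy reconstruction of the general mechanics, not code from the piece; the function names are made up.

```python
import numpy as np

def quantise_int8(x):
    # Symmetric (absmax) quantisation: map the largest magnitude to 127,
    # so every value shares one float scale factor.
    scale = np.max(np.abs(x)) / 127.0
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def dequantise(q, scale):
    # Recover approximate floats; the rounding error is at most scale / 2.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 1, 1000).astype(np.float32)

q, scale = quantise_int8(w)
w_hat = dequantise(q, scale)
print("max abs error:", np.max(np.abs(w - w_hat)))
```

The compression/error trade-off is visible directly: the int8 tensor is a quarter the size of float32, and the worst-case reconstruction error is half the quantisation step (scale / 2), which grows with the tensor's dynamic range.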

I also implemented a small version of the workflow locally to make the ideas concrete for myself. The bit that stood out most was how differently symmetric and asymmetric quantisation behave once you actually look at the error distribution, rather than just the file size.
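The symmetric-vs-asymmetric gap shows up clearly on skewed data. A sketch of the comparison I mean, assuming a one-sided tensor like post-ReLU activations (the helper names here are hypothetical):

```python
import numpy as np

def symmetric_mse(x, bits=8):
    # Symmetric: range is [-absmax, absmax], zero-point fixed at 0.
    qmax = 2 ** (bits - 1) - 1  # 127 for int8
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return np.mean((x - q * scale) ** 2)

def asymmetric_mse(x, bits=8):
    # Asymmetric: range is [min, max], shifted by a zero-point,
    # so no quantisation levels are wasted on values that never occur.
    qmax = 2 ** bits - 1  # 255 for uint8
    scale = (x.max() - x.min()) / qmax
    zero_point = np.round(-x.min() / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax)
    return np.mean((x - (q - zero_point) * scale) ** 2)

rng = np.random.default_rng(0)
acts = np.maximum(rng.normal(0, 1, 10_000), 0)  # skewed: all non-negative

print("symmetric MSE: ", symmetric_mse(acts))
print("asymmetric MSE:", asymmetric_mse(acts))
```

For this non-negative tensor, symmetric quantisation wastes the entire negative half of its range, so its step size is roughly double the asymmetric one and its mean squared error is correspondingly higher. On a roughly zero-centred weight tensor the two schemes come out much closer, which is why the choice depends on what you are quantising.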

My main takeaway is that quantisation is really a precision-allocation problem. The question is not just how much you can compress a model, but how much numerical fidelity you can give up before your task stops working.
