Jun 01, 2026

Query, Key, Values

How to think about Q, K, and V vectors in the Attention layer of a Large Language Model

[As part of my TIL series, building an intuition about Q, K, V]

A good way to understand QKV is this:

Attention is a soft lookup operation.

Given a token, the model asks:

“What information should I pull from the other tokens?”

Q, K and V are just three different learned linear projections of the same input token representations.

The simplest mental model

For each token, the model creates three vectors:

Query -> "What am I looking for?"
Key -> "What do I contain/advertise?"
Value -> "What information should I pass on if selected?"

So attention works like this:

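Compare a token’s Query against the Key of every token it is allowed to see (itself included; in a decoder-only LLM, that means the tokens up to and including the current one).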
Turn those similarities into weights.
Use those weights to take a weighted average of the Values.

The formula is:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

Meaning:

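similarity scores = QKᵀ / √dₖ
attention weights = softmax(similarity scores)
output = attention weights × V

Dividing by √dₖ keeps the scores from growing with the key dimension, which would otherwise push the softmax towards near one-hot weights.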
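The three steps above can be sketched in plain Python. This is a minimal, dependency-free illustration rather than how any real model implements it: the helper names softmax and attention and all the numbers are made up for the example, and real implementations use batched tensor operations.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # similarity scores: one scaled dot product per key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        # attention weights: non-negative and summing to 1
        weights = softmax(scores)
        # output: weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
    return outputs

# Toy inputs (made-up numbers): 2 queries, 3 keys/values, d_k = 2
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
result = attention(Q, K, V)
print(result)
```

Each output row is a blend of the rows of V, so it always lies inside the range spanned by the values. The first query leans towards the first value, the second query towards the second.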

Concrete example

Take the sentence:

The dog chased the ball because it was excited.

When processing the token “it”, the model needs to decide what “it” refers to.

For the token “it”:

Q_it = “I am looking for the thing this pronoun refers to”

Other tokens expose keys:

K_dog  = “I am an animal / possible subject”
K_ball = “I am an object / possible noun”

The model compares:

Q_it · K_dog
Q_it · K_ball

If Q_it · K_dog is higher, then “it” attends more strongly to “dog”.

Then the output for “it” becomes a weighted mixture of the value vectors, especially:

V_dog

So the model enriches the representation of “it” with information from “dog”.
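To make this concrete, here is a toy version of the “it” example. The 2-dimensional vectors are invented by hand purely for illustration (real models learn vectors with many more dimensions, and their individual components are not this interpretable), and only two candidate tokens are included to keep it short.

```python
import math

# Hypothetical hand-picked vectors, not taken from any real model.
q_it = [2.0, 0.5]
keys = {"dog": [2.0, 0.0], "ball": [0.5, 1.0]}
values = {"dog": [1.0, 0.0], "ball": [0.0, 1.0]}

d_k = len(q_it)

# Q_it · K_token, scaled by sqrt(d_k)
scores = {
    tok: sum(a * b for a, b in zip(q_it, k)) / math.sqrt(d_k)
    for tok, k in keys.items()
}

# softmax over the two candidates
m = max(scores.values())
exps = {tok: math.exp(s - m) for tok, s in scores.items()}
total = sum(exps.values())
weights = {tok: e / total for tok, e in exps.items()}

# The new representation for "it": a weighted mixture of V_dog and V_ball
output_it = [
    sum(weights[tok] * values[tok][j] for tok in keys)
    for j in range(len(values["dog"]))
]

print(weights)
print(output_it)
```

With these numbers “dog” receives most of the weight, so output_it sits much closer to V_dog than to V_ball.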

Why separate Q, K and V?

This is the key bit.

The model does not use the raw token embedding directly. It learns three different views of each token:

Q = XW_Q
K = XW_K
V = XW_V

Same input X, different learned matrices.
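A small sketch of those three projections, in plain Python. The sizes (4 tokens, model width 8, head width 3) are arbitrary, and the matrices are filled with random numbers as stand-ins: in a trained model W_Q, W_K and W_V are learned, not random. The helpers matmul and random_matrix are written here just for the example.

```python
import random

def matmul(A, B):
    # (n x m) times (m x p) -> (n x p), using lists of rows
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def random_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
n_tokens, d_model, d_k = 4, 8, 3  # arbitrary illustrative sizes

X = random_matrix(n_tokens, d_model)   # one row per token
W_Q = random_matrix(d_model, d_k)      # stand-in for a learned matrix
W_K = random_matrix(d_model, d_k)
W_V = random_matrix(d_model, d_k)

# Same input X, three different views of it
Q = matmul(X, W_Q)
K = matmul(X, W_K)
V = matmul(X, W_V)

print(len(Q), len(Q[0]))  # rows = tokens, columns = d_k
```

Every token gets one row in each of Q, K and V. The rows differ between the three only because the matrices differ.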

Why?

Because “what I am looking for”, “how I should be matched”, and “what information I should contribute” are different jobs.

For example, the word “bank” might need to:

Q: look for context that disambiguates meaning
K: advertise that it is a noun, place, institution, river edge, etc.
V: contribute semantic content once selected

A single shared vector would have to serve all three roles at once, and that is a lot to ask of it. Separate projections give the model specialised subspaces: Q and K for matching, V for information transfer.

The database analogy

This is probably the most useful analogy:

Query  = search query
Key    = index / searchable metadata
Value  = retrieved content

Attention is like searching a database where every token is a record.

Token = record
Key   = searchable field
Value = payload
Query = search request from current token

The attention score says:

How relevant is this token’s key to my query?

The output says:

Give me the values from the most relevant tokens.
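The difference from a real database is that the lookup is soft: instead of returning one record, it returns a blend. The sketch below shows that under some assumptions of mine: soft_lookup is a made-up helper, the sharpness parameter is a hypothetical knob added only to show the limiting behaviour (it is not part of the standard attention formula), and the √dₖ scaling is left out for brevity.

```python
import math

def soft_lookup(query, keys, values, sharpness=1.0):
    # Score every record's key against the query.
    scores = [sharpness * sum(q * k for q, k in zip(query, key)) for key in keys]
    # Softmax turns scores into weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Return a weighted blend of the payloads, not a single record.
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]

keys = [[1.0, 0.0], [0.0, 1.0]]   # searchable fields
values = [[100.0], [200.0]]       # payloads
query = [0.9, 0.1]                # mostly matches the first key

blended = soft_lookup(query, keys, values)
nearly_hard = soft_lookup(query, keys, values, sharpness=50.0)

print(blended, nearly_hard)
```

With the default sharpness the result lands between the two payloads, closer to the first. As the scores get more extreme, the blend collapses towards the single best match, which is what an ordinary key-value lookup would return.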

The important correction

People often say:

“Q asks a question, K answers it, V stores the answer.”

That is okay as a beginner analogy, but slightly misleading.

More accurately:

Q and K decide routing.
V carries content.

Q and K determine where to attend.

V determines what information gets copied/mixed into the output.
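One way to see the split is to keep Q and K fixed and swap out V. In this sketch (toy numbers, with helper names attention_weights and mix invented for the example), the weights are computed without ever touching the values, so changing V changes what arrives but not where it comes from.

```python
import math

def attention_weights(q, keys):
    # Routing: depends only on the query and the keys.
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mix(weights, values):
    # Content: the values are blended using the routing weights.
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]

q = [1.0, 2.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

values_a = [[1.0], [2.0], [3.0]]
values_b = [[-5.0], [0.0], [5.0]]

weights = attention_weights(q, keys)  # computed once, V not involved
out_a = mix(weights, values_a)
out_b = mix(weights, values_b)

print(weights, out_a, out_b)
```

The same weights are reused for both sets of values. Only the outputs differ.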

One-line understanding

QKV attention is learned content-based routing: each token forms a query, matches it against other tokens’ keys, then pulls back a weighted blend of their values.


by Anup Jadhav
