Query, Key, Value

How to think about Q, K, and V vectors in the Attention layer of a Large Language Model

[As part of my TIL series, building an intuition about Q, K, V]

A good way to understand QKV is this:

Attention is a soft lookup operation.

Given a token, the model asks:

“What information should I pull from the other tokens?”

Q, K and V are just three different projections of the same input: the token embeddings (or, in deeper layers, the hidden states coming out of the previous layer).

The simplest mental model

For each token, the model creates three vectors:

  • Query -> "What am I looking for?"
  • Key -> "What do I contain/advertise?"
  • Value -> "What information should I pass on if selected?"

So attention works like this:

  1. Compare a token’s Query against the Key of every token it is allowed to see (including its own).
  2. Turn those similarities into weights.
  3. Use those weights to take a weighted average of the Values.

The formula is:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

Meaning:

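similarity scores = QKᵀ / √dₖ
attention weights = softmax(similarity scores)
output = attention weights × V

Dividing by √dₖ keeps the scores from growing with the key dimension, which would otherwise push the softmax towards a near one-hot distribution.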
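To make the three steps concrete, here is a minimal NumPy sketch of the formula for a single attention head. The function names, the shapes (4 tokens, dimension 8) and the random inputs are my own choices for illustration; a real model would use learned projections and batched tensors.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_tokens, n_tokens) similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted average of the value rows

# Random stand-ins for Q, K, V: 4 tokens, dimension 8 (illustrative only).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

out, weights = attention(Q, K, V)
print(weights.sum(axis=-1))  # one sum per token row
print(out.shape)
```

Each row of `weights` belongs to one token and says how much of every other token's value it pulls in.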

Concrete example

Take the sentence:

The dog chased the ball because it was excited.

When processing the token “it”, the model needs to decide what “it” refers to.

For the token “it”:

Q_it = “I am looking for the thing this pronoun refers to”

Other tokens expose keys:

K_dog  = “I am an animal / possible subject”
K_ball = “I am an object / possible noun”

The model compares:

Q_it · K_dog
Q_it · K_ball

If Q_it · K_dog is higher, then “it” attends more strongly to “dog”.

Then the output for “it” becomes a weighted mixture of the value vectors, especially:

V_dog
[Figure: QKV matrix calculation resolving "it" to "dog". Three stages: the QKᵀ / √dₖ score grid (how aligned each q·k is), the softmax attention weights A (each row sums to 1), and multiplication by the value matrix V to produce the output for the token "it". Reading the "it" row with the figure's illustrative weights: out_it = 0.82·v_dog + 0.06·v_ball + 0.12·v_it, so mostly dog. The left grid (Q·Kᵀ) decides routing: where "it" looks. The right matrix (V) carries content: what gets copied back. Grayed cells are masked: "it" cannot attend to future tokens.]

So the model enriches the representation of “it” with information from “dog”.
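Here is a toy version of that example in NumPy. Everything in it is hand-picked for illustration: the 2-d query and key vectors are hypothetical numbers chosen so that “it” lines up with “dog”, and the values are one-hot so the output row shows the mixture directly. I also apply a causal mask, assuming a decoder-style model where a token cannot look at later tokens.

```python
import numpy as np

tokens = ["dog", "ball", "it"]

# Hypothetical 2-d queries and keys, hand-picked so that "it" matches "dog".
Q = np.array([[1.0, 0.0],    # q_dog
              [0.0, 1.0],    # q_ball
              [2.0, 0.5]])   # q_it
K = np.array([[3.0, 0.0],    # k_dog
              [0.5, 1.0],    # k_ball
              [1.0, 1.0]])   # k_it
V = np.eye(3)  # one-hot values: each output row equals its attention weights

scores = Q @ K.T / np.sqrt(K.shape[-1])

# Causal mask: a token may attend to itself and earlier tokens only.
future = np.triu(np.ones_like(scores, dtype=bool), k=1)
scores = np.where(future, -np.inf, scores)

# Row-wise softmax; masked cells get weight 0 because exp(-inf) == 0.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

out = weights @ V
it_row = weights[tokens.index("it")]
for tok, w in zip(tokens, it_row):
    print(f"it -> {tok}: {w:.2f}")
```

With these made-up vectors, most of the weight in the “it” row lands on “dog”, so `out[2]` is dominated by the dog value.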

Why separate Q, K and V?

This is the key bit.

The model does not use the raw token embedding directly. It learns three different views of each token:

Q = XW_Q
K = XW_K
V = XW_V

Same input X, different learned matrices.
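As a sketch, the three projections are just three matrix multiplications on the same X. The sizes below (5 tokens, model dimension 16, head dimension 4) are arbitrary, and the weight matrices are random stand-ins for what a trained model would have learned.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 5, 16, 4   # illustrative sizes, not from any real model

X = rng.normal(size=(n_tokens, d_model))   # one row per token

# Random stand-ins for the learned projection matrices.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q   # what each token is looking for
K = X @ W_K   # how each token can be matched
V = X @ W_V   # what each token contributes if selected

print(Q.shape, K.shape, V.shape)
```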

Why?

Because “what I am looking for”, “how I should be matched”, and “what information I should contribute” are different jobs.

For example, the word “bank” might need to:

Q: look for context that disambiguates meaning
K: advertise that it is a noun, place, institution, river edge, etc.
V: contribute semantic content once selected

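A single vector would have to act as search request, match target, and payload all at once, so tuning it for one role would compromise the others. QKV gives the model specialised subspaces for matching and information transfer.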

The database analogy

This is probably the most useful analogy:

Query  = search query
Key    = index / searchable metadata
Value  = retrieved content

Attention is like searching a database where every token is a record.

Token = record
Key   = searchable field
Value = payload
Query = search request from current token

The attention score says:

How relevant is this token’s key to my query?

The output says:

Give me the values from the most relevant tokens.
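The difference from a real database is that the lookup is soft. A small sketch of the contrast, with made-up keys, values and query (and without the √dₖ scaling, to keep it short):

```python
import numpy as np

# Three "records": a key (searchable field) and a value (payload) each.
keys = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 1.0]])
values = np.array([[10.0],
                   [20.0],
                   [30.0]])
query = np.array([1.0, 0.1])   # search request from the current token

scores = keys @ query          # relevance of each record to the query

# Hard lookup (database style): return the payload of the single best match.
hard = values[np.argmax(scores)]

# Soft lookup (attention style): blend all payloads, weighted by relevance.
weights = np.exp(scores - scores.max())
weights /= weights.sum()
soft = weights @ values

print("hard:", hard)
print("soft:", soft)
```

The hard lookup returns exactly one record. The soft lookup returns a mixture that leans towards the best matches, which is also what keeps the whole operation differentiable.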

The important correction

People often say:

“Q asks a question, K answers it, V stores the answer.”

That is okay as a beginner analogy, but slightly misleading, because K does not answer anything: it only affects how strongly a token gets matched.

More accurately:

Q and K decide routing.
V carries content.

Q and K determine where to attend.

V determines what information gets copied/mixed into the output.
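One way to check the routing/content split is to change one side and watch what moves. In this sketch (random inputs, single head, names are my own), scaling V leaves the attention weights untouched but changes the output, while changing Q changes the weights themselves.

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))

out_a, w_a = attention(Q, K, V)
out_b, w_b = attention(Q, K, 2.0 * V)   # same routing, different content
out_c, w_c = attention(-Q, K, V)        # different routing, same content

print(np.allclose(w_a, w_b))   # weights depend only on Q and K
print(np.allclose(w_a, w_c))   # flipping Q reroutes attention
```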

One-line understanding

QKV attention is learned content-based routing: each token forms a query, matches it against other tokens’ keys, then pulls back a weighted blend of their values.
