Keys, Queries, Values: Why All Three?

“Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away.” - Antoine de Saint-Exupéry

This principle underpins the split into Keys, Queries, and Values. It may look like extra complexity, but it's a deliberate balance between expressiveness and simplicity: historically, pieces were added only when simpler setups stopped working.

The Anatomy of Modern Attention

At the heart of every Transformer lies the attention mechanism, which uses four key elements:

  • Input embeddings - the raw representation of our data
  • Keys (K) - what information is available
  • Queries (Q) - what information we’re looking for
  • Values (V) - the information to be aggregated if selected
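To make the roles concrete, here is a minimal NumPy sketch (toy dimensions and random matrices standing in for learned projections): each of the three is just a different linear view of the same input embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 5, 16, 8        # toy sizes (hypothetical)

X = rng.normal(size=(n_tokens, d_model)) # input embeddings, one row per token

# Random stand-ins for the learned projection matrices
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q   # what each token is looking for
K = X @ W_K   # what each token advertises
V = X @ W_V   # what each token contributes if selected

print(Q.shape, K.shape, V.shape)         # (5, 8) (5, 8) (5, 8)
```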

Why this configuration? The answer comes from how we got here: refining simpler systems and adding complexity only when necessary.

The Evolution of Complexity: From One to Three

Stage 1: One Vector to Rule Them All

Early semantic search and similarity matching used a single embedding for all roles:

Embeddings = K = Q = V

Classic examples include:

  • Word2Vec (Mikolov et al., 2013) for word similarity search
  • GloVe (Pennington et al., 2014) for global word embeddings
  • Siamese Networks (Koch et al., 2015) for one-shot learning

You encode items into vectors and compare via cosine similarity or dot product. The same representation serves as the searchable index (keys), the search query (queries), and the retrieved content (values). This works well for basic retrieval: “What’s most similar to this?”

Why it works: the similarity function is enough to rank items.
Why it’s limited: one space must represent content, express queries, and carry information forward, so matching and message passing are entangled.
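A minimal sketch of this stage, assuming a few made-up toy vectors and plain cosine similarity: a single embedding per item plays all three roles at once.

```python
import numpy as np

# Toy item embeddings (hypothetical); the same vectors serve as the
# searchable index (K), the query (Q), and the returned content (V).
items = np.array([
    [0.9, 0.1, 0.0],   # "cat"
    [0.8, 0.2, 0.1],   # "kitten"
    [0.0, 0.1, 0.9],   # "car"
])

query = items[0]       # "what's most similar to 'cat'?" -- the query IS an item embedding

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = np.array([cosine(query, item) for item in items])
print(scores.argsort()[::-1])   # ranking; the retrieved "value" is the embedding itself
```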

Stage 2: Two Towers - Separating Queries from Keys

As tasks grew more sophisticated, dual-encoder retrieval emerged: one encoder for queries, another for documents. Now Q and K live in related but distinct spaces. You still score with $\langle Q, K \rangle$, but you’ve acknowledged that asking (querying) and being found (keying) are different jobs.

Key architectures in this stage:

  • Bahdanau attention (2014) in neural machine translation, where decoder hidden states query encoder outputs
  • Show, Attend and Tell (Xu et al., 2015) for image captioning
  • Pointer Networks (Vinyals et al., 2015) where decoder queries point to encoder positions

Why it helps: the system can specialize. Query encoders learn to express intent; key encoders learn to advertise content.
What’s still missing: there’s no separate value pathway to carry information onward. Retrieval returns an ID or embedding; it doesn’t yet mix content across many tokens as a computation step.
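A rough sketch of the two-tower idea, with two random linear maps standing in for the query and document encoders (in practice these are full networks): Q and K now live in different, specialized projections, but retrieval still just returns an index.

```python
import numpy as np

rng = np.random.default_rng(1)
d_text, d_match = 16, 8

# Stand-ins for the two towers: in practice full encoders, here just
# linear maps into a shared matching space.
W_query = rng.normal(size=(d_text, d_match))   # query tower
W_doc   = rng.normal(size=(d_text, d_match))   # document tower

docs    = rng.normal(size=(10, d_text))        # 10 document representations
request = rng.normal(size=(d_text,))           # one incoming query

q = request @ W_query      # Q: how the request expresses intent
k = docs @ W_doc           # K: how documents advertise content

scores = k @ q             # <Q, K> scoring
print(int(scores.argmax()))  # returns an index/embedding; no separate value pathway yet
```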

Stage 3: Cross-Attention - Keys Separate from Values

In encoder-decoder models (for example, translation), the decoder forms Q from its current state and attends over K, V from the encoder. This is where V really matters: K is for addressing, V carries the payload.

Why it helps: you can point at states (via K) and lift their content (via V). Queries do the pointing, values do the carrying. This separation recognizes that what makes something easy to find (keys) isn’t necessarily what makes it useful once found (values).
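A sketch of a single cross-attention step under toy assumptions (one decoder state, a handful of random encoder states): the keys do the addressing, and a separate value projection carries the content that gets mixed.

```python
import numpy as np

rng = np.random.default_rng(2)
n_enc, d_model, d_k = 6, 16, 8

enc = rng.normal(size=(n_enc, d_model))   # encoder states (e.g. the source sentence)
dec = rng.normal(size=(d_model,))         # current decoder state

W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

q = dec @ W_Q             # the decoder asks
K = enc @ W_K             # encoder positions advertise (addressing)
V = enc @ W_V             # encoder positions carry content (payload)

scores = K @ q / np.sqrt(d_k)
weights = np.exp(scores - scores.max())
weights /= weights.sum()                  # softmax over encoder positions

context = weights @ V                     # weighted mix of values
print(context.shape)                      # (8,)
```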

Stage 4: Self-Attention - The Transformer Trinity

The Transformer's breakthrough (Vaswani et al., 2017) was to keep all three roles distinct while letting every token play all of them at once. Every token creates its own Q, K, V:

  • Queries (Q) represent “what information am I looking for from my current position?”
  • Keys (K) represent “what information can I provide to others?”
  • Values (V) represent “what content should I contribute if selected?”

This separation allows specialization:

  • Q vs K enables asymmetric matching: the model learns a bilinear form $QK^\top$ where what counts as a good match depends on who is asking.
  • V decouples content from matching: you don’t copy the key itself; you project a representation optimized for downstream layers.
  • An output projection $W_O$ then remixes the concatenated heads after attention.

The softmax of the scaled scores, $\mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)$, determines how much each value contributes to the output. This yields a learned metric for matching (via $W_Q$ and $W_K$) that's independent of the content channel (via $W_V$). Three projections have proven sufficient for complex tasks across language, vision-language, and beyond.
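Putting the pieces together, here is a compact multi-head self-attention sketch in NumPy (random matrices in place of learned weights, toy shapes): every token builds its own Q, K, V, and $W_O$ remixes the concatenated heads.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    """Every token builds its own Q, K, V; W_O remixes the concatenated heads."""
    n, d_model = X.shape
    d_head = d_model // n_heads

    # Project once, then split into heads: (n_heads, n, d_head)
    def split(M):
        return (X @ M).reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(W_Q), split(W_K), split(W_V)

    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, n, n)
    weights = softmax(scores, axis=-1)                    # how much each value contributes
    heads = weights @ V                                   # (n_heads, n, d_head)

    concat = heads.transpose(1, 0, 2).reshape(n, d_model) # concatenate heads
    return concat @ W_O                                   # output projection

rng = np.random.default_rng(3)
n, d_model, n_heads = 5, 16, 4
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_self_attention(X, W_Q, W_K, W_V, W_O, n_heads).shape)  # (5, 16)
```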

Why Not Four? The Diminishing Returns of Additional Complexity

The Mathematical Redundancy Problem

Many proposed “fourth matrix” ideas are algebraically redundant. If you add a bilinear form $W_b$ between queries and keys:

$$q^\top W_b k = (W_b^\top q)^\top k = q^\top (W_b k),$$

so it can be folded into $W_Q$ or $W_K$ unless $W_b$ is conditional or nonlinear.
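A quick numeric check of this folding, using random NumPy matrices: inserting a bilinear $W_b$ between queries and keys produces exactly the same scores as absorbing it into $W_Q$.

```python
import numpy as np

rng = np.random.default_rng(4)
d_model, d_k, n = 16, 8, 5

X   = rng.normal(size=(n, d_model))
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_b = rng.normal(size=(d_k, d_k))      # proposed "fourth matrix" between Q and K

Q, K = X @ W_Q, X @ W_K

scores_with_extra = Q @ W_b @ K.T             # q^T W_b k for every pair
scores_folded     = (X @ (W_Q @ W_b)) @ K.T   # same scores with W_b absorbed into W_Q

print(np.allclose(scores_with_extra, scores_folded))  # True
```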

Where Researchers Actually Add Complexity

When extra complexity helps, it usually crosses a nonlinearity or adds structure rather than adding another plain projection. Well-known examples:

  • Talking-Heads attention - mixes heads before or after softmax
  • Relative position bias and RoPE - structured positional signals
  • Performer - kernelized attention for efficient weighting
  • Multi-Query and Grouped-Query Attention - share K/V across heads to scale decoding
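As one concrete example from this list, a rough sketch of the grouped-query idea (toy NumPy shapes, not tied to any particular library): several query heads share a single K/V head, which shrinks the K/V cache during decoding without adding a new kind of projection.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d_head = 6, 8
n_q_heads, n_kv_heads = 8, 2                   # 4 query heads share each K/V head
group = n_q_heads // n_kv_heads

Q = rng.normal(size=(n_q_heads, n, d_head))    # per-head queries
K = rng.normal(size=(n_kv_heads, n, d_head))   # fewer K heads...
V = rng.normal(size=(n_kv_heads, n, d_head))   # ...and fewer V heads to cache

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

outs = []
for h in range(n_q_heads):
    kv = h // group                            # which shared K/V head this query head uses
    w = softmax(Q[h] @ K[kv].T / np.sqrt(d_head))
    outs.append(w @ V[kv])

out = np.concatenate(outs, axis=-1)            # (n, n_q_heads * d_head)
print(out.shape)
```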

Essential degrees of freedom

K-Q-V persists because it balances expressivity, compute, and clarity. The three roles map directly to what the model must do:

  • Asking (Q): express information needs. Separating Q from K enables a learned asymmetric match $QK^\top$. Extra linear projections are usually redundant, and since attention is $O(n^2)$, those parameters often pay off more in wider MLPs or added depth.
  • Matching (K): determine relevance. K provides an indexable view for addressing; additional linear maps between Q and K can often be folded into $W_Q$ or $W_K$.
  • Retrieving (V): carry content forward. V decouples message passing from matching; keeping roles clean (Q asks, K matches, V carries) makes the block easier to reason about and to scale.

Looking Forward: The Stability of Three

Recent work focuses less on inventing a fourth projection and more on using the three we have more cleverly:

  • Mixture of Experts (MoE) models route different tokens to specialized expert subnetworks (typically the feed-forward blocks) while leaving the K-Q-V attention pattern unchanged
  • Hybrid architectures like Griffin (DeepMind, 2024) combine local attention with recurrent layers to handle very long contexts efficiently
  • State-space models (for example, Mamba) that model sequences as learned dynamical systems with linear-time inference still echo the ask-match-retrieve pattern

Until a fundamentally different architecture comes along, the balance of Keys, Queries, and Values remains a sweet spot - complex enough to be powerful, simple enough to scale. As we push toward longer contexts and multimodal understanding, the principle remains: make it as complex as necessary, but not one matrix more.