Self-Attention and Q, K, V

Self-attention lets every token compare itself to every other token and decide relevance dynamically.

Q/K/V intuition

Query (Q): what this token is looking for.
Key (K): what each token offers.
Value (V): information to pass if selected.

YouTube search analogy

Query (Q): what you type in the search bar.
Key (K): titles/tags each video exposes for matching.
Value (V): the actual content returned after ranking.

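In the analogy, the platform scores Q against every K, converts the scores into weights with softmax, and returns a weighted blend of the corresponding V vectors. Unlike a real search, which returns a ranked list of separate results, attention mixes all values into one output.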
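The score, softmax, blend sequence can be sketched as a "soft lookup" for a single query. This is a minimal illustration using only the Python standard library; the function names (`softmax`, `soft_lookup`) and the toy vectors are assumptions made up for this example, not part of any library.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_lookup(query, keys, values):
    """Score one query against every key, then blend the values by weight."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    blended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, blended

# Hypothetical toy vectors, chosen by hand for illustration only.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

weights, blended = soft_lookup(query, keys, values)
print("weights:", weights)
print("blended value:", blended)
```

The first key matches the query best, so it receives the largest weight, but every value still contributes something to the blended result.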

Core formula

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

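Dividing the scores by sqrt(d_k) keeps their magnitude independent of the key dimension. If the components of Q and K have roughly unit variance, a raw dot product over d_k dimensions has variance of about d_k; large scores push softmax toward a near one-hot output with very small gradients.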
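To see why the sqrt(d_k) factor matters, the sketch below estimates the variance of dot products between random vectors, with and without the scaling. It assumes components drawn independently from a standard normal distribution, which is a simplification; `dot_variance` is a hypothetical helper written for this note.

```python
import math
import random

def dot_variance(d_k, scale, trials=2000, seed=0):
    """Empirical variance of q.k for random unit-variance vectors of size d_k."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        score = sum(a * b for a, b in zip(q, k))
        # Optionally divide by sqrt(d_k), as in the attention formula.
        samples.append(score / math.sqrt(d_k) if scale else score)
    mean = sum(samples) / trials
    return sum((s - mean) ** 2 for s in samples) / trials

for d_k in (4, 64):
    raw = dot_variance(d_k, scale=False)
    scaled = dot_variance(d_k, scale=True)
    print(f"d_k={d_k:3d}  raw variance ~ {raw:7.2f}  scaled variance ~ {scaled:5.2f}")
```

Under these assumptions the raw variance should grow roughly in proportion to d_k, while the scaled variance should stay near 1 for both sizes.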

flowchart LR
    X[Token embeddings] --> Q[Q = XWq]
    X --> K[K = XWk]
    X --> V[V = XWv]
    Q --> S[Scaled dot product scores]
    K --> S
    S --> A[Softmax weights]
    A --> O[Weighted sum over V]
    V --> O
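The pipeline in the flowchart can be sketched end to end for a single head, without masking or batching. This uses plain Python lists rather than a tensor library to stay self-contained; the helper names (`matmul`, `transpose`, `softmax_rows`, `self_attention`) and the identity projection matrices are assumptions for illustration. In a trained model, Wq, Wk, and Wv are learned.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Apply softmax independently to each row."""
    out = []
    for row in m:
        peak = max(row)  # subtracting the max avoids overflow in exp
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention: softmax(QK^T / sqrt(d_k)) V."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    raw = matmul(q, transpose(k))  # one row of scores per query token
    scores = [[s / math.sqrt(d_k) for s in row] for row in raw]
    weights = softmax_rows(scores)
    return weights, matmul(weights, v)

# Hypothetical inputs: 3 tokens, embedding size 2, identity projections.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

weights, output = self_attention(x, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and describes how much one token attends to every token, itself included; `output` has one blended vector per input token.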

Example intuition: pronoun resolution

Sentence: "The animal did not cross the street because it was tired."

For the token "it", a trained attention head often assigns high weight to "animal" and "tired", which helps the model resolve "it" to the animal rather than the street.