Self-attention lets every token compare itself to every other token and decide relevance dynamically.
Q/K/V intuition
Query (Q): what this token is looking for.
Key (K): what each token offers.
Value (V): information to pass if selected.
YouTube search analogy
Query (Q): what you type in the search bar.
Key (K): the titles/tags each video exposes for matching.
Value (V): the actual content returned after ranking.
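The platform scores Q against every K vector, turns the scores into weights with softmax, and returns a weighted blend of the corresponding V vectors. This is where the analogy bends: a real search returns a ranked list, while attention returns a mixture.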
Core formula
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
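Dividing by sqrt(d_k) keeps the scores numerically stable before softmax: dot products grow in magnitude as d_k grows, and without the scaling softmax saturates toward a near one-hot distribution with very small gradients.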
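The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names `softmax` and `attention`, the list-of-rows matrix layout, and the toy values for Q, K, and V are all assumptions made for this example.

```python
import math

def softmax(scores):
    """Softmax over a list of floats, shifted by the max for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors.

    Q has n_q rows of length d_k, K has n_k rows of length d_k,
    V has n_k rows of length d_v. Returns (outputs, weights).
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    outputs, weights = [], []
    for q in Q:
        # One score per key: dot(q, k) / sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        w = softmax(scores)
        # Weighted sum of the value rows, one output column at a time
        out = [sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Toy input (made up): one query that matches the first key best
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
outputs, weights = attention(Q, K, V)
print(weights[0], outputs[0])
```

Because the weights come from softmax, they are positive and sum to 1, so the output row is a blend of the value rows that leans toward the value whose key best matches the query.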
flowchart LR
X[Token embeddings] --> Q[Q = XWq]
X --> K[K = XWk]
X --> V[V = XWv]
Q --> S[Scaled dot product scores]
K --> S
S --> A[Softmax weights]
A --> O[Weighted sum over V]
V --> O
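The flow in the diagram can be traced step by step in code. The sketch below follows the same node names (X, Q, K, V, S, A, O) for a single attention head in plain Python. The helpers `matmul`, `softmax_rows`, and `self_attention`, as well as the small embedding and weight matrices, are hypothetical values chosen only to make the shapes visible.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax_rows(S):
    """Apply a max-shifted softmax to each row of S."""
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: project X, score, normalize, blend."""
    Q = matmul(X, Wq)                      # Q = XWq
    K = matmul(X, Wk)                      # K = XWk
    V = matmul(X, Wv)                      # V = XWv
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]   # transpose of K
    S = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    A = softmax_rows(S)                    # one row of weights per token
    O = matmul(A, V)                       # weighted sum over V
    return O, A

# Made-up example: 3 tokens with 3-dimensional embeddings, d_k = d_v = 2
X = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0],
     [1.0, 1.0, 0.0]]
Wq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Wk = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
Wv = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]]

O, A = self_attention(X, Wq, Wk, Wv)
for token_weights in A:
    print([round(w, 3) for w in token_weights])
```

Each row of A holds one token's weights over all tokens, including itself, so A is 3x3 here and O has one output row per input token.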
Example intuition: pronoun resolution
Sentence: "The animal did not cross the street because it was tired."
For the token "it", a trained model typically assigns high attention weight to "animal" and "tired", which helps resolve "it" to the animal rather than the street.