Self-Attention and Q, K, V

Self-attention lets every token compare itself to every other token and decide relevance dynamically.

Q/K/V intuition

YouTube search analogy

The platform scores Q against all K vectors, softmax-ranks them, and blends the corresponding V vectors.

Core formula

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

Scaling by sqrt(d_k) keeps scores numerically stable before softmax.

flowchart LR X[Token embeddings] --> Q[Q = XWq] X --> K[K = XWk] X --> V[V = XWv] Q --> S[Scaled dot product scores] K --> S S --> A[Softmax weights] A --> O[Weighted sum over V] V --> O

Example intuition: pronoun resolution

Sentence: "The animal did not cross the street because it was tired."

For token it, attention usually assigns high weight to animal and tired, helping resolve meaning.