Positional Encoding and Multi-Head Attention
Why positional encoding is required
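Self-attention is permutation-equivariant by default: reordering the input tokens only reorders the outputs in the same way, so the layer itself has no notion of order. Without position signals, "dog bites man" and "man bites dog" produce the same set of token representations.
Models therefore add a positional vector to each token embedding: x_i = token_i + position_i.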
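The order problem can be checked numerically. The sketch below is a minimal illustration, not a real model: it assumes Q = K = V = X (no learned projections), and the names `self_attention`, `tokens`, `positions`, and `add_pos` as well as all the numbers are made up for the example.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Simplified self-attention with Q = K = V = X (illustrative assumption)."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        out.append([sum(w[j] * X[j][c] for j in range(len(X))) for c in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # made-up token embeddings
positions = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3]]  # made-up position vectors

perm = [2, 0, 1]                       # a reordering of the sentence
shuffled = [tokens[i] for i in perm]

# Without positions: shuffling the input just shuffles the output rows.
out = self_attention(tokens)
out_shuffled = self_attention(shuffled)
equivariant = all(
    math.isclose(a, b, abs_tol=1e-12)
    for i, p in enumerate(perm)
    for a, b in zip(out_shuffled[i], out[p])
)

def add_pos(toks):
    # x_i = token_i + position_i; the position belongs to the slot, not the token
    return [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(toks, positions)]

# With positions: the same token now gets a different representation
# depending on where it sits in the sequence.
out_pos = self_attention(add_pos(tokens))
out_pos_shuffled = self_attention(add_pos(shuffled))
order_sensitive = any(
    not math.isclose(a, b, abs_tol=1e-9)
    for i, p in enumerate(perm)
    for a, b in zip(out_pos_shuffled[i], out_pos[p])
)

print(equivariant, order_sensitive)
```

Both flags come out true: plain attention cannot tell the two orderings apart, while the position-augmented inputs can.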
Sinusoidal encoding idea
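- Uses sine on even dimensions and cosine on odd dimensions: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).
- Gives each position a distinct pattern; for a fixed offset k, the encoding of pos + k is a linear function of the encoding of pos, which supports reasoning about relative distance.
- Adds no trainable position parameters in the fixed sinusoidal variant, unlike learned position embeddings.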
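The properties listed above can be sketched in a few lines. This is a minimal version under stated assumptions: the base 10000 follows the original Transformer formulation, and the function name `sinusoidal_encoding`, the helper `dot`, and the sizes (50 positions, 8 dimensions) are chosen only for illustration.

```python
import math

def sinusoidal_encoding(num_positions, d_model, base=10000.0):
    """Return a num_positions x d_model table of fixed position vectors."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even for sine/cosine pairs")
    table = []
    for pos in range(num_positions):
        row = [0.0] * d_model
        for i in range(0, d_model, 2):        # i is the even dimension index
            angle = pos / (base ** (i / d_model))
            row[i] = math.sin(angle)          # even dimension: sine
            row[i + 1] = math.cos(angle)      # odd dimension: cosine
        table.append(row)
    return table

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

pe = sinusoidal_encoding(num_positions=50, d_model=8)

# Position 0 is sin(0) = 0 and cos(0) = 1 in every pair.
print(pe[0])

# The similarity between two positions depends only on their offset:
# positions (3, 7) and (10, 14) are both 4 apart.
print(dot(pe[3], pe[7]), dot(pe[10], pe[14]))
```

The two dot products agree up to floating-point error, which is one concrete sense in which the encoding carries relative distance. Nothing in the table is learned, so it can be computed once and reused.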
Why multi-head attention (MHA)
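A single head computes one set of attention weights per token, so it tends to capture one type of relation. Multiple heads, each with its own learned projections, run in parallel and can learn complementary patterns (syntax, coreference, local position, semantics).
MultiHead(X) = Concat(head_1, ..., head_h) W_o, where head_i = Attention(X W_q_i, X W_k_i, X W_v_i)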
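A small sketch of the formula, using plain Python lists so it runs without extra libraries. The projection names `W_q`, `W_k`, `W_v`, the random weights, and the sizes (4 tokens, d_model = 8, 2 heads) are assumptions for illustration; a real model would learn these matrices.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(Q[0])
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K] for q in Q]
    weights = [softmax(r) for r in scores]
    return matmul(weights, V)

def multi_head(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) triples, one triple per head."""
    head_outputs = [
        attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concat(head_1, ..., head_h): join the per-head features for each token
    concat = [sum((out[i] for out in head_outputs), []) for i in range(len(X))]
    return matmul(concat, W_o)

def random_matrix(rows, cols, rng):
    # Stand-in for learned weights (hypothetical values)
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d_model, h = 4, 8, 2
d_head = d_model // h

X = random_matrix(n, d_model, rng)
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3)) for _ in range(h)]
W_o = random_matrix(h * d_head, d_model, rng)

Y = multi_head(X, heads, W_o)
print(len(Y), len(Y[0]))
```

Each head works in a smaller d_head-dimensional space, so the total cost stays close to that of one full-width head, and W_o mixes the concatenated head outputs back into d_model dimensions.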
Training tricks commonly used
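- Residual connections (Add): each sublayer's output is added to its input, which eases gradient flow through deep stacks.
- Layer normalization (Norm): each token's feature vector is normalized to stabilize training.
- Scaled dot-product attention: scores are divided by sqrt(d_k) so the softmax does not saturate as the key dimension grows.
- Position signals added to embeddings, so attention can make use of token order.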
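Two of these items are easy to see in isolation. The sketch below assumes the post-norm arrangement LayerNorm(x + Sublayer(x)) and omits the learned gain and bias of layer normalization; `add_and_norm`, the toy `sublayer`, and the example vectors are hypothetical and chosen only to make the effect visible.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(x, eps=1e-5):
    """Normalize one token's features to zero mean and unit variance (no gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    """Residual connection followed by layer normalization (post-norm assumption)."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

# Add & Norm on a toy vector with a toy sublayer
x = [1.0, 2.0, 3.0, 4.0]
sublayer = lambda v: [0.5 * a for a in v]
y = add_and_norm(x, sublayer)

# Why scores are scaled: with d_k = 64, raw dot products are large and
# the softmax puts almost all weight on one key.
d_k = 64
q = [1.0] * d_k
keys = [[1.0] * d_k, [0.5] * d_k]
raw = [sum(a * b for a, b in zip(q, k)) for k in keys]
scaled = [s / math.sqrt(d_k) for s in raw]

unscaled_weights = softmax(raw)
scaled_weights = softmax(scaled)
print(y)
print(unscaled_weights, scaled_weights)
```

With unscaled scores the second key receives essentially zero weight, which also means almost no gradient flows to it; after dividing by sqrt(d_k) it keeps a small but usable share.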