Positional Encoding and Multi-Head Attention

Why positional encoding is required

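Self-attention on its own is permutation-equivariant: reordering the input tokens reorders the outputs in the same way but leaves each output vector unchanged. Without a position signal, the model cannot tell "dog bites man" from "man bites dog".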

To fix this, models add a positional vector to each token embedding before the first attention layer: x_i = token_i + position_i.
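
A minimal NumPy sketch of the claim above, under simplifying assumptions: the attention function uses identity projections (no learned W_q, W_k, W_v), and the names self_attention, tokens, positions, and perm are illustrative, with random vectors standing in for real embeddings. It checks that shuffling the tokens only shuffles the outputs when no positions are added, and that this stops being true once x_i = token_i + position_i is used.

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention with identity projections (a simplification)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                  # (n, n) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ x

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))     # 4 token embeddings of dimension 8 (random stand-ins)
positions = rng.normal(size=(4, 8))  # hypothetical position vectors, one per slot
perm = np.array([2, 0, 3, 1])        # a reordering of the 4 tokens

# Without positions: permuting the input just permutes the output rows.
out = self_attention(tokens)
out_perm = self_attention(tokens[perm])
no_pos_equivariant = np.allclose(out_perm, out[perm])

# With positions: the reordered tokens land in different slots,
# so each one is summed with a different position vector.
x = tokens + positions
x_perm = tokens[perm] + positions
with_pos_equivariant = np.allclose(self_attention(x_perm), self_attention(x)[perm])

print(no_pos_equivariant, with_pos_equivariant)
```

The first flag shows that attention alone carries no order information. The second shows that adding position vectors makes the output depend on where each token sits.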

Sinusoidal encoding idea
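
Assuming this heading refers to the fixed sinusoidal scheme from the original Transformer paper, each position pos gets a vector whose even dimensions are sin(pos / 10000^(2i/d)) and whose odd dimensions are cos(pos / 10000^(2i/d)). The sketch below is one way to compute it. The function name sinusoidal_encoding and its parameters are illustrative, and an even d_model is assumed.

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model, base=10000.0):
    """Return a (seq_len, d_model) matrix of fixed sinusoidal position vectors."""
    assert d_model % 2 == 0, "this sketch assumes an even model dimension"
    pos = np.arange(seq_len)[:, None]         # (seq_len, 1) position indices
    i = np.arange(d_model // 2)[None, :]      # (1, d_model/2) frequency indices
    angles = pos / base ** (2 * i / d_model)  # broadcast to (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)              # even dimensions
    pe[:, 1::2] = np.cos(angles)              # odd dimensions
    return pe

pe = sinusoidal_encoding(seq_len=6, d_model=8)
print(pe.shape)
```

Because the wavelengths form a geometric progression, each position gets a distinct pattern across dimensions, and no parameters need to be learned.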

Why multi-head attention (MHA)

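A single head computes one set of attention weights per token, so it tends to capture one kind of relation at a time. Multiple heads, each with its own projections, can learn complementary patterns in parallel (for example syntax, coreference, local position, semantics).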

MultiHead(X) = Concat(head_1, ..., head_h) W_o, where head_i = Attention(X W_q^i, X W_k^i, X W_v^i)
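
The formula above can be sketched in NumPy as follows. This is an illustration rather than a reference implementation: it assumes scaled dot-product attention inside each head, stores the per-head projections as slices of single d_model x d_model matrices (w_q, w_k, w_v), and omits masking, biases, and dropout. All names other than W_o are assumptions introduced for the example.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (n, d_model). Returns Concat(head_1, ..., head_h) @ w_o, shape (n, d_model)."""
    n, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must split evenly across heads"
    d_head = d_model // num_heads

    def split(m):
        # (n, d_model) -> (num_heads, n, d_head): one slice of the projection per head
        return m.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)    # (num_heads, n, n)
    heads = softmax(scores) @ v                            # (num_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # Concat(head_1, ..., head_h)
    return concat @ w_o

rng = np.random.default_rng(0)
n, d_model, num_heads = 5, 16, 4
x = rng.normal(size=(n, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
y = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(y.shape)
```

Each head attends in a lower-dimensional subspace of size d_model / h, so the total cost stays close to that of one full-width head, while each head is free to learn its own attention pattern.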

Training tricks commonly used