Decoder Masking and Cross-Attention

Masked self-attention in decoder

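Decoder generation is causal: at position t, the model must not attend to any token at a position greater than t. A lower-triangular mask enforces this by setting the attention scores for future positions to negative infinity before the softmax, so those positions receive zero weight.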
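
As a rough illustration of the masking step, the sketch below builds a lower-triangular mask and applies it before a row-wise softmax. It uses plain Python lists rather than a tensor library to stay dependency-free; the helper names `causal_mask` and `masked_softmax` and the all-zero example scores are assumptions made for this example, not part of any particular framework.

```python
import math


def causal_mask(n):
    """Return an n x n mask: True where query position i may attend to key position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask):
    """Row-wise softmax over scores, with disallowed positions set to -inf first."""
    out = []
    for row, allowed in zip(scores, mask):
        masked = [s if ok else -math.inf for s, ok in zip(row, allowed)]
        peak = max(masked)  # finite, because the diagonal is always allowed
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


# Hypothetical raw attention scores for a 3-token sequence (all equal for clarity).
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = masked_softmax(scores, causal_mask(3))

for row in weights:
    print(row)
```

With equal scores, each row spreads its weight evenly over the current and earlier positions only: the first position attends to itself alone, the second splits its weight between the first two positions, and no row places any weight on a later position.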

Cross-attention in encoder-decoder models

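Cross-attention connects the decoder to the source input: queries come from the decoder hidden states, while keys and values come from the encoder outputs.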

This lets each generated target token align with relevant source tokens (for translation, summarization, etc.).

flowchart LR
    EN[Encoder outputs] --> K[Keys]
    EN --> V[Values]
    DE[Decoder hidden state] --> Q[Query]
    Q --> ATT[Cross-attention]
    K --> ATT
    V --> ATT
    ATT --> OUT[Decoder contextual output]
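
The data flow of cross-attention can be sketched in a few lines of plain Python. This is a minimal single-head version under simplifying assumptions: it uses scaled dot-product scoring and skips the learned query, key, and value projections that real models apply, so the decoder states act directly as queries and the encoder outputs act as both keys and values. The function name `cross_attention` and the toy vectors are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(decoder_states, encoder_outputs):
    """Single-head scaled dot-product cross-attention without learned projections.

    Queries come from decoder_states; keys and values both come from
    encoder_outputs. Returns (context, weights).
    """
    d = len(encoder_outputs[0])
    context, weights = [], []
    for q in decoder_states:
        # One score per source token: scaled dot product of query and key.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in encoder_outputs
        ]
        w = softmax(scores)
        # Context vector: attention-weighted sum of the encoder value vectors.
        ctx = [sum(wj * v[c] for wj, v in zip(w, encoder_outputs)) for c in range(d)]
        weights.append(w)
        context.append(ctx)
    return context, weights


# Toy inputs (assumed for illustration): 3 source tokens, 2 target positions.
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_states = [[4.0, 0.0], [0.0, 4.0]]
context, weights = cross_attention(decoder_states, encoder_outputs)
```

Each row of `weights` has one entry per source token, which is what allows a target position to align with the source tokens most similar to its query. No causal mask is applied here, since the full source sequence is available to every decoder position.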

Self-attention vs cross-attention

| Mechanism | Q source | K/V source | Use |
| --- | --- | --- | --- |
| Self-attention | Same sequence | Same sequence | Intra-sequence context |
| Cross-attention | Decoder | Encoder | Source-target alignment |