Decoder Masking and Cross-Attention

Masked self-attention in the decoder

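Decoder generation is causal: at position t, the model must not attend to tokens at positions greater than t. A lower-triangular mask enforces this by setting the attention scores for future positions to negative infinity before the softmax, so their attention weights become zero. The mask:

Prevents training-time leakage ("cheating") from future target tokens.
Still allows parallel computation across positions during training.
Enables autoregressive next-token learning.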
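
The masking step can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the helper names `causal_mask` and `masked_softmax` and the toy score values are assumptions made for this example, and real models apply the same idea to batched tensors.

```python
import math

def causal_mask(n):
    """Return an n x n lower-triangular mask: True where attention is allowed."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax, with disallowed entries set to -inf beforehand."""
    out = []
    for row, allowed in zip(scores, mask):
        masked = [s if ok else -math.inf for s, ok in zip(row, allowed)]
        peak = max(masked)  # always finite: the diagonal entry is allowed
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Toy raw attention scores for a 4-token sequence (hypothetical values).
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_softmax(scores, causal_mask(4))

for row in weights:
    print([round(w, 2) for w in row])
```

With uniform scores, each row spreads its weight evenly over the positions up to and including its own, and every entry above the diagonal is exactly zero. All rows are computed from the same score matrix, which is why training can still process every position in parallel.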

Cross-attention in encoder-decoder models

Cross-attention connects the decoder to the source input:

Q (queries) comes from the decoder hidden state.
K and V (keys and values) come from the encoder outputs.

This lets each generated target token align with relevant source tokens (for translation, summarization, etc.).
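
As a rough sketch of this data flow, the following plain-Python example computes scaled dot-product cross-attention for a toy input. The function name `cross_attention`, the projection matrices `w_q`, `w_k`, `w_v`, and all numeric values are assumptions chosen for illustration. It uses a single head and no batching, both of which a real model would add.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_outputs, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention (illustrative only)."""
    q = matmul(decoder_states, w_q)    # queries from the decoder: (T_tgt, d)
    k = matmul(encoder_outputs, w_k)   # keys from the encoder:    (T_src, d)
    v = matmul(encoder_outputs, w_v)   # values from the encoder:  (T_src, d)
    d = len(q[0])
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, transpose(k))]
    weights = [softmax(row) for row in scores]  # one row per target position
    return matmul(weights, v), weights

# Toy inputs: 3 source tokens, 2 target positions, model width 2.
identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical projections
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_states = [[2.0, 0.0], [0.0, 2.0]]

context, weights = cross_attention(
    decoder_states, encoder_outputs, identity, identity, identity
)
print(len(weights), len(weights[0]))  # target positions x source positions
```

Each row of `weights` is a distribution over source tokens for one target position, which is the alignment described above. No causal mask is applied here, since the full source sequence is available to every decoder position.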

flowchart LR
    EN[Encoder outputs] --> K[Keys]
    EN --> V[Values]
    DE[Decoder hidden state] --> Q[Query]
    Q --> ATT[Cross-attention]
    K --> ATT
    V --> ATT
    ATT --> OUT[Decoder contextual output]

Self-attention vs cross-attention

Mechanism	Q source	K/V source	Use
Self-attention	Same sequence	Same sequence	Intra-sequence context
Cross-attention	Decoder	Encoder	Source-target alignment