Decoder generation is causal: at position t, the model must not attend to tokens at positions greater than t. A lower-triangular (causal) mask on the attention scores enforces this. The mask:
Prevents training-time leakage ("cheating").
Still allows parallel computation across positions during training.
Enables autoregressive next-token learning.
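The masking described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the helper names `causal_mask` and `masked_softmax` are hypothetical, the scores are random stand-ins for real query-key products, and a single unbatched sequence is assumed.

```python
import numpy as np

def causal_mask(seq_len):
    # True where attention is allowed: query position i may see key positions j <= i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    # Disallowed positions are set to -inf so they receive zero weight after softmax.
    scores = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability (the diagonal is always allowed,
    # so every row has at least one finite entry).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Illustrative attention scores for a sequence of 4 positions (assumed data).
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))
weights = masked_softmax(scores, causal_mask(4))
```

All rows are computed in one pass, which is the parallelism the list above refers to. Every entry above the diagonal of `weights` is zero, so no position receives information from later positions.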
Cross-attention in encoder-decoder models
Cross-attention connects the decoder to the source input:
Queries (Q) come from the decoder hidden state.
Keys and values (K, V) come from the encoder outputs.
This lets each generated target token align with relevant source tokens (for translation, summarization, etc.).
flowchart LR
EN[Encoder outputs] --> K[Keys]
EN --> V[Values]
DE[Decoder hidden state] --> Q[Query]
Q --> ATT[Cross-attention]
K --> ATT
V --> ATT
ATT --> OUT[Decoder contextual output]
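The data flow in the diagram can be sketched as single-head cross-attention in NumPy. Several things here are assumptions rather than details from the text: the function name `cross_attention`, the projection matrices `w_q`, `w_k`, `w_v`, the dimensions, and the use of scaled dot-product scoring (dividing by the square root of the key dimension). Inputs are random placeholders, and batching and multiple heads are omitted.

```python
import numpy as np

def cross_attention(decoder_states, encoder_outputs, w_q, w_k, w_v):
    # Queries are projected from the decoder; keys and values from the encoder.
    q = decoder_states @ w_q       # (tgt_len, d_k)
    k = encoder_outputs @ w_k      # (src_len, d_k)
    v = encoder_outputs @ w_v      # (src_len, d_k)

    # Scaled dot-product scores: one row per target position, one column per source token.
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (tgt_len, src_len)

    # Softmax over source positions. No causal mask is applied here,
    # because the whole source sequence is visible to the decoder.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Each target position gets a weighted mix of the source values.
    return weights @ v, weights

# Illustrative shapes (assumed): 3 target positions, 5 source tokens.
rng = np.random.default_rng(0)
d_model, d_k = 8, 4
decoder_states = rng.normal(size=(3, d_model))
encoder_outputs = rng.normal(size=(5, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

context, weights = cross_attention(decoder_states, encoder_outputs, w_q, w_k, w_v)
```

Row i of `weights` is the distribution over source tokens for target position i, which is the soft alignment the text describes for tasks such as translation.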