Autoregressive Decoding

A decoder-only model generates text one token at a time: each new token is predicted from the prompt plus every token generated so far, and is then appended to that context before the next prediction.
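In probability terms, this corresponds to the chain-rule factorization of a sequence. The symbols x_t (the token at position t) and T (the sequence length) are notation introduced here for illustration, not taken from the text above.

```latex
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})
```

Each factor on the right is one forward pass of the model: the distribution over the next token given everything before it.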

Decoder loop

  1. Start with <BOS> or the tokenized user prompt.
  2. Run the model to get next-token probabilities.
  3. Select a token (greedy, beam search, or sampling).
  4. Append the token to the context and repeat from step 2.
  5. Stop at <EOS> or when the maximum length is reached.
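The five steps above can be sketched as a greedy loop. This is a minimal illustration under stated assumptions, not a real model: `toy_model`, the token ids chosen for `BOS` and `EOS`, and the 5-token vocabulary are all hypothetical stand-ins so the loop can run on its own.

```python
from typing import Callable, List

BOS, EOS = 0, 1  # hypothetical special-token ids, chosen for this sketch


def toy_model(tokens: List[int]) -> List[float]:
    """Hypothetical stand-in for a real model.

    Returns next-token probabilities over a 5-token vocabulary. The rule is
    deterministic: favor (last id + 1), and favor EOS after the final id.
    """
    vocab_size = 5
    last = tokens[-1]
    target = EOS if last == vocab_size - 1 else max(last + 1, 2)
    probs = [0.05] * vocab_size
    probs[target] = 0.8  # 4 * 0.05 + 0.8 = 1.0
    return probs


def greedy_decode(
    model: Callable[[List[int]], List[float]],
    prompt: List[int],
    max_new_tokens: int = 10,
) -> List[int]:
    tokens = list(prompt)                      # step 1: start from the prompt
    for _ in range(max_new_tokens):            # step 5: length limit
        probs = model(tokens)                  # step 2: next-token probabilities
        next_token = max(range(len(probs)), key=probs.__getitem__)  # step 3: greedy pick
        tokens.append(next_token)              # step 4: extend the context
        if next_token == EOS:                  # step 5: stop at <EOS>
            break
    return tokens


print(greedy_decode(toy_model, [BOS]))
```

Swapping the greedy pick in step 3 for a sampling call changes the strategy without touching the rest of the loop.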

Decoding strategies

  Method           Behavior                               Trade-off
  ---------------  -------------------------------------  ----------------------------------
  Greedy           Pick best token each step              Fast, deterministic, can be myopic
  Beam search      Track top candidate sequences          Better global quality, slower
  Top-k sampling   Sample from top-k tokens               Controlled diversity
  Top-p sampling   Sample from dynamic probability mass   Adaptive diversity
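The top-k and top-p strategies listed above differ only in how they truncate the distribution before sampling. The sketch below shows one common way to do that with plain Python lists; the function names (`top_k_filter`, `top_p_filter`, `sample`) and the example probabilities are assumptions made for illustration, and real libraries usually operate on logits and tensors instead.

```python
import random
from typing import List


def _renormalize(probs: List[float], keep: List[int]) -> List[float]:
    """Zero out every index not in `keep` and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def top_k_filter(probs: List[float], k: int) -> List[float]:
    """Keep only the k most probable tokens (a fixed-size candidate set)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, order[:k])


def top_p_filter(probs: List[float], p: float) -> List[float]:
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:  # the candidate set grows or shrinks with the distribution
            break
    return _renormalize(probs, keep)


def sample(probs: List[float], rng: random.Random) -> int:
    """Draw one token id according to the (filtered) probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = [0.5, 0.3, 0.15, 0.05]  # made-up next-token distribution
rng = random.Random(0)
print(sample(top_k_filter(probs, 2), rng))
print(sample(top_p_filter(probs, 0.9), rng))
```

With a peaked distribution, top-p keeps only a few tokens; with a flat one it keeps many, which is the sense in which its diversity is adaptive while top-k always keeps exactly k.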

Common generation controls