Before Transformers, sequence tasks used RNN/LSTM encoder-decoder models. They worked, but had major scaling and context limitations.
Pre-attention Seq2Seq bottleneck
Classical encoder–decoder RNNs read the input token by token, then pass a single context vector to the decoder. For a 100-word paragraph, details from early words (e.g. noun number, gender) must be squeezed into one fixed-size vector (often around 512 dimensions) and are progressively overwritten as later tokens update the hidden state.
flowchart LR
E1[Encoder: The] --> E2[cat] --> E3[sat]
E3 --> V[Context vector bottleneck]
V --> D1[Decoder: Le] --> D2[chat]
V -.-> L[Long inputs lose early detail]
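The bottleneck can be made concrete with a toy encoder. The sketch below is only an illustration under stated assumptions: the `encode` and `random_sentence` helpers, the random weights, and the sizes (8-dimensional token vectors, 4-dimensional hidden state) are all hypothetical stand-ins for a trained model. It shows that a 3-token input and a 100-token input both end up as a context vector of exactly the same size.

```python
import math
import random


def encode(token_vectors, hidden_size, seed=0):
    """Toy tanh RNN encoder that returns only its final hidden state."""
    rng = random.Random(seed)
    input_size = len(token_vectors[0])
    # Hypothetical random weights; a real encoder would learn these.
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)]
            for _ in range(hidden_size)]
    w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
             for _ in range(hidden_size)]
    h = [0.0] * hidden_size
    for x in token_vectors:
        # Each step overwrites h with a mix of the new token and the old state.
        h = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[j], x))
                + sum(w * hi for w, hi in zip(w_rec[j], h))
            )
            for j in range(hidden_size)
        ]
    return h  # the single context vector handed to the decoder


def random_sentence(length, input_size, seed):
    """Hypothetical stand-in for a sequence of token embeddings."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(input_size)] for _ in range(length)]


short_context = encode(random_sentence(3, 8, seed=1), hidden_size=4)
long_context = encode(random_sentence(100, 8, seed=2), hidden_size=4)
print(len(short_context), len(long_context))
```

Whatever the input length, the decoder receives the same number of values, so longer inputs have to share the same capacity.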
Five fatal bottlenecks of classical sequence models
Fixed context bottleneck: one fixed-size vector cannot faithfully store long inputs.
Vanishing / exploding gradients: error signals shrink toward zero or blow up as they are backpropagated through many time steps.
Poor long-range dependencies: the influence of early tokens fades over many steps.
No training parallelism: step t must wait for step t−1, so GPUs are under-used.
Uniform input treatment: no dynamic focus on the most relevant source words.
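The vanishing / exploding gradient item can be illustrated with a deliberately simplified case. The sketch below assumes a scalar linear recurrence h_t = w · h_{t−1} (no nonlinearity, no inputs), which is not how a real RNN is built but keeps the chain rule visible: the gradient of the last state with respect to the first is one factor of w per time step. The function name `gradient_through_time` is hypothetical.

```python
def gradient_through_time(recurrent_weight, steps):
    """Gradient d h_T / d h_0 for the scalar recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= recurrent_weight  # chain rule: one factor of w per time step
    return grad


for w in (0.9, 1.1):
    grads = [gradient_through_time(w, t) for t in (10, 50, 100)]
    print(w, grads)
```

With w = 0.9 the gradient after 100 steps is on the order of 1e-5, and with w = 1.1 it is on the order of 1e4, so a small change in the recurrent weight separates a signal that vanishes from one that explodes.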
Cross-attention (Bahdanau et al., 2015): the first fix
Instead of a single static vector, the decoder keeps every encoder hidden state h_i available as a memory. At each decoding step t it computes alignment weights α_ti (a spotlight over source positions i, normalized to sum to 1) and builds a dynamic context: c_t = Σ_i α_ti · h_i. When translating “cat” → “chat”, the weights might be [0.05, 0.90, 0.05] on “The / cat / sat”.
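The weighted sum can be sketched in a few lines. This is a minimal illustration, not the original model: the 2-dimensional encoder states and the raw alignment scores are made up, and the scoring step itself (a small feed-forward network in Bahdanau-style attention) is skipped. Only the softmax normalization and the context sum c_t = Σ_i α_ti · h_i are shown.

```python
import math


def softmax(scores):
    """Turn raw alignment scores into weights that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(weights, encoder_states):
    """c_t = sum_i alpha_ti * h_i, computed dimension by dimension."""
    dim = len(encoder_states[0])
    return [sum(a * h[d] for a, h in zip(weights, encoder_states))
            for d in range(dim)]


# Hypothetical 2-dimensional encoder states for "The", "cat", "sat".
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical raw alignment scores at the step that emits "chat".
scores = [0.0, 2.9, 0.0]

alpha = softmax(scores)
c_t = context_vector(alpha, encoder_states)
print([round(a, 2) for a in alpha], [round(c, 2) for c in c_t])
```

With these assumed scores the weights come out close to the [0.05, 0.90, 0.05] spotlight described above, so the context is dominated by the hidden state for “cat”. A different decoding step would use different scores and therefore a different context.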
Why the Transformer was the next leap
Self-attention replaces recurrence: every token attends to every other token in parallel.
The maximum path length between any two positions becomes O(1) per self-attention layer, versus O(n) along an RNN chain.
Full parallelism during training: all positions of a sequence are processed together as matrix operations (autoregressive decoding at inference time is still sequential).
Multi-head attention lets different heads capture different relation types (e.g. syntax, semantics, position) simultaneously.
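The core operation behind these points is scaled dot-product attention, softmax(Q Kᵀ / √d_k) V. The sketch below is a single-head illustration under simplifying assumptions: the token vectors are made up, and they are used directly as queries, keys and values, whereas a real Transformer layer first applies learned projection matrices and runs several heads side by side. It is written in plain Python for readability; the loop over queries has no dependency between iterations, which is what lets real implementations compute all rows as one matrix product.

```python
import math


def softmax(scores):
    """Normalize scores into attention weights that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    out_dim = len(values[0])
    outputs = []
    for q in queries:  # iterations are independent of one another
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(out_dim)])
    return outputs


# Hypothetical 2-dimensional token vectors, used as Q, K and V alike.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens, tokens, tokens)
for row in attended:
    print([round(x, 3) for x in row])
```

Each output row is a weighted average of all value vectors, so every position can draw on every other position in a single step rather than through a chain of recurrent updates.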