Why Transformers Replaced Earlier Seq2Seq Models

Before Transformers, sequence-to-sequence tasks such as machine translation were typically handled by RNN/LSTM encoder-decoder models. They worked, but they scaled poorly and struggled to retain context over long inputs.

Pre-attention Seq2Seq bottleneck

Classical encoder–decoder RNNs read the input token by token, then pass a single context vector to the decoder. For a 100-word paragraph, details from early words (e.g. noun number, gender) are squeezed into one fixed-size vector (often 512 dimensions) and are progressively overwritten as later tokens update the hidden state.

flowchart LR
    E1[Encoder: The] --> E2[cat] --> E3[sat]
    E3 --> V[Context vector bottleneck]
    V --> D1[Decoder: Le] --> D2[chat]
    V -.-> L[Long inputs lose early detail]
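
The bottleneck can be made concrete with a toy encoder. The sketch below is a minimal tanh RNN in plain Python with random, untrained weights; the function name encode, the hidden size of 4 (standing in for the 512 dimensions mentioned above), and the vocabulary size are all assumptions chosen for illustration. It shows only that the vector handed to the decoder has the same size for a 3-token input as for a 100-token one.

```python
import math
import random


def encode(token_ids, hidden_size=4, vocab_size=50, seed=0):
    """Toy tanh-RNN encoder that returns only its final hidden state."""
    rng = random.Random(seed)
    # Hypothetical random parameters; a real model would learn these.
    embed = [[rng.uniform(-1, 1) for _ in range(hidden_size)]
             for _ in range(vocab_size)]
    w_hh = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
            for _ in range(hidden_size)]

    h = [0.0] * hidden_size
    for tok in token_ids:  # strictly sequential: step t needs h from step t-1
        x = embed[tok]
        h = [math.tanh(x[i] + sum(w_hh[i][j] * h[j] for j in range(hidden_size)))
             for i in range(hidden_size)]
    return h  # the single context vector passed to the decoder


short_context = encode([1, 2, 3])            # 3 tokens
long_context = encode(list(range(50)) * 2)   # 100 tokens
print(len(short_context), len(long_context))  # prints: 4 4
```

Whatever the input length, the decoder receives the same number of values, so a longer input has to share the same capacity.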

Five fatal bottlenecks of classical sequence models

Fixed context bottleneck — one vector cannot faithfully store long inputs.
Vanishing / exploding gradients — error signals decay or blow up across long time chains.
Poor long-range dependencies — early token influence fades over many steps.
No training parallelism — step t waits for t−1, under-using GPUs.
Uniform input treatment — no dynamic focus on the most relevant source words.
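
The vanishing and exploding gradient item can be illustrated with simple arithmetic. The sketch below assumes, as a deliberate simplification, that backpropagation through each time step multiplies the error signal by one constant factor; in a real RNN that factor is a Jacobian matrix that changes at every step. The function name gradient_scale and the factors 0.9 and 1.1 are illustrative choices.

```python
def gradient_scale(per_step_factor, steps):
    """Scale of an error signal after being multiplied by the same factor at every step."""
    scale = 1.0
    for _ in range(steps):
        scale *= per_step_factor
    return scale


# A factor slightly below 1 shrinks the signal; slightly above 1 inflates it.
vanishing = gradient_scale(0.9, 100)
exploding = gradient_scale(1.1, 100)
print(f"{vanishing:.2e} {exploding:.2e}")
```

Over 100 steps the first value falls to roughly 2.7e-5 and the second grows to roughly 1.4e4, which is why early tokens receive almost no usable training signal in one case and destabilise training in the other.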

Cross-attention (Bahdanau 2015) — first fix

Instead of a single fixed vector, the decoder keeps every encoder hidden state h_i available as memory. At each decoding step t it computes alignment weights α_ti (a spotlight over source positions i, normalised to sum to 1) and builds a dynamic context vector: c_t = Σ_i α_ti · h_i. When translating “cat” → “chat”, the weights might be [0.05, 0.90, 0.05] on “The / cat / sat”.
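
The weighted sum above can be sketched in a few lines of plain Python. The alignment scores here are supplied by hand as an assumption; in Bahdanau-style attention they would come from a small learned network that compares the decoder state with each encoder state. The 2-dimensional encoder states and the helper names softmax and context_vector are likewise hypothetical.

```python
import math


def softmax(scores):
    """Turn raw alignment scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(weights, encoder_states):
    """c_t = sum_i alpha_ti * h_i, computed dimension by dimension."""
    dim = len(encoder_states[0])
    return [sum(w * h[d] for w, h in zip(weights, encoder_states))
            for d in range(dim)]


# Hypothetical 2-dimensional encoder states for "The", "cat", "sat".
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical alignment scores for the decoding step that emits "chat".
scores = [0.0, 3.0, 0.0]

alphas = softmax(scores)
c_t = context_vector(alphas, encoder_states)
print(alphas, c_t)
```

With these scores the weights come out at roughly [0.05, 0.91, 0.05], so the context vector is dominated by the state for “cat”. A different decoding step would produce different scores and therefore a different context.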

Why the Transformer was the next leap

Self-attention replaces recurrence — every token queries every token in parallel.
The maximum dependency path length between any two tokens becomes O(1), vs O(n) along an RNN chain.
Training parallelises fully on GPUs, with entire sequences processed as batched matrix operations.
Multi-head attention captures different relation types (syntax, semantics, position) simultaneously.
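
The first two points can be sketched as a single-head scaled dot-product self-attention in plain Python. This is a simplification under stated assumptions: queries, keys and values are all the raw token vectors (a real Transformer applies separate learned projections and uses several heads), and the function name self_attention and the 2-dimensional token vectors are illustrative.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product self-attention with Q = K = V = x."""
    d = len(x[0])
    outputs = []
    # Each query row is independent of the others, which is why real
    # implementations compute all rows at once as one matrix product.
    for q in x:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)  # one direct link to every token: O(1) path
        outputs.append([sum(w * v[j] for w, v in zip(weights, x))
                        for j in range(d)])
    return outputs


# Hypothetical 2-dimensional vectors for a three-token sequence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens)
print(attended)
```

No step in the loop depends on the result of a previous iteration, in contrast to the recurrent update where step t waits for step t−1.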

flowchart TB
    subgraph RNN["RNN cross-attention"]
        R1["Sequential O(n) steps"]
        R2[Decoder queries encoder memory]
    end
    subgraph TR["Transformer self-attention"]
        T1[Parallel token mesh]
        T2["O(1) token-to-token links"]
    end
    RNN --> TR