Why Transformers Replaced Earlier Seq2Seq Models

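Before Transformers, sequence-to-sequence tasks such as translation relied on RNN/LSTM encoder-decoder models. They worked, but they processed tokens one at a time, which limited parallel training, and they struggled to retain context over long inputs.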

Pre-attention Seq2Seq bottleneck

Classical encoder–decoder RNNs read the input token by token, then pass a single fixed-size context vector (the encoder's final hidden state) to the decoder. For a 100-word paragraph, every detail, including early ones such as a noun's number or gender, must be squeezed into that one vector (often 512 dimensions), and information from early tokens is progressively overwritten as later tokens are read.

flowchart LR
    E1[Encoder: The] --> E2[cat] --> E3[sat]
    E3 --> V[Context vector bottleneck]
    V --> D1[Decoder: Le] --> D2[chat]
    V -.-> L[Long inputs lose early detail]
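The bottleneck can be made concrete with a toy encoder. The following is a minimal sketch, not a trained model: the function names `encode` and `random_sentence`, the hidden size of 4, and the random weights are all assumptions chosen for illustration. It shows that the decoder receives a vector of the same size whether the input has 3 tokens or 100.

```python
import math
import random


def encode(token_vectors, hidden_size=4, seed=0):
    """Toy Elman-style RNN encoder that returns only its final hidden state."""
    rng = random.Random(seed)
    input_size = len(token_vectors[0])
    # Hypothetical random weights; a real model would learn these.
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)]
            for _ in range(hidden_size)]
    w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
             for _ in range(hidden_size)]

    h = [0.0] * hidden_size
    for x in token_vectors:
        # Strictly sequential: step t cannot start until step t-1 has produced h.
        h = [
            math.tanh(
                sum(w_in[j][k] * x[k] for k in range(input_size))
                + sum(w_rec[j][k] * h[k] for k in range(hidden_size))
            )
            for j in range(hidden_size)
        ]
    return h  # the single context vector handed to the decoder


def random_sentence(length, dim=3, seed=1):
    """Stand-in for token embeddings: `length` random vectors of size `dim`."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(length)]


short_context = encode(random_sentence(3))
long_context = encode(random_sentence(100))
print(len(short_context), len(long_context))  # prints "4 4"
```

Both inputs are compressed into four numbers here. With a 512-dimensional state the capacity is larger, but it is still fixed while the input length is not.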

Five fatal bottlenecks of classical sequence models

Cross-attention (Bahdanau et al., 2015): the first fix

Instead of a single static vector, the decoder keeps all encoder hidden states h_i available as a memory. At each decoding step t it computes alignment weights α_ti, a spotlight over source positions i whose values sum to 1, and builds a dynamic context vector: c_t = Σ_i α_ti · h_i. When translating “cat” → “chat”, the weights might be [0.05, 0.90, 0.05] over “The / cat / sat”.
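The weighted sum above can be sketched in a few lines. This is an illustrative example under stated assumptions: it scores positions with a plain dot product followed by a softmax, whereas Bahdanau-style attention uses a small learned scoring network, and the one-hot encoder states and the decoder state are made-up values chosen so that the weights land near the ones quoted above.

```python
import math


def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(decoder_state, encoder_states):
    """Return (weights, context), where context = sum_i weights[i] * encoder_states[i]."""
    # Assumption: dot-product scoring stands in for the learned alignment model.
    scores = [
        sum(s * h for s, h in zip(decoder_state, h_i))
        for h_i in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * h_i[k] for w, h_i in zip(weights, encoder_states))
        for k in range(dim)
    ]
    return weights, context


# Made-up hidden states for "The", "cat", "sat" (one-hot for readability).
encoder_states = [
    [1.0, 0.0, 0.0],  # The
    [0.0, 1.0, 0.0],  # cat
    [0.0, 0.0, 1.0],  # sat
]
# Made-up decoder state at the step that should emit "chat".
decoder_state = [0.0, 3.0, 0.0]

weights, context = attention_context(decoder_state, encoder_states)
print([round(w, 2) for w in weights])  # prints [0.05, 0.91, 0.05]
```

Because the context is recomputed at every decoding step, a different decoder state would move the spotlight to a different source word without changing the encoder states.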

Why the Transformer was the next leap

flowchart TB
    subgraph RNN["RNN cross-attention"]
        R1["Sequential O(n) steps"]
        R2[Decoder queries encoder memory]
    end
    subgraph TR["Transformer self-attention"]
        T1[Parallel token mesh]
        T2["O(1) token-to-token links"]
    end
    RNN --> TR
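The contrast in the diagram can be illustrated with a stripped-down self-attention pass. This is a sketch under simplifying assumptions: it uses scaled dot-product scoring with the token vectors themselves as queries, keys, and values (no learned projections, no multiple heads), and the function name `self_attention` and the example vectors are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Every token attends directly to every token, including itself."""
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    all_weights, outputs = [], []
    # Each iteration reads only `tokens`, never a previous iteration's result,
    # so the rows could be computed in parallel (unlike the RNN recurrence).
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in tokens
        ]
        row = softmax(scores)
        outputs.append([
            sum(w * value[c] for w, value in zip(row, tokens))
            for c in range(dim)
        ])
        all_weights.append(row)
    return all_weights, outputs


# Made-up 2-dimensional vectors for "The", "cat", "sat".
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn_weights, attn_outputs = self_attention(tokens)

# One row of weights per token: an n x n table of direct token-to-token links.
print(len(attn_weights), len(attn_weights[0]))  # prints "3 3"
```

The n x n weight table is what gives any two positions a one-hop connection. The trade-off is that its size grows quadratically with sequence length.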