Reading notes: Attention Is All You Need (Vaswani et al., 2017)
The paper that launched a thousand follow-ups. Worth re-reading periodically because the original is much shorter and more focused than the discourse around it suggests.
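What the paper actually proposes
Replace recurrence and convolution with attention as the sole sequence transduction mechanism: self-attention inside the encoder and the decoder, plus encoder-decoder attention between them. The result is the Transformer architecture.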
Three key ideas, none of which were entirely new on their own:
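- Multi-head attention: several scaled dot-product attention computations run in parallel on lower-dimensional projections of the input, then concatenated and projected once more.
- Positional encoding: fixed sinusoids in the final model. The paper also tried learned position embeddings and reports nearly identical results.
- Layer normalisation and residual connections around every sub-layer.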
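The three ingredients above fit in a few lines of NumPy. This is a minimal sketch, not the paper's reference code: the function names, random weights, and toy sizes are assumptions made for the example, the layer norm omits the learned gain and bias, and masking and dropout are left out.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]    # (1, d_model / 2), assumes even d_model
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    heads = softmax(scores) @ v                            # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o                                    # final output projection

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
x = rng.normal(size=(seq_len, d_model)) + sinusoidal_positions(seq_len, d_model)
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

# Post-norm residual wrapper: LayerNorm(x + Sublayer(x)).
out = layer_norm(x + multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads))
print(out.shape)  # (5, 16)
```

Splitting `d_model` across heads keeps the total cost close to that of a single full-width head, which is the design choice the paper describes.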
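What surprised me on re-reading
The training time argument is the headline claim and it is doing a lot of work. The architectural elegance gets all the attention (no pun intended) but the parallelism story is the actual contribution. RNNs were not slow because of anything intrinsic to their arithmetic; they were slow because each hidden state depends on the previous one, which rules out parallelising across positions within a sequence. The paper's Table 1 makes this explicit: a recurrent layer needs O(n) sequential operations for a length-n sequence, a self-attention layer O(1).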
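A toy contrast makes the dependency structure concrete. In the recurrent update, step t cannot start until step t-1 has finished, so the loop over positions cannot be removed. The attention scores for all pairs of positions come out of a single matrix product. The sizes, the plain tanh cell, and the projection-free attention are assumptions chosen for brevity, and NumPy on a CPU will not reproduce the GPU speed-up itself.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 8
x = rng.normal(size=(seq_len, d))
w_x = rng.normal(size=(d, d)) * 0.1
w_h = rng.normal(size=(d, d)) * 0.1

def rnn_states(x, w_x, w_h):
    """Plain tanh RNN: the loop is inherently sequential."""
    h = np.zeros(x.shape[1])
    states = []
    for x_t in x:                      # step t needs h from step t-1
        h = np.tanh(x_t @ w_x + h @ w_h)
        states.append(h)
    return np.stack(states)

def self_attention(x):
    """Single-head self-attention without learned projections (simplification)."""
    scores = x @ x.T / np.sqrt(x.shape[1])        # all position pairs in one matmul
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

print(rnn_states(x, w_x, w_h).shape, self_attention(x).shape)  # (6, 8) (6, 8)
```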
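What was not in the paper
- Any discussion of tokenisation trade-offs. The experiments do use byte-pair encoding (English-German) and a word-piece vocabulary (English-French), but only as a setup detail.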
- Pre-norm vs post-norm (the paper uses post-norm; pre-norm is now standard).
- Layer-wise learning rate scaling.
- Anything about scaling laws.
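The post-norm versus pre-norm distinction comes down to where the normalisation sits relative to the residual add. A minimal sketch, assuming a hypothetical `sublayer` that stands in for attention or the feed-forward block, and a layer norm without learned parameters:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_norm(x, sublayer):
    # As described in the paper: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer(x))

def pre_norm(x, sublayer):
    # Later variant: x + Sublayer(LayerNorm(x)); the residual path is not normalised.
    return x + sublayer(layer_norm(x))

# Toy input and a hypothetical ReLU-linear sublayer, for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w = rng.normal(size=(8, 8)) * 0.1
sublayer = lambda h: np.maximum(h @ w, 0.0)

post = post_norm(x, sublayer)
pre = pre_norm(x, sublayer)
```

In the pre-norm form the input `x` reaches the output unchanged through the residual path, which is the property usually credited with making deep stacks easier to train.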
Worth re-reading alongside
- Bahdanau et al.: Neural Machine Translation by Jointly Learning to Align and Translate (2015). The attention mechanism the Transformer paper builds on.
- Devlin et al.: BERT (2018). What happens when you go encoder-only.
- Radford et al.: GPT-2 paper (2019). What happens when you go decoder-only.