Reading notes: Attention Is All You Need (Vaswani et al., 2017)

The paper that launched a thousand follow-ups. Worth re-reading periodically because the original is much shorter and more focused than the discourse around it suggests.

What the paper actually proposes

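Replace recurrence and convolution with attention as the sole mechanism for sequence transduction. The result is the Transformer: an encoder-decoder architecture built from stacked self-attention and position-wise feed-forward layers.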
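
To make the core operation concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It assumes a single head and uses the raw inputs as queries, keys and values; the paper applies learned projection matrices first, which are left out here. The function names and the toy input are mine, not the paper's.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy sequence of three 2-dimensional token vectors (illustrative values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(x, x, x)
```

Every output row is a convex combination of the value vectors, so each position can draw on any other position in a single step.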

Three key ideas, none of which were entirely new on their own:

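  1. Multi-head attention: several scaled dot-product attention heads run in parallel on separate learned projections, with their outputs concatenated and projected once more.
  2. Positional encoding: sinusoidal in the reported models. The paper also tried learned position embeddings and found nearly identical results.
  3. Layer normalisation and residual connections around every sub-layer.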
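
The sinusoidal encoding in item 2 is easy to write down. The sketch below follows the paper's formula, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the function name and the small sizes in the usage line are my own choices for illustration.

```python
import math

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        # i walks over the even dimension indices (2i in the paper's notation).
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

# Illustrative sizes only; the paper's base model uses d_model = 512.
pe = sinusoidal_positional_encoding(4, 6)
```

Position 0 comes out as alternating 0 and 1 (sin 0 and cos 0), and each pair of dimensions oscillates at a different wavelength.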

What surprised me on re-reading

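The training time argument is the headline claim and it is doing a lot of work. The architectural elegance gets all the attention (no pun intended) but the parallelism story is the actual contribution. RNNs were not slow to train because recurrence is inherently expensive; they were slow because each hidden state depends on the previous one, so the time steps of a sequence cannot be computed in parallel on a GPU. The paper's layer comparison makes the same point: self-attention needs a constant number of sequential operations per layer, while a recurrent layer needs one per position.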
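
A toy sketch of that dependency structure, under heavy simplifying assumptions: scalar inputs, a made-up tanh recurrence, and unprojected attention. The names and weights are hypothetical. The point is only that the recurrent loop has to run in order, while each attention output reads nothing but the inputs and so can be computed in any order (or all at once).

```python
import math

def rnn_states(inputs, w_h=0.5, w_x=1.0):
    """Toy scalar recurrence: h_t = tanh(w_h * h_{t-1} + w_x * x_t)."""
    h = 0.0
    states = []
    for x in inputs:
        # Step t cannot start until step t-1 has produced h.
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

def attention_output(t, inputs):
    """Toy scalar self-attention output for position t; depends only on inputs."""
    scores = [inputs[t] * x for x in inputs]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w / total * x for w, x in zip(weights, inputs))

inputs = [0.1, -0.4, 0.7, 0.2]
states = rnn_states(inputs)

# Attention outputs computed in order, then in an arbitrary order.
in_order = [attention_output(t, inputs) for t in range(len(inputs))]
shuffled = {t: attention_output(t, inputs) for t in [2, 0, 3, 1]}
same = in_order == [shuffled[t] for t in range(len(inputs))]
```

This applies to training and to the encoder. Autoregressive decoding at inference time still generates one token at a time.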

What was not in the paper

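  • Any discussion of tokenisation choices. The paper uses byte-pair encoding for English-German and a word-piece vocabulary for English-French, but treats both as a preprocessing detail.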
  • Pre-norm vs post-norm (the paper uses post-norm, i.e. LayerNorm(x + Sublayer(x)); pre-norm is now the more common choice).
  • Layer-wise learning rate scaling.
  • Anything about scaling laws.
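
For the pre-norm vs post-norm item, the difference is only where the normalisation sits relative to the residual addition. A minimal sketch, assuming a layer norm without the learned gain and bias and a hypothetical stand-in sublayer in place of attention or the feed-forward network:

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalise a vector to zero mean and (roughly) unit variance.
    # The learned gain and bias of a real layer norm are omitted here.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_norm_block(x, sublayer):
    """Arrangement in the original paper: LayerNorm(x + Sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_block(x, sublayer):
    """Later variant: x + Sublayer(LayerNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def toy_sublayer(x):
    # Hypothetical stand-in for an attention or feed-forward sublayer.
    return [2.0 * v for v in x]

x = [1.0, 2.0, 3.0, 4.0]
post = post_norm_block(x, toy_sublayer)
pre = pre_norm_block(x, toy_sublayer)
```

In the post-norm block the output is always renormalised, while the pre-norm block leaves the residual path untouched, so the input passes through unchanged apart from the added sublayer output.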

Worth re-reading alongside

  • Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate (2015). The attention mechanism the Transformer paper builds on.
  • Devlin et al., BERT (2018). What happens when you keep only the encoder.
  • Radford et al., GPT-2 paper (2019). What happens when you keep only the decoder.