Reading notes: Attention Is All You Need (Vaswani et al., 2017)

The paper that launched a thousand follow-ups. Worth re-reading periodically because the original is much shorter and more focused than the discourse around it suggests.

What the paper actually proposes

Replace recurrence and convolution with attention as the sole sequence transduction mechanism. The result is the Transformer: an encoder-decoder built from stacked self-attention and position-wise feed-forward layers.

Three key ideas, none of which were entirely new on their own:

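Multi-head attention: several scaled dot-product attention computations run in parallel on learned projections of the input, then concatenated and projected once more.
Positional encoding: fixed sinusoids. The paper also tried learned position embeddings, found nearly identical results, and kept the sinusoids.
Layer normalisation and residual connections around every sub-layer.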
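
The first two ideas are compact enough to sketch in plain Python. This is a minimal illustration, not the paper's implementation: the function names are my own, and `multi_head_self_attention` slices the feature vector per head instead of applying the learned projection matrices the paper uses, purely to keep the example short.

```python
import math


def sinusoidal_positions(n_positions, d_model):
    """Build an n_positions x d_model table of sinusoidal encodings.

    PE[pos][2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos][2i+1] = cos(pos / 10000^(2i / d_model))
    """
    table = []
    for pos in range(n_positions):
        row = []
        for j in range(d_model):
            # Dimensions 2i and 2i+1 share the same frequency.
            angle = pos / (10000 ** ((j - j % 2) / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out


def multi_head_self_attention(x, n_heads):
    """Attend separately on n_heads feature slices, then concatenate.

    Simplification (assumption): slicing stands in for the learned
    per-head projections, and the final output projection is omitted.
    """
    d_head = len(x[0]) // n_heads
    heads = []
    for h in range(n_heads):
        part = [row[h * d_head:(h + 1) * d_head] for row in x]
        heads.append(attention(part, part, part))
    # Concatenate the per-head outputs position by position.
    return [sum((heads[h][t] for h in range(n_heads)), []) for t in range(len(x))]
```

Calling `multi_head_self_attention(sinusoidal_positions(4, 8), n_heads=2)` returns four rows of eight values, one row per position.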

What surprised me on re-reading

The training-time argument is the headline claim, and it is doing a lot of work. The architectural elegance gets all the attention (no pun intended), but the parallelism story is the actual contribution. RNNs were not slow because of their per-step cost; they were slow because each hidden state depends on the previous one, so the time steps of a sequence cannot be computed in parallel on a GPU. Self-attention has no such dependency between positions during training.
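
A toy sketch of that dependency difference, using scalars in place of vectors. The weights `w_in` and `w_rec` and both function names are made up for illustration. The recurrent loop has to run in order because each state feeds the next; each attention output reads only the inputs, so the positions can be evaluated in any order (or all at once).

```python
import math


def rnn_states(inputs, w_in=0.5, w_rec=0.9):
    """Toy scalar recurrence: h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    h, states = 0.0, []
    for x in inputs:
        # This step cannot start until the previous h is known.
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


def attend_one(t, inputs):
    """Toy scalar self-attention output for position t.

    Depends only on the inputs, never on another position's output.
    """
    scores = [inputs[t] * x for x in inputs]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * x for e, x in zip(exps, inputs))


inputs = [0.1, -0.4, 0.7, 0.2]
in_order = [attend_one(t, inputs) for t in range(len(inputs))]
# Evaluating positions in reverse gives identical per-position results.
reversed_order = {t: attend_one(t, inputs) for t in reversed(range(len(inputs)))}
assert all(in_order[t] == reversed_order[t] for t in range(len(inputs)))
```

This only illustrates the training-time picture. Autoregressive decoding at inference is still sequential for the Transformer too, since each generated token feeds the next step.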

What was not in the paper

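Any discussion of tokenisation choices. The paper uses byte-pair encoding for English-German and word-piece for English-French without further comment; the debates came later.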
Pre-norm vs post-norm (the paper uses post-norm; pre-norm is now standard).
Layer-wise learning rate scaling.
Anything about scaling laws.
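
The pre-norm versus post-norm distinction in the list above comes down to where the normalisation sits relative to the residual addition. A minimal sketch, assuming a layer norm without learned gain or bias and a hypothetical `toy_sublayer` standing in for attention or the feed-forward network:

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalise a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_norm_block(x, sublayer):
    """The paper's arrangement: LayerNorm(x + Sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_norm_block(x, sublayer):
    """The later variant: x + Sublayer(LayerNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def toy_sublayer(x):
    # Hypothetical stand-in for attention or the feed-forward network.
    return [0.5 * v for v in x]
```

In the post-norm block the output is always normalised, so the residual stream is rescaled at every layer. In the pre-norm block the input `x` passes through the addition untouched and only the sub-layer's input is normalised.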

Worth re-reading alongside

Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate (2015). The attention mechanism the Transformer paper builds on.
Devlin et al., BERT (2018). What happens when you keep only the encoder.
Radford et al., GPT-2 paper (2019). What happens when you keep only the decoder.