Training for Greatness: Speed, BLEU Records, and the Multimodal Vision
Note: This post was AI-generated from rough notes using the blog generation workflow.
In 2017, a paper dropped that changed everything. “Attention Is All You Need” didn’t just introduce a new model — it invalidated an entire generation of architectures. If you’ve ever wondered why BERT, GPT, and every large language model you use today looks the way it does, the answer starts here: with two translation datasets, eight GPUs, and a deceptively simple idea.
The Datasets: Scale From the Start
The Transformer wasn’t tested on toy problems. The authors used two standard WMT 2014 benchmarks:
- English-German: 4.5 million sentence pairs
- English-French: 36 million sentence pairs
These are real, messy, production-scale datasets. The English-French set in particular is enormous — 36 million pairs is the kind of data that exposes architectural weaknesses fast. If your model can’t generalize at that scale, you’ll know.
Byte-pair encoding (BPE) was used to handle vocabulary, giving the model a shared subword vocabulary and letting it deal with rare words without blowing up the embedding size.
Training Speed: Where the Story Gets Interesting
Here’s the number that made people stop and read twice.
The base Transformer trained in 12 hours on 8 NVIDIA P100 GPUs. That’s it. One working day.
The big Transformer — the one that set state-of-the-art results — took 3.5 days on the same 8 GPUs.
Compare that to the competition at the time:
| Model | Training Time |
|---|---|
| Transformer (big) | 3.5 days, 8x P100 |
| ConvS2S | 9.5 days, 8x P100 |
| GNMT ensemble | ~6 weeks |
The GNMT ensemble took roughly six weeks. The Transformer beat it in 3.5 days. That’s not an incremental improvement — that’s a different category of efficiency. And the reason is fundamental: because Transformers process sequences in parallel rather than step-by-step, they can actually saturate GPU compute in a way that RNNs and LSTMs structurally cannot.
This matters financially. Training costs dropped by an order of magnitude. Research that previously required serious institutional compute budget suddenly became accessible to smaller teams.
The Training Recipe
The authors were careful about training configuration, and the details matter for practitioners.
Optimizer: Adam with β₁ = 0.9, β₂ = 0.98, ε = 10⁻⁹, and a custom learning rate schedule:
lrate = d_model**(-0.5) * min(step**(-0.5), step * warmup_steps**(-1.5))
This schedule warms up linearly for the first warmup_steps (set to 4000), then decays proportionally to the inverse square root of the step number. The warmup is important — jumping straight to a high learning rate early in training destabilizes things.
Regularization:
- Dropout with P = 0.1 applied to the output of each sub-layer before it’s added to the residual
- Label smoothing with ε = 0.1 — instead of training the model to output a hard 1.0 for the correct token, you smooth the target distribution slightly
Label smoothing is easy to underestimate. It deliberately prevents the model from becoming overconfident, which improves generalization even though it hurts perplexity on the training set. In practice:
# Hard target: [0, 0, 1, 0, 0]
# Smoothed target with epsilon=0.1, vocab_size=5:
# [0.025, 0.025, 0.9, 0.025, 0.025]
smoothed = (1 - epsilon) * one_hot + epsilon / vocab_size
Small change. Meaningful impact on BLEU.
The BLEU Results: State of the Art, Decisively
Translation quality is measured by BLEU score — higher is better, and gains of even 0.5 points are typically considered significant.
English-to-German:
- Transformer (big): 28.4 BLEU
- Previous best: ~26.4 BLEU
- Improvement: more than 2.0 BLEU points
That’s not a marginal win. In competitive MT benchmarks, 2 BLEU points is a substantial jump.
English-to-French:
- Transformer (big): 41.0 BLEU
- This single model outperformed all previous ensemble models
Ensemble models — where you combine multiple trained models at inference time — are a standard trick for squeezing out extra performance. The fact that a single Transformer beat the best ensembles on English-French is a strong signal that the architecture itself was capturing something fundamentally better.
Inference: Beam Search Configuration
At inference time, the authors used beam search rather than greedy decoding:
beam_size = 4
length_penalty_alpha = 0.6
Beam search keeps the top-k candidate sequences at each step instead of just the single best token. The length penalty prevents the model from favoring shorter sequences (which tend to score higher under raw log-probability):
score(Y, X) = log P(Y|X) / length_penalty(|Y|)
length_penalty(length) = ((5 + length) / (5 + 1)) ** alpha
With α = 0.6, longer sequences are penalized less aggressively, leading to better-calibrated output lengths. These aren’t magic numbers — they were tuned on the validation set — but they’re a solid starting point for any sequence generation task.
Generalizing Beyond Translation
One of the more quietly impressive results in the paper: the authors tested the Transformer on English constituency parsing — a structurally different task that requires predicting tree structure, not just token sequences.
With minimal task-specific tuning, it achieved competitive results against models specifically designed for parsing. This wasn’t the main event of the paper, but it was a signal that the architecture wasn’t just a translation trick. It was learning something more general about sequence relationships.
The Multimodal Vision
The authors didn’t stop at text. The closing sections of the paper lay out an explicit research agenda for extending attention-based models to images, audio, and video.
For long inputs — think high-resolution images or long audio sequences — full self-attention over every position becomes computationally expensive (O(n²) in sequence length). The proposed solution: restrict attention to local neighborhoods of the input. Instead of attending to every position, each position only attends to a fixed-size window around it. This keeps the computation tractable while preserving the core attention mechanism.
The paper’s closing line is worth quoting directly:
“We are excited about the future of attention-based models.”
In retrospect, that reads less like a conclusion and more like a mission statement. Within a year, BERT and GPT-1 both appeared — both Transformer-based, both citing this paper, both pushing the boundary of what was thought possible.
The Legacy
The lineage is direct and unambiguous:
- BERT (2018): Transformer encoder, bidirectional, pretrained on masked language modeling
- GPT-1 (2018): Transformer decoder, autoregressive, pretrained on language modeling
- Every major LLM after that: GPT-2, GPT-3, GPT-4, LLaMA, Gemini, Claude — all Transformers
The architectural shift the 2017 paper initiated was this: from task-specific sequential models to a single general-purpose, fully parallelizable architecture. Before Transformers, the default assumption was that you needed different architectures for different tasks — CNNs for images, RNNs for sequences, task-specific layers for everything else. Transformers challenged that assumption and mostly won.
The Key Insight for Practitioners
If you take one thing from this post, make it this:
The Transformer’s quality comes from its ability to model relationships across the entire sequence simultaneously. When the model processes a token, it can directly attend to any other token in the input — not through a chain of recurrent states that degrades over distance, but through direct, learned attention weights.
This is why Transformers handle long-range dependencies better than LSTMs. It’s not about being deeper or having more data. It’s about the inductive bias of the architecture itself: every position can talk to every other position, and the model learns which conversations matter.
# Conceptually, attention says:
# for each query position, compute how much to weight each key position
attention_weights = softmax(Q @ K.T / sqrt(d_k))
output = attention_weights @ V
# Every position attends to every other position — simultaneously
That single operation, scaled up and stacked, is the foundation of modern AI.
Conclusion
The Transformer paper didn’t just set new benchmarks — it reset expectations about what was achievable, how fast, and at what cost. A 12-hour training run for a competitive base model. A 3.5-day run for state-of-the-art results. BLEU scores that invalidated years of ensemble engineering. And an architecture general enough to work on parsing, and eventually on images, audio, and everything else the authors imagined.
The future they were excited about in 2017 is the present we’re working in now. And it all started with attention.