Training for Greatness: Speed, BLEU Records, and the Multimodal Vision

March 21, 2026

Note: This post was AI-generated from rough notes using the blog generation workflow.

In 2017, a paper appeared that reshaped the field. “Attention Is All You Need” didn’t just introduce a new model; it displaced recurrent and convolutional networks as the default choice for sequence modeling. If you’ve ever wondered why BERT, GPT, and nearly every large language model you use today look the way they do, the answer starts here: with two translation datasets, eight GPUs, and a deceptively simple idea.

The Datasets: Scale From the Start

The Transformer wasn’t tested on toy problems. The authors used two standard WMT 2014 benchmarks:

English-German: 4.5 million sentence pairs
English-French: 36 million sentence pairs

These are real, messy, production-scale datasets. The English-French set in particular is enormous — 36 million pairs is the kind of data that exposes architectural weaknesses fast. If your model can’t generalize at that scale, you’ll know.

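For English-German, the authors used byte-pair encoding (BPE) with a shared source-target vocabulary of about 37,000 tokens; for English-French, they used a 32,000-token word-piece vocabulary. Either way, subword units let the model deal with rare words without blowing up the embedding size.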
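To make the subword idea concrete, here is a toy sketch of how byte-pair encoding learns its merges: repeatedly find the most frequent adjacent symbol pair and fuse it into one symbol. This is not the tokenizer used in the paper; the function name `learn_bpe`, the four-word corpus, and the number of merges are all illustrative assumptions.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn BPE merges from a {word: count} dict (toy sketch)."""
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab

# Hypothetical word frequencies, chosen only to show a shared suffix emerging.
merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=3)
print(merges)
print(vocab)
```

Frequent fragments such as a common suffix become single symbols, while rare words stay representable as sequences of smaller pieces.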

Training Speed: Where the Story Gets Interesting

Here’s the number that made people stop and read twice.

The base Transformer trained in 12 hours on 8 NVIDIA P100 GPUs. That’s it. One working day.

The big Transformer — the one that set state-of-the-art results — took 3.5 days on the same 8 GPUs.
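Those wall-clock figures follow from the step counts and per-step times the paper reports: 100,000 steps at roughly 0.4 seconds each for the base model, and 300,000 steps at roughly 1.0 second each for the big one. A quick arithmetic check, with variable names that are purely illustrative:

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR

# Step counts and approximate per-step times as reported in the paper.
base_steps, base_step_seconds = 100_000, 0.4
big_steps, big_step_seconds = 300_000, 1.0

base_hours = base_steps * base_step_seconds / SECONDS_PER_HOUR
big_days = big_steps * big_step_seconds / SECONDS_PER_DAY

print(f"base: {base_hours:.1f} hours")  # a little over 11 hours, quoted as 12
print(f"big:  {big_days:.2f} days")     # just under 3.5 days
```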

Compare that to the competition at the time:

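Model	Training cost (FLOPs, English-German)
Transformer (base)	3.3 × 10¹⁸
Transformer (big)	2.3 × 10¹⁹
ConvS2S	9.6 × 10¹⁸
GNMT + RL	2.3 × 10¹⁹
ConvS2S ensemble	7.7 × 10¹⁹
GNMT + RL ensemble	1.8 × 10²⁰

The paper compares training cost in estimated floating-point operations rather than wall-clock time, which keeps runs on different hardware comparable. On that measure, the GNMT ensemble needed roughly eight times the compute of the big Transformer, and the base Transformer was cheaper than every competitor listed while still scoring higher BLEU on English-German. That’s not an incremental improvement; that’s a different category of efficiency. And the reason is fundamental: because Transformers process all positions of a sequence in parallel during training rather than step-by-step, they can saturate GPU compute in a way that RNNs and LSTMs structurally cannot.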

This matters financially. Training costs dropped by an order of magnitude. Research that previously required serious institutional compute budget suddenly became accessible to smaller teams.

The Training Recipe

The authors were careful about training configuration, and the details matter for practitioners.

Optimizer: Adam with β₁ = 0.9, β₂ = 0.98, ε = 10⁻⁹, and a custom learning rate schedule:

lrate = d_model**(-0.5) * min(step**(-0.5), step * warmup_steps**(-1.5))

This schedule warms up linearly for the first warmup_steps (set to 4000), then decays proportionally to the inverse square root of the step number. The warmup is important — jumping straight to a high learning rate early in training destabilizes things.
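The schedule is easy to sanity-check in code. The sketch below is a direct transcription of the formula; the function name `transformer_lrate` is my own, and `d_model=512` assumes the base model's width.

```python
def transformer_lrate(step, d_model=512, warmup_steps=4000):
    """Learning rate at a given training step (steps are counted from 1)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Linear ramp up to the peak at warmup_steps, then inverse-square-root decay.
for step in (1, 1000, 4000, 16000, 100000):
    print(step, f"{transformer_lrate(step):.6f}")
```

The two arguments to `min` cross exactly at `step == warmup_steps`, which is where the rate peaks (around 7e-4 for these settings).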

Regularization:

Dropout with P = 0.1 applied to the output of each sub-layer before it’s added to the residual
Label smoothing with ε = 0.1 — instead of training the model to output a hard 1.0 for the correct token, you smooth the target distribution slightly
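Where the dropout sits is worth seeing explicitly. In the paper, each sub-layer's output is dropped out, added to the sub-layer's input, and then layer-normalized. Here is a minimal NumPy sketch of that wiring, assuming inverted dropout (rescaling the kept activations) and a layer norm without learned gain or bias; the helper names and the stand-in sub-layer are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector (no learned gain/bias here).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def dropout(x, p, rng, training=True):
    # Inverted dropout: zero each element with probability p, rescale the rest.
    if not training or p == 0.0:
        return x
    keep = rng.random(x.shape) >= p
    return x * keep / (1.0 - p)

def residual_sublayer(x, sublayer, p_drop, rng, training=True):
    # LayerNorm(x + Dropout(Sublayer(x)))
    return layer_norm(x + dropout(sublayer(x), p_drop, rng, training))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))  # 3 positions, 8 features
# A stand-in sub-layer; in the real model this is attention or a feed-forward net.
out = residual_sublayer(x, lambda h: 2.0 * h, p_drop=0.1, rng=rng)
print(out.shape)
```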

Label smoothing is easy to underestimate. It deliberately prevents the model from becoming overconfident. The paper notes that this hurts perplexity, because the model learns to be more unsure, but improves accuracy and BLEU. In practice:

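import numpy as np

epsilon, vocab_size = 0.1, 5
one_hot = np.array([0.0, 0.0, 1.0, 0.0, 0.0])  # hard target
# Keep (1 - epsilon) on the true token and spread epsilon uniformly over the vocabulary.
smoothed = (1 - epsilon) * one_hot + epsilon / vocab_size
# smoothed is [0.02, 0.02, 0.92, 0.02, 0.02] and still sums to 1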

Small change. Meaningful impact on BLEU.

The BLEU Results: State of the Art, Decisively

Translation quality is measured by BLEU score — higher is better, and gains of even 0.5 points are typically considered significant.
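For intuition about what the score measures, here is a simplified sketch: clipped n-gram precision for n = 1 to 4, combined by geometric mean and multiplied by a brevity penalty. Real BLEU is computed over a whole corpus with standardized tokenization, so treat this single-sentence, single-reference, unsmoothed version (and the name `simple_bleu`) as an illustration only, not the scoring used in the paper.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=4):
    """Single-reference, sentence-level BLEU on a 0-100 scale (no smoothing)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # The brevity penalty discourages candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    brevity = 1.0 if c > r else math.exp(1 - r / c)
    return 100 * brevity * math.exp(sum(log_precisions) / max_n)

reference = "the cat sat on the mat".split()
print(simple_bleu(reference, reference))
print(simple_bleu("the cat sat on a mat".split(), reference))
```

A single wrong word in a six-word sentence already costs a lot here, because it breaks several overlapping n-grams at once.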

English-to-German:

Transformer (big): 28.4 BLEU
Previous best: ~26.4 BLEU
Improvement: more than 2.0 BLEU points

That’s not a marginal win. In competitive MT benchmarks, 2 BLEU points is a substantial jump.

English-to-French:

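Transformer (big): 41.0 BLEU (later revisions of the paper report 41.8)
New single-model state of the art, at less than a quarter of the training cost of the previous best model

Ensemble models, where you combine multiple trained models at inference time, are a standard trick for squeezing out extra performance. On English-German, a single big Transformer beat the best previously reported ensembles by more than 2 BLEU; on English-French, the paper makes the narrower claim of beating every previously published single model. Either way, it is a strong signal that the architecture itself was capturing something fundamentally better.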
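If ensembling is unfamiliar, one common form is simply averaging the next-token probability distributions of several independently trained models before picking a token. The sketch below assumes uniform weights and made-up distributions; it illustrates the general trick, not how any of the systems in the paper's comparison built their ensembles.

```python
def ensemble_next_token_probs(per_model_probs):
    """Average next-token distributions from several models (uniform weights)."""
    num_models = len(per_model_probs)
    vocab_size = len(per_model_probs[0])
    return [
        sum(probs[i] for probs in per_model_probs) / num_models
        for i in range(vocab_size)
    ]

# Hypothetical next-token distributions from three models over a 4-token vocabulary.
model_outputs = [
    [0.50, 0.30, 0.15, 0.05],
    [0.20, 0.60, 0.15, 0.05],
    [0.30, 0.45, 0.20, 0.05],
]
averaged = ensemble_next_token_probs(model_outputs)
best_token = max(range(len(averaged)), key=averaged.__getitem__)
print(averaged, best_token)
```

The cost is the catch: every model in the ensemble has to be trained and then run at inference time, which is why a single model matching ensemble quality is notable.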

Inference: Beam Search Configuration

At inference time, the authors used beam search rather than greedy decoding:

beam_size = 4
length_penalty_alpha = 0.6

Beam search keeps the top-k candidate sequences at each step instead of just the single best token. The length penalty prevents the model from favoring shorter sequences (which tend to score higher under raw log-probability):

score(Y, X) = log P(Y|X) / length_penalty(|Y|)
length_penalty(length) = ((5 + length) / (5 + 1)) ** alpha

With α = 0.6, longer sequences are penalized less aggressively, leading to better-calibrated output lengths. These aren’t magic numbers — they were tuned on the validation set — but they’re a solid starting point for any sequence generation task.
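To see the length penalty change an outcome, here is a small, self-contained beam search sketch. The `next_log_probs` callback is a hypothetical model interface (prefix in, token log-probabilities out), and `toy_model` is a made-up distribution constructed so that the one-token output has the highest raw log-probability while a longer output wins once scores are length-normalized.

```python
import math

def length_penalty(length, alpha=0.6):
    return ((5 + length) / (5 + 1)) ** alpha

def beam_search(next_log_probs, eos, beam_size=4, alpha=0.6, max_len=10):
    """next_log_probs(prefix) -> {token: log_prob}; a hypothetical model interface."""
    beams = [((), 0.0)]  # (token tuple, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, logp in beams:
            for token, token_logp in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), logp + token_logp))
        # Keep only the beam_size highest-scoring extensions.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for seq, logp in candidates[:beam_size]:
            (finished if seq[-1] == eos else beams).append((seq, logp))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was reached
    # Rank completed hypotheses by length-normalized score.
    return max(finished, key=lambda item: item[1] / length_penalty(len(item[0]), alpha))

def toy_model(prefix):
    # Made-up probabilities, for illustration only.
    if len(prefix) >= 3:
        return {"</s>": math.log(1.0)}
    if not prefix:
        return {"a": math.log(0.55), "</s>": math.log(0.45)}
    return {"a": math.log(0.85), "</s>": math.log(0.15)}

print(beam_search(toy_model, eos="</s>", alpha=0.6)[0])  # the four-token hypothesis
print(beam_search(toy_model, eos="</s>", alpha=0.0)[0])  # no normalization: stops immediately
```

Setting `alpha=0.0` makes the penalty equal to 1 for every length, which recovers ranking by raw log-probability and shows the short-sequence bias directly.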

Generalizing Beyond Translation

One of the more quietly impressive results in the paper: the authors tested the Transformer on English constituency parsing — a structurally different task that requires predicting tree structure, not just token sequences.

With minimal task-specific tuning, it achieved competitive results against models specifically designed for parsing. This wasn’t the main event of the paper, but it was a signal that the architecture wasn’t just a translation trick. It was learning something more general about sequence relationships.

The Multimodal Vision

The authors didn’t stop at text. The paper’s conclusion states their plan to extend the Transformer to input and output modalities other than text, naming images, audio, and video.

For long inputs (think high-resolution images or long audio sequences), full self-attention over every position becomes computationally expensive: O(n²) in sequence length. The direction the authors propose to investigate is local, restricted attention: instead of attending to every position, each position attends only to a fixed-size neighborhood around it. This keeps the computation tractable while preserving the core attention mechanism.
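Restricted attention can be sketched as ordinary attention with a band mask: scores for positions outside the window are set to negative infinity so they get zero weight after the softmax. The paper does not give an implementation, so the function name and `window` parameter below are my own. Note that this dense version still builds the full n × n score matrix, so it shows the masking pattern but not the compute savings; an efficient version would only compute scores inside the band.

```python
import numpy as np

def local_attention(Q, K, V, window):
    """Self-attention where position i only sees positions within `window` of i."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Mask out every pair of positions farther apart than `window`.
    positions = np.arange(n)
    outside = np.abs(positions[:, None] - positions[None, :]) > window
    scores = np.where(outside, -np.inf, scores)
    # Numerically stable softmax over each row; masked entries become exactly 0.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))  # 6 positions, 4 features
out, weights = local_attention(X, X, X, window=1)
print(np.round(weights, 2))  # nonzero entries form a band around the diagonal
```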

One line from the paper’s conclusion is worth quoting directly:

“We are excited about the future of attention-based models and plan to apply them to other tasks.”

In retrospect, that reads less like a conclusion and more like a mission statement. GPT-1 and BERT both appeared in 2018, both Transformer-based, both citing this paper, both pushing the boundary of what was thought possible.

The Legacy

The lineage is direct and unambiguous:

BERT (2018): Transformer encoder, bidirectional, pretrained on masked language modeling
GPT-1 (2018): Transformer decoder, autoregressive, pretrained on language modeling
Every major LLM after that: GPT-2, GPT-3, GPT-4, LLaMA, Gemini, Claude — all Transformers

The architectural shift the 2017 paper initiated was this: from task-specific sequential models to a single general-purpose, fully parallelizable architecture. Before Transformers, the default assumption was that you needed different architectures for different tasks — CNNs for images, RNNs for sequences, task-specific layers for everything else. Transformers challenged that assumption and mostly won.

The Key Insight for Practitioners

If you take one thing from this post, make it this:

The Transformer’s quality comes from its ability to model relationships across the entire sequence simultaneously. When the model processes a token, it can directly attend to any other token in the input — not through a chain of recurrent states that degrades over distance, but through direct, learned attention weights.

This is why Transformers handle long-range dependencies better than LSTMs. It’s not about being deeper or having more data. It’s about the inductive bias of the architecture itself: every position can talk to every other position, and the model learns which conversations matter.

# Conceptually, attention says:
# for each query position, compute how much to weight each key position
attention_weights = softmax(Q @ K.T / sqrt(d_k))
output = attention_weights @ V
# Every position attends to every other position — simultaneously
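For anyone who wants to run it, here is the same operation as a minimal NumPy sketch for a single attention head with no masking and no batching. The shapes are assumptions for illustration (`d_k = 64` matches the per-head size of the base model), and the inputs are random rather than learned projections.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable softmax over the key dimension.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_k, d_v = 5, 64, 64
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_v))

output, attention_weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, attention_weights.shape)
```

Each row of `attention_weights` is a probability distribution over all key positions, which is the "every position can talk to every other position" property in concrete form.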

That single operation, scaled up and stacked, is the foundation of modern AI.

Conclusion

The Transformer paper didn’t just set new benchmarks — it reset expectations about what was achievable, how fast, and at what cost. A 12-hour training run for a competitive base model. A 3.5-day run for state-of-the-art results. BLEU scores that invalidated years of ensemble engineering. And an architecture general enough to work on parsing, and eventually on images, audio, and everything else the authors imagined.

The future they were excited about in 2017 is the present we’re working in now. And it all started with attention.