links
Attention mechanism for encoder-decoder neural machine translation that computes soft alignments over all source tokens each time it generates a target token, replacing the fixed-length context vector that bottlenecked earlier RNN-based architectures. The alignment scores, derived from a learned compatibility function between decoder hidden states and encoder annotations, let the model draw on a variable-length source representation. On English-French translation this yields BLEU scores comparable to the state-of-the-art phrase-based system, and performance stays robust on long sentences instead of degrading as sequence length increases. The attention paradigm underpins transformer-based language models (BERT, T5) used in semantic indexing and query-document relevance ranking: the learned token-to-token alignment weights give interpretable, context-sensitive representations that capture the long-range lexical dependencies important for passage retrieval and cross-lingual search quality.
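To make the alignment step concrete, here is a minimal sketch of one decoder step, assuming an additive compatibility function of the form score_j = v . tanh(W_s s + W_h h_j). The names additive_attention, W_s, W_h and v are hypothetical, and the toy weights are fixed by hand rather than learned, so this illustrates the mechanism only and is not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def additive_attention(decoder_state, annotations, W_s, W_h, v):
    """Return (context, weights) for a single decoder step.

    Assumed scoring function: score_j = v . tanh(W_s s + W_h h_j).
    """
    projected_state = matvec(W_s, decoder_state)
    scores = []
    for h in annotations:
        projected_h = matvec(W_h, h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, projected_h)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    # Soft alignment: one weight per source position, summing to 1.
    weights = softmax(scores)
    dim = len(annotations[0])
    # Context vector: weighted average of the encoder annotations.
    context = [sum(w * h[k] for w, h in zip(weights, annotations)) for k in range(dim)]
    return context, weights


# Toy inputs (hand-picked, not learned): three source positions, dimension 2.
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.5, -0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]
context, weights = additive_attention(decoder_state, annotations, identity, identity, [1.0, 1.0])
print("weights:", weights)
print("context:", context)
```

Because the weights are recomputed at every decoder step over however many annotations the encoder produced, the context vector has a fixed size while the source representation it summarises can be any length.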
T5 casts NLP tasks such as classification, summarisation, and QA into a single text-to-text format, so one transformer architecture can generalise across diverse content types. With C4 (the Colossal Clean Crawled Corpus), the authors set a widely used reference point for web-scale data cleaning (deduplication and quality heuristics). Most significantly, the paper systematically benchmarks pre-training objectives and scaling strategies, showing that diverse language tasks can be handled through unified transfer learning rather than task-specific engineering.
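The text-to-text idea can be sketched as a small formatting step in which every task becomes an (input string, target string) pair and a task prefix tells the model what to do. The helper to_text_pair and the prefix strings below are hypothetical choices for illustration and are not taken from the paper.

```python
# Hypothetical task prefixes; the exact strings are an assumption of this sketch.
TASK_PREFIXES = {
    "classification": "classify sentiment: ",
    "summarisation": "summarize: ",
    "qa": "answer question: ",
}


def to_text_pair(task, source, target):
    """Cast one labelled example into an (input text, target text) pair."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task: {task}")
    # Labels, summaries and answers are all emitted as plain strings.
    return TASK_PREFIXES[task] + source.strip(), str(target).strip()


examples = [
    to_text_pair("classification", "A warm, funny film.", "positive"),
    to_text_pair("summarisation", "Long article text goes here.", "Short summary."),
    to_text_pair(
        "qa",
        "question: Who wrote Hamlet? context: Hamlet is a play by Shakespeare.",
        "Shakespeare",
    ),
]

for model_input, model_target in examples:
    print(repr(model_input), "->", repr(model_target))
```

Since inputs and targets are both plain text, the same model, loss, and decoding procedure can serve all three tasks; only the prefix changes.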