links
Attention mechanism for encoder-decoder neural machine translation that dynamically computes soft alignments over all source tokens when generating each target token, replacing the fixed-length context vector bottleneck of prior RNN-based architectures. The model's alignment scores - derived from a learned compatibility function between decoder hidden states and encoder annotations - enable variable-length source representations, yielding state-of-the-art BLEU scores on English-French translation and demonstrating that performance no longer degrades on long sentences as sequence length increases. This attention paradigm directly underpins transformer-based language models (BERT, T5) used in semantic indexing and query-document relevance ranking, as the learned token-to-token alignment weights provide interpretable, context-sensitive representations that capture long-range lexical dependencies critical for passage retrieval and cross-lingual search quality.
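The alignment step can be sketched in a few lines of NumPy; the weight matrices, dimensions, and inputs below are random stand-ins for the learned parameters, not the paper's trained values.

```python
import numpy as np

def additive_attention(decoder_state, encoder_annotations, W_s, W_h, v):
    # e_j = v^T tanh(s W_s + h_j W_h): compatibility of the decoder
    # state with each source annotation, one score per source token
    scores = np.tanh(decoder_state @ W_s + encoder_annotations @ W_h) @ v
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # soft alignment over all source tokens
    context = weights @ encoder_annotations  # expected annotation under the alignment
    return context, weights

rng = np.random.default_rng(0)
d = 4
annotations = rng.normal(size=(6, d))  # one annotation per source token
state = rng.normal(size=d)             # current decoder hidden state
context, weights = additive_attention(state, annotations,
                                      rng.normal(size=(d, d)),
                                      rng.normal(size=(d, d)),
                                      rng.normal(size=d))
```

Because the context vector is recomputed for every target token, the source representation grows with the sentence instead of being squeezed into one fixed vector.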
Meilisearch’s breakdown of LSI positions it as a foundational retrieval method that utilises Singular Value Decomposition (SVD) to reduce high-dimensional term-document matrices into a lower-dimensional "latent space." By decomposing the original matrix into three constituent matrices (U, Σ, and Vᵀ), LSI captures hidden conceptual relationships (e.g., grouping "physician" and "doctor"), thereby addressing the retrieval failures of exact-match keyword systems. While LSI remains computationally efficient for small, static datasets, the breakdown highlights that its linear algebraic approach is increasingly superseded by Transformer-based embeddings and Vector Search, which offer superior scalability and deeper contextual understanding of polysemy and linguistic nuance in dynamic web environments.
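The decomposition is easy to demonstrate with NumPy on a toy term-document matrix; the vocabulary and counts below are invented purely to show the "physician"/"doctor"-style grouping effect.

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents.
# d0 and d1 are medical documents, d2 is about music.
A = np.array([
    [1, 0, 0],   # "doctor"    appears only in d0
    [0, 1, 0],   # "physician" appears only in d1
    [1, 1, 0],   # "hospital"  links the two medical docs
    [0, 0, 1],   # "guitar"    appears only in d2
    [0, 0, 1],   # "piano"     appears only in d2
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U Σ Vᵀ
k = 2                                 # keep the top-k latent "concepts"
docs = (np.diag(s[:k]) @ Vt[:k]).T    # documents projected into latent space

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# d0 and d1 share no terms, yet land together in the latent space,
# while the music document stays far from both.
```

Truncating to k dimensions is what surfaces the hidden conceptual relationships: the shared "hospital" occurrences pull the two medical documents onto the same latent axis.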
Google's Knowledge Graph is a structured entity database that maps real-world objects - people, places, organisations, and concepts - to semantically rich attribute sets and inter-entity relationships, replacing string-matched keyword lookup with disambiguated, meaning-based retrieval. The system resolves lexical ambiguity (e.g., "Taj Mahal" as monument vs. musician vs. restaurant) by anchoring queries to canonical entities with unique identifiers, drawing from synthesised sources including Freebase, Wikipedia, and the CIA World Factbook to populate typed properties and relational edges. This shifts ranking and indexing logic from document-to-keyword co-occurrence toward entity-to-entity graph traversal, enabling query expansion, direct answer surfacing, and contextual result clustering without requiring exact-match signals in crawled content.
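A minimal sketch of entity-centric lookup over a toy in-memory graph; the entity IDs, types, edges, and scoring heuristic below are invented for illustration and are not Google's actual schema or resolution algorithm.

```python
# Canonical entities with unique identifiers, typed properties,
# and relational edges (all values hypothetical).
ENTITIES = {
    "/m/tajmahal_monument": {"name": "Taj Mahal", "type": "Monument",
                             "edges": {"located_in": "/m/agra"}},
    "/m/tajmahal_musician": {"name": "Taj Mahal", "type": "Musician",
                             "edges": {"genre": "/m/blues"}},
}

def disambiguate(surface_form, context_terms):
    """Resolve an ambiguous query string to a canonical entity ID by
    scoring each candidate's type and neighbourhood against the query
    context - a crude stand-in for real entity resolution."""
    candidates = [(eid, e) for eid, e in ENTITIES.items()
                  if e["name"].lower() == surface_form.lower()]
    def score(e):
        signals = [e["type"].lower()] + list(e["edges"].values())
        return sum(any(t in s for s in signals) for t in context_terms)
    return max(candidates, key=lambda c: score(c[1]))[0]
```

The point of the sketch is the shift in data model: the query anchors to an entity ID, after which ranking can traverse edges ("located_in", "genre") rather than counting keyword co-occurrences.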
T5 unifies all NLP tasks - classification, summarisation, and QA - into a text-to-text format, allowing a single transformer architecture to generalise across diverse content types. By introducing the C4 (Colossal Clean Crawled Corpus), the authors established a gold standard for web-scale data cleaning (deduplication and quality heuristics). Most significantly, the paper provides a systematic benchmark of pre-training objectives and scaling strategies, showing that diverse language tasks can be mastered through unified transfer learning rather than task-specific engineering.
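The text-to-text framing amounts to prepending a task prefix and emitting the label or output as a string; the prefixes below follow the conventions reported in the paper, applied to toy inputs, and the helper itself is illustrative.

```python
# Every task becomes string in -> string out: the prefix tells the
# model which task it is performing, and even classification labels
# are generated as text (e.g. "acceptable" / "not acceptable").
def to_text_to_text(task, text):
    prefixes = {
        "translation": "translate English to German: ",
        "summarization": "summarize: ",
        "classification": "cola sentence: ",  # CoLA acceptability task
    }
    return prefixes[task] + text

example = to_text_to_text("translation", "That is good.")
# model input:  "translate English to German: That is good."
# model target: "Das ist gut."
```

Because inputs and outputs are always strings, the same architecture, loss, and decoding procedure serve every task with no task-specific heads.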
BEIR established a standardised framework of 18 diverse datasets (covering fact-checking, QA, and news) to measure zero-shot generalisation in Information Retrieval (IR). The benchmark's core finding is the "Generalisation Gap" - while dense retrieval models (like DPR) excel in-domain, they frequently underperform BM25 on out-of-domain tasks. This highlights a critical brittleness in neural IR, and it explains the continued necessity of lexical matching (keywords) as a robust signal that complements semantic interpretation in diverse or "long-tail" query environments.
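The lexical baseline that the dense models struggle to beat out-of-domain is Okapi BM25; a minimal sketch of its scoring function, over a toy tokenised corpus, with the standard k1 and b defaults:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Okapi BM25 relevance of `doc` (a token list) to the query,
    given `corpus` (a list of token lists) for IDF and length stats."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)   # document frequency
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        f = tf[t]
        # term frequency saturation (k1) and length normalisation (b)
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```

Because BM25 depends only on term overlap and corpus statistics, it needs no training data from the target domain - exactly the property that makes it robust where zero-shot dense retrievers falter.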
BERT (Bidirectional Encoder Representations from Transformers) pre-trains a deep transformer architecture using masked language modeling (MLM) and next sentence prediction (NSP) on unlabeled text, enabling simultaneous left-and-right context conditioning across all layers rather than the unidirectional or shallow-bidirectional approaches of predecessor models. Fine-tuned BERT established state-of-the-art performance on 11 NLP benchmarks - including a 7.7-point absolute improvement on the GLUE score and a 1.5 F1-point gain on SQuAD v1.1 - by learning rich, context-dependent token representations transferable to downstream tasks with minimal task-specific architecture modification. BERT's deep bi-directionality enables query-document semantic matching that captures polysemous terms, long-range syntactic dependencies, and implicit query intent, directly improving relevance ranking signals beyond keyword co-occurrence and making it deployable as a reranker layer over candidate retrieval sets.
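The MLM corruption scheme is simple to sketch: roughly 15% of positions become prediction targets, and of those, 80% are replaced with [MASK], 10% with a random token, and 10% left unchanged. The function below is an illustrative reimplementation of that recipe, not BERT's actual data pipeline.

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """BERT-style MLM corruption: returns the corrupted token list and
    the indices the model must predict the original tokens at."""
    rng = random.Random(seed)
    out, targets = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < mask_rate:
            targets.append(i)
            r = rng.random()
            if r < 0.8:
                out[i] = "[MASK]"          # 80%: replace with mask token
            elif r < 0.9:
                out[i] = rng.choice(vocab) # 10%: replace with random token
            # else 10%: keep the original token (still a target)
    return out, targets
```

Keeping some targets unmasked or randomised discourages the model from relying on [MASK] being present at inference time, which is what lets the bidirectional encoder transfer to clean downstream text.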
Information Foraging Theory adapts optimal foraging theory from behavioural ecology to model how humans allocate attention and navigation effort across information environments, treating users as rational agents maximising information gain per unit cost. The central mechanism, information scent, quantifies the proximal cues (link text, snippets, anchor context) users evaluate to predict distal information value, with patch exploitation decisions triggering site abandonment when marginal scent signals fall below inter-patch travel cost thresholds. For search systems, this framework demands that crawlers prioritise anchor-rich, semantically coherent link graphs, that ranking signals weight snippet-to-content semantic fidelity as a scent-accuracy proxy, and that indexing architectures surface high-scent pathway structures - since pages generating low click-through despite high impressions signal scent-content mismatch, a recoverable relevance failure distinct from authority or freshness deficits.
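The patch-abandonment rule reduces to a simple stopping criterion: keep following cues while their scent exceeds the cost of travelling to another patch. The function and numbers below are a toy rendering of that decision, not a calibrated model from the theory's literature.

```python
def forage(scents, travel_cost):
    """Number of links a forager follows before abandoning the patch:
    exploitation stops at the first proximal cue whose scent (expected
    information gain) falls below the inter-patch travel cost."""
    followed = 0
    for scent in scents:
        if scent < travel_cost:
            break          # marginal gain no longer beats leaving
        followed += 1
    return followed

# High-scent cues sustain exploitation; one weak cue triggers exit,
# which is why scent-content mismatch shows up as early abandonment.
```

Under this framing, a page with high impressions but low click-through is emitting weaker scent than its content warrants - a fixable presentation failure (snippets, anchors) rather than an authority or freshness deficit.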