links
Google research paper introducing Percolator, a system built on Bigtable that enables incremental processing of large datasets through distributed transactions and a notification-driven computation model. It replaced the MapReduce batch pipeline for index construction, allowing Google to update its search index continuously as individual pages are crawled rather than waiting for a full global rebuild. The system provides snapshot isolation (via multi-version timestamps stored in Bigtable) to keep distributed, cross-table updates consistent, while "observers" - small pieces of code registered on columns - fire in response to specific data changes and propagate updates through the indexing pipeline. This architecture underpins the shift from the "Google Dance" (monthly index refreshes) to the Caffeine update, providing the infrastructure for near-real-time discovery of content and backlinks, though the ultimate "propagation wave" through various ranking layers still prevents instantaneous global ranking changes.
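The observer model can be sketched in miniature. Below is a minimal, in-memory Python stand-in (the table layout, `observe` decorator, and column names are invented for illustration; the real system runs distributed ACID transactions over Bigtable, with notifications decoupled from the triggering write):

```python
# In-memory toy of Percolator's notification-driven model. Illustrative only:
# the real system uses distributed transactions with snapshot isolation.
table = {}          # row -> {column: value}
observers = {}      # column -> callbacks fired when that column is written

def observe(column):
    """Register a function as an observer of writes to `column`."""
    def register(fn):
        observers.setdefault(column, []).append(fn)
        return fn
    return register

def write(row, column, value):
    """Write a cell, then notify observers so processing cascades
    incrementally instead of re-running a global batch job."""
    table.setdefault(row, {})[column] = value
    for fn in observers.get(column, []):
        fn(row, value)

@observe("raw_content")
def index_document(row, content):
    # Downstream stage: derive an 'indexed' column from the new content.
    write(row, "indexed", content.upper())

# Crawling one page triggers re-indexing for just that page.
write("url:example.com", "raw_content", "hello web")
```

The key property the sketch preserves: work is triggered per-cell by data changes, so one new page flows through the pipeline without touching the rest of the corpus.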
DSSM (Deep Structured Semantic Model) employs a deep neural network with a word hashing layer to project queries and documents into a shared low-dimensional semantic space, trained end-to-end on clickthrough data by maximising the posterior probability of clicked documents given queries. The model uses letter-trigram-based word hashing to reduce input dimensionality from 500K+ vocabulary terms to ~30K features, achieving statistically significant NDCG gains (~1-2% absolute) over BM25, LSA, and PLSA baselines in web search ranking tasks. This architecture enables ranking systems to overcome lexical mismatch between queries and documents - surfacing semantically relevant results where no keyword overlap exists - directly impacting relevance scoring layers in learning-to-rank pipelines without requiring manual feature engineering or query expansion modules.
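The word-hashing step is simple to sketch. A minimal Python illustration of letter-trigram hashing and the resulting sub-word overlap (helper names are invented; the actual DSSM feeds these sparse vectors into a deep network rather than comparing them directly):

```python
# Letter-trigram word hashing as in DSSM (illustrative sketch; the paper
# maps words onto a fixed ~30K trigram vocabulary as network input, whereas
# here we just compare the sparse count vectors directly).
from collections import Counter
import math

def letter_trigrams(word):
    """'good' -> ['#go', 'goo', 'ood', 'od#'], with # as boundary marker."""
    marked = f"#{word}#"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]

def hash_text(text):
    """Map a whitespace-tokenised string to a sparse trigram count vector."""
    counts = Counter()
    for word in text.lower().split():
        counts.update(letter_trigrams(word))
    return counts

def cosine(a, b):
    dot = sum(c * b[t] for t, c in a.items() if t in b)
    norm = lambda v: math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

# Morphological variants share trigrams, so they match with zero exact
# term overlap - the sub-word robustness the hashing layer provides.
sim = cosine(hash_text("running shoes"), hash_text("runner shoe"))
```

Note how the trigram vocabulary is bounded by the alphabet, not the corpus, which is what collapses the 500K-term input space to tens of thousands of features.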
A relative preference model that reframes clickthrough data as comparative judgments between examined results rather than absolute relevance signals, using eye-tracking and controlled experiments to calibrate interpretation of user clicks. Identifies that absolute click rates carry strong presentation bias (position, snippet quality), but relative click patterns - specifically "clicked above non-clicked" pairs - yield reliable relevance signals robust to trust bias and ranking artefacts. Enables search systems to extract high-quality implicit feedback for learning-to-rank algorithms by mining pairwise preference constraints from click logs rather than treating raw click frequency as a direct relevance proxy.
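The "clicked above non-clicked" extraction can be written down directly. A minimal Python version of the Click > Skip-Above heuristic (function name and data layout are invented for illustration):

```python
def skip_above_preferences(ranking, clicked):
    """Joachims-style 'Click > Skip Above' extraction: a clicked document
    is preferred over every higher-ranked document that was presumably
    examined but not clicked. Returns (preferred, over) pairs.

    ranking: list of doc ids in displayed order
    clicked: set of doc ids the user clicked
    """
    prefs = []
    for pos, doc in enumerate(ranking):
        if doc in clicked:
            for above in ranking[:pos]:
                if above not in clicked:
                    prefs.append((doc, above))
    return prefs

# Example: user clicked ranks 3 and 5, skipping ranks 1, 2 and 4.
pairs = skip_above_preferences(["d1", "d2", "d3", "d4", "d5"], {"d3", "d5"})
# Yields d3 > d1, d3 > d2, d5 > d1, d5 > d2, d5 > d4.
```

These pairwise constraints are exactly the training input a ranking SVM or other pairwise learning-to-rank method consumes, without ever treating raw click counts as relevance labels.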
BEIR established a standardised framework of 18 diverse datasets spanning nine task types (including fact-checking, question answering, and news retrieval) to measure zero-shot generalisation in Information Retrieval (IR). The benchmark's core finding is the "generalisation gap": while dense retrieval models (like DPR) excel in-domain, they frequently underperform BM25 on out-of-domain tasks, exposing a critical brittleness in neural IR. This explains the continued necessity of lexical matching (keywords) as a robust signal that complements semantic interpretation in diverse or "long-tail" query environments.
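The lexical baseline in question is Okapi BM25. A self-contained Python sketch (not BEIR's implementation; whitespace tokenisation and the common defaults k1=1.2, b=0.75 are assumptions) shows the term-frequency saturation and length normalisation that make it so robust out of domain:

```python
# Minimal Okapi BM25 over whitespace-tokenised documents (illustrative).
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenised = [d.lower().split() for d in docs]
    N = len(tokenised)
    avgdl = sum(len(d) for d in tokenised) / N
    df = Counter()                      # document frequency per term
    for d in tokenised:
        df.update(set(d))
    scores = []
    for d in tokenised:
        tf = Counter(d)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            # tf saturates via k1; b length-normalises against avgdl.
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

Because the score is zero without exact term overlap, BM25 fails on lexical mismatch yet never hallucinates similarity, which is precisely the trade-off the benchmark surfaces against dense retrievers.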
Web-scale probabilistic knowledge base that automatically fuses extracted facts from Web content with prior knowledge derived from existing knowledge bases (chiefly Freebase) using a supervised machine learning pipeline combining extractions, graph-based inference, and calibrated confidence scoring. The system ingests 1.6 billion candidate facts, assigns calibrated probabilities via classifier ensembles and embedding-based propagation, and yields 271 million facts with ≥0.9 confidence—surpassing Freebase's human-curated 350 million facts in breadth while maintaining measurable precision. This architecture enables automated, continuously updated entity-attribute resolution at crawl scale, directly powering entity disambiguation, Knowledge Graph population, and confidence-weighted fact retrieval without reliance on manual curation bottlenecks.
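The fusion step can be illustrated with a toy log-odds combination of an extractor's confidence and a graph-prior probability (weights and bias here are invented; Knowledge Vault learns its fusion classifier from labelled data rather than using fixed coefficients):

```python
# Toy log-odds fusion of extractor confidence with a prior probability.
# Illustrative stand-in for Knowledge Vault's learned fusion classifier.
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def fuse(extraction_prob, prior_prob, w_ext=1.0, w_prior=1.0, bias=0.0):
    """Combine two probability estimates in log-odds space: agreement
    between extractor and prior pushes confidence up; disagreement
    pulls it back toward uncertainty."""
    return sigmoid(w_ext * logit(extraction_prob)
                   + w_prior * logit(prior_prob)
                   + bias)
```

The design point this captures: a noisy extraction supported by the prior graph structure ends up with higher calibrated confidence than the same extraction contradicting the prior, which is what lets the pipeline keep only high-probability triples.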
Information Foraging Theory adapts optimal foraging theory from behavioural ecology to model how humans allocate attention and navigation effort across information environments, treating users as rational agents maximising information gain per unit cost. The central mechanism, information scent, quantifies the proximal cues (link text, snippets, anchor context) users evaluate to predict distal information value; patch-leaving decisions follow the marginal value theorem, with users abandoning a site once the marginal rate of information gain falls below the average rate achievable elsewhere, inter-patch travel cost included. For search systems, this framework implies that crawlers should prioritise anchor-rich, semantically coherent link graphs, that ranking signals should weight snippet-to-content semantic fidelity as a scent-accuracy proxy, and that indexing architectures should surface high-scent pathway structures. Pages generating low click-through despite high impressions signal a scent-content mismatch - a recoverable relevance failure distinct from authority or freshness deficits.
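The patch-leaving rule is Charnov's marginal value theorem, which the theory borrows wholesale. A small Python sketch with an invented diminishing-returns gain curve shows the leave-time computation (all numbers are illustrative, not from the paper):

```python
# Marginal value theorem applied to an information patch (page/site).
# The gain curve and its parameters are invented for illustration.
import math

def gain(t, G=10.0, tau=3.0):
    """Cumulative information gained after t units inside a patch:
    diminishing returns toward an asymptote G."""
    return G * (1 - math.exp(-t / tau))

def optimal_leave_time(travel_time, G=10.0, tau=3.0, dt=0.01):
    """Leave the patch when the marginal gain rate g'(t) drops to the
    overall rate g(t) / (travel_time + t) - the MVT stopping rule."""
    t = dt
    while True:
        marginal = (gain(t + dt, G, tau) - gain(t, G, tau)) / dt
        overall = gain(t, G, tau) / (travel_time + t)
        if marginal <= overall:
            return t
        t += dt
```

As the rule predicts, costlier travel between patches (weaker alternatives, harder navigation) rationally extends time spent in the current patch, the same logic behind users tolerating mediocre pages when the surrounding scent landscape is poor.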