relevance-scoring + vector-space-model

A vector space model for automatic indexing ptabdata.blob.core.windows.net pdf

The vector space model represents documents and queries as weighted term vectors in a high-dimensional space, so that similarity is computed with a cosine measure rather than by Boolean exact-match retrieval. Experiments on the SMART system demonstrate that term weighting schemes combining term frequency (TF) with inverse document frequency (IDF) consistently outperform binary indexing, with IDF-weighted vectors producing superior recall-precision tradeoffs across multiple test collections. This mechanism directly powers ranked retrieval: documents are scored against queries with continuous similarity values, replacing brittle keyword matching with a scalable, corpus-aware relevance signal. The same TF and IDF ingredients carry over to modern inverted-index scoring functions such as BM25 and to learning-to-rank feature generation.
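The scoring idea summarised above can be sketched in a few lines of Python. This is a minimal illustration, not the SMART system's implementation: it assumes raw term counts for TF, `log(N / df)` for IDF, and whitespace tokenisation, and the function names (`tfidf_vectors`, `cosine`, `rank`) and the toy documents are made up for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vectors, idf): one sparse TF-IDF dict per tokenised document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed IDF variant: log(N / df). Other variants add smoothing.
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in docs
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Score every document against the query, best match first."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms unseen in the collection get weight 0.
    q = {term: tf * idf.get(term, 0.0) for term, tf in Counter(query).items()}
    scores = [(cosine(q, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scores, reverse=True)


# Toy collection (hypothetical data).
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stock markets fell sharply".split(),
]
ranking = rank("cat mat".split(), docs)
```

Here `ranking` is a list of `(score, document_index)` pairs. The first document shares both query terms and ranks first, the second shares only the more common term "cat", and the third shares nothing and scores zero, which is the graded behaviour that Boolean matching cannot express.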

Improvement of HITS-based Algorithms on Web Documents web.archive.org

The paper proposes modifications to the HITS algorithm that address link-spam vulnerabilities and topic drift by incorporating content similarity analysis and anchor text weighting into hub-authority score propagation. Experiments demonstrate that filtering semantically irrelevant links before the iterative score computation reduces noise amplification, producing authority scores that reflect genuine topical relevance more accurately than raw link popularity does. These refinements directly affect crawl prioritisation and authority-based ranking systems: hub-authority scores become resistant to manipulated link structures, which improves the signal quality of link graph analysis for topical authority determination.
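As a rough illustration of filtering links before score propagation, the sketch below runs a plain HITS power iteration with and without a similarity filter. It is not the paper's algorithm: the hard threshold, the precomputed per-link similarity values, and the names `hits` and `filter_edges` are all assumptions made for the example, and anchor text weighting is left out.

```python
import math


def hits(edges, iterations=50):
    """Plain HITS power iteration over a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum of hub scores of pages linking in.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub update: sum of authority scores of pages linked to.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


def filter_edges(edges, similarity, threshold=0.2):
    """Keep only links whose content similarity meets the (assumed) threshold."""
    return [e for e in edges if similarity.get(e, 0.0) >= threshold]


# Hypothetical graph: two on-topic links to "b", three link-farm links to "x".
edges = [("a", "b"), ("c", "b"),
         ("spam1", "x"), ("spam2", "x"), ("spam3", "x")]
# Hypothetical source-target content similarities in [0, 1].
similarity = {("a", "b"): 0.8, ("c", "b"): 0.7,
              ("spam1", "x"): 0.05, ("spam2", "x"): 0.04, ("spam3", "x"): 0.06}

_, raw_auth = hits(edges)
_, clean_auth = hits(filter_edges(edges, similarity))
```

On the unfiltered graph the link farm's target "x" ends up with a higher authority score than "b" simply because it has more in-links. After filtering, the low-similarity links are gone and "b" is the top authority. A softer variant would scale each link's contribution by its similarity instead of dropping it outright.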

Information Foraging Theory peterpirolli.com pdf

Information Foraging Theory adapts optimal foraging theory from behavioural ecology to model how humans allocate attention and navigation effort across information environments, treating users as rational agents maximising information gain per unit cost. The central mechanism, information scent, quantifies the proximal cues (link text, snippets, anchor context) that users evaluate to predict distal information value. Patch exploitation decisions trigger site abandonment when marginal scent signals fall below inter-patch travel cost thresholds. For search systems, this framework demands that crawlers prioritise anchor-rich, semantically coherent link graphs, that ranking signals weight snippet-to-content semantic fidelity as a scent-accuracy proxy, and that indexing architectures surface high-scent pathway structures. The reasoning is that pages generating low click-through despite high impressions signal a scent-content mismatch, a recoverable relevance failure distinct from authority or freshness deficits.
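Two of the ideas above lend themselves to a small sketch: the patch-leaving rule and the low-click-through mismatch signal. The code below is a deliberately simplified reading, not Pirolli's formal model. The scent values, the thresholds, and the function names `pages_before_leaving` and `is_scent_mismatch` are hypothetical and chosen only to make the logic concrete.

```python
def pages_before_leaving(scent_by_page, travel_cost_threshold):
    """Count pages viewed in a patch (site) before the user abandons it.

    Simplified rule: the user leaves as soon as the scent of the next page
    falls below the threshold that stands in for inter-patch travel cost.
    """
    visited = 0
    for scent in scent_by_page:
        if scent < travel_cost_threshold:
            break
        visited += 1
    return visited


def is_scent_mismatch(impressions, clicks, min_impressions=1000, ctr_floor=0.01):
    """Flag a page shown often but rarely clicked (assumed cut-off values)."""
    if impressions < min_impressions:
        return False  # too little exposure to judge
    return clicks / impressions < ctr_floor


# Hypothetical scent estimates for successive pages on one site.
site_scent = [0.9, 0.7, 0.4, 0.2, 0.6]
depth_cheap_travel = pages_before_leaving(site_scent, 0.3)
depth_costly_stay = pages_before_leaving(site_scent, 0.5)

# Hypothetical (impressions, clicks) per page.
stats = {"page_a": (10000, 50), "page_b": (10000, 800), "page_c": (20, 0)}
flagged = [page for page, (imp, clk) in stats.items() if is_scent_mismatch(imp, clk)]
```

Raising the threshold shortens the visit, which matches the intuition that users abandon a site sooner when better alternatives are cheap to reach. Note that the fifth page's higher scent is never seen, because the user has already left. In the mismatch check, only `page_a` is flagged: `page_b` earns its impressions, and `page_c` has too few impressions for its click-through rate to mean anything.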