links

A statistical interpretation of term specificity and its application in retrieval staff.city.ac.uk pdf

Establishes a statistical weighting scheme that quantifies term specificity in document retrieval by formalising the inverse relationship between a term's collection frequency and its retrieval value. The paper introduces what is now called Inverse Document Frequency (IDF), calculated as the log of the total number of documents divided by the number of documents containing a term, and argues that rare terms carry disproportionately higher discriminatory power for separating relevant documents from noise. It reports that matching with IDF-weighted terms retrieves more effectively than unweighted term matching. This weighting is the mathematical foundation for the TF-IDF signals used in content relevance scoring, anchor text evaluation, and keyword targeting models.
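
The IDF calculation described above can be sketched in a few lines. This is an illustrative sketch, not the paper's own formulation: the function names, the toy corpus, and the choice of natural log are assumptions made for the example (the base only rescales the weights).

```python
import math
from collections import Counter

def idf(total_docs, docs_with_term):
    """IDF as described above: log(total documents / documents containing the term)."""
    if not 0 < docs_with_term <= total_docs:
        raise ValueError("need 0 < docs_with_term <= total_docs")
    return math.log(total_docs / docs_with_term)  # natural log is an assumption

def idf_table(docs):
    """Map each term to its IDF, given a list of tokenised documents."""
    # Count documents containing each term; set() ignores repeats within a document.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: idf(len(docs), n) for term, n in doc_freq.items()}

# Hypothetical toy corpus: "the" occurs everywhere, "index" in one document only.
corpus = [
    ["the", "retrieval", "model"],
    ["the", "ranking", "model"],
    ["the", "index"],
]
weights = idf_table(corpus)
```

A term that appears in every document ("the") gets a weight of zero, while the rarest term ("index") gets the largest weight, which is the specificity effect the summary describes.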

The Probabilistic Relevance Framework: BM25 and Beyond staff.city.ac.uk pdf

BM25 (Best Match 25) operationalises the Probabilistic Relevance Framework (PRF) by modelling document relevance as a probability estimate derived from three signals: term frequency, inverse document frequency, and document length normalisation. It combines these through a tuneable saturating TF component, controlled by the parameters k1 and b, to score documents against queries. The critical mechanism is the non-linear TF saturation curve: each additional occurrence of a term adds less to the score than the one before, which prevents high-frequency terms from dominating relevance scores. The b parameter normalises document length against the corpus average, so that verbose documents are not rewarded merely for accumulating more term occurrences. BM25 is a computationally efficient baseline with interpretable parameters that outperforms raw TF-IDF by handling term redundancy and document length bias, which makes it the de facto retrieval function for inverted-index architectures where lexical matching must approximate probabilistic relevance without training data or vector embeddings.
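
A minimal sketch of a BM25 scorer shows how k1 and b enter the score. Several details here are assumptions rather than the paper's exact formulation: the function name `bm25_scores` is made up for the example, k1=1.2 and b=0.75 are merely commonly used starting values, and the IDF term uses the non-negative `log(1 + ...)` variant found in some search engines.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenised document against a tokenised query (illustrative sketch)."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs or 1.0  # guard against all-empty docs
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for doc in docs:
        tf = Counter(doc)
        # b blends between no length normalisation (b=0) and full normalisation (b=1).
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            n_t = doc_freq[term]
            # Assumed IDF variant: smoothed and kept non-negative by the "1 +".
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            # Saturating TF: approaches (k1 + 1) as tf grows, never exceeds it.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

# Hypothetical documents: same term count for "x", different lengths.
short_doc = ["x", "y"]
long_doc = ["x", "y", "y", "y", "y", "y"]
normalised = bm25_scores(["x"], [short_doc, long_doc])
unnormalised = bm25_scores(["x"], [short_doc, long_doc], b=0.0)
```

With the default b the shorter document scores higher for the same single occurrence of "x"; with b=0 length is ignored and the two scores coincide. Raising k1 lets repeated occurrences keep contributing for longer before the curve flattens.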