links

The Probabilistic Relevance Framework: BM25 and Beyond (staff.city.ac.uk, PDF)

BM25 (Best Match 25) operationalises the Probabilistic Relevance Framework (PRF). It scores a document against a query by combining three signals: term frequency (TF), inverse document frequency (IDF), and document length normalisation. The resulting score approximates a ranking by probability of relevance rather than computing that probability directly. The critical mechanism is the non-linear TF saturation curve, controlled by k1: each additional occurrence of a term adds less than the previous one, so high-frequency terms cannot dominate the score. The b parameter (between 0 and 1) normalises TF by the ratio of document length to the corpus average, penalising verbose documents that accumulate term counts through length alone. BM25 is therefore a computationally efficient baseline with interpretable parameters, and it typically outperforms raw TF-IDF because it accounts for term redundancy and document length bias. This has made it the de facto retrieval function for inverted-index architectures, where lexical matching must approximate probabilistic relevance without training data or vector embeddings.
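The scoring described above can be illustrated with a minimal sketch. The function name bm25_scores, the defaults k1=1.2 and b=0.75 (commonly quoted starting values, not taken from this note), and the non-negative log(1 + ...) form of IDF are all assumptions made for illustration; published variants differ in these details, and a real system would read term statistics from an inverted index rather than scan every document.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenised document in `docs` against the tokenised `query`.

    Illustrative only: k1/b defaults and the IDF form are assumptions.
    """
    if not docs:
        return []
    n_docs = len(docs)
    # Average document length; fall back to 1.0 if every document is empty.
    avgdl = sum(len(d) for d in docs) / n_docs or 1.0

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalisation: b=0 ignores length, b=1 normalises fully.
        length_norm = 1.0 - b + b * len(d) / avgdl
        score = 0.0
        for term in query:
            f = tf.get(term, 0)
            if f == 0:
                continue
            # Smoothed IDF, kept non-negative by the "1 +" inside the log.
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Saturating TF: approaches (k1 + 1) as f grows, never exceeds it.
            score += idf * f * (k1 + 1.0) / (f + k1 * length_norm)
        scores.append(score)
    return scores


if __name__ == "__main__":
    corpus = [
        ["bm25", "ranking", "function"],
        ["bm25", "bm25", "bm25", "bm25"],
        ["vector", "embeddings"],
    ]
    print(bm25_scores(["bm25"], corpus))
```

With b=0 in this toy corpus, the second document contains the query term four times but scores well under four times the first document, which is the saturation effect. Raising b shifts weight toward documents that are short relative to the corpus average.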