The vector space model represents documents and queries as weighted term vectors in high-dimensional space, enabling similarity computation via cosine measures rather than Boolean exact-match retrieval. Experiments on the SMART system demonstrate that term weighting schemes combining term frequency (TF) with inverse document frequency (IDF) consistently outperform binary indexing, with IDF-weighted vectors producing superior recall-precision tradeoffs across multiple test collections. This mechanism directly powers ranked retrieval systems by scoring documents against queries through continuous similarity values, replacing brittle keyword matching with a scalable, corpus-aware relevance signal that underlies modern inverted index scoring functions including BM25 and learning-to-rank feature generation.
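The TF-IDF weighting and cosine scoring described above can be sketched in a few lines. This is a minimal illustration with a made-up toy corpus and query (not the SMART test collections), using the common IDF variant log(N / df):

```python
import math
from collections import Counter

# Toy corpus; documents and query are illustrative, not from the SMART collections.
docs = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog jumps",
]
tokenized = [d.split() for d in docs]
n = len(tokenized)

# Document frequency: number of documents containing each term.
df = Counter(term for doc in tokenized for term in set(doc))

def vectorize(tokens):
    """Weight each term by TF * IDF, with IDF = log(N / df)."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log(n / df[t]) for t in tf if t in df}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

doc_vecs = [vectorize(toks) for toks in tokenized]
query = vectorize("quick fox".split())

# Ranked retrieval: score every document and sort by descending similarity,
# instead of returning an unordered Boolean match set.
scores = [cosine(query, d) for d in doc_vecs]
```

Note how "the", which occurs in every document, gets IDF = log(3/3) = 0 and so contributes nothing to any score: the corpus-aware weighting automatically discounts uninformative terms.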
Meilisearch’s breakdown of LSI positions it as a foundational retrieval method that uses Singular Value Decomposition (SVD) to reduce high-dimensional term-document matrices into a lower-dimensional "latent space." By decomposing the original matrix into three constituent matrices (U, Σ, and Vᵀ), LSI captures hidden conceptual relationships (e.g., grouping "physician" and "doctor"), thereby addressing the retrieval failures of exact-match keyword systems. While LSI is computationally efficient for small, static datasets, the article highlights that its linear algebraic approach is increasingly superseded by Transformer-based embeddings and vector search, which offer superior scalability and a deeper contextual grasp of polysemy and linguistic nuance in dynamic web environments.
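The SVD truncation and the "physician"/"doctor" grouping effect can be demonstrated concretely. Below is a sketch with a hypothetical 4-term, 3-document count matrix of my own construction (not from the article): a query for "doctor" matches a document containing only "physician" and "hospital", because the shared term "hospital" links the two documents in the latent space.

```python
import numpy as np

# Hypothetical 4-term x 3-document count matrix (rows = terms, cols = docs);
# terms and counts are illustrative, not taken from the article.
terms = ["doctor", "physician", "hospital", "guitar"]
A = np.array([
    [1, 0, 0],   # "doctor" appears only in doc 0
    [0, 1, 0],   # "physician" appears only in doc 1
    [1, 1, 0],   # "hospital" co-occurs with both
    [0, 0, 2],   # "guitar" appears only in doc 2
], dtype=float)

# Decompose A = U @ diag(s) @ Vt, then truncate to rank k.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Documents live as rows of V_k in the k-dimensional latent space.
doc_vecs = Vt_k.T

# Fold the query "doctor" into the same space: q_k = q @ U_k @ diag(1/s_k).
q = np.array([1.0, 0.0, 0.0, 0.0])  # one-hot vector over terms
q_k = (q @ U_k) / s_k

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Doc 1 never contains "doctor", yet it scores as highly as doc 0;
# the unrelated "guitar" document scores near zero.
scores = [cosine(q_k, d) for d in doc_vecs]
```

The choice k = 2 keeps the two strongest concepts (the medical cluster and the guitar document) and discards the dimension that distinguished "doctor" from "physician", which is exactly how LSI trades exact-term precision for conceptual recall.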