semantic-search — paul rollo

What is latent semantic indexing (LSI) and how does it work? meilisearch.com

Meilisearch’s breakdown of LSI positions it as a foundational retrieval method that utilises Singular Value Decomposition (SVD) to reduce high-dimensional term-document matrices into a lower-dimensional "latent space." By decomposing the original matrix into three constituent matrices (U, Σ, and Vᵀ) and keeping only the largest singular values, LSI captures hidden conceptual relationships (e.g., grouping "physician" and "doctor"), thereby addressing the retrieval failures of exact-match keyword systems. While LSI is computationally efficient for small, static datasets, the article highlights that its linear algebraic approach is increasingly superseded by Transformer-based embeddings and vector search, which offer superior scalability and a deeper contextual understanding of polysemy and linguistic nuance in dynamic web environments.
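
The decomposition described above can be sketched in a few lines of NumPy. This is a minimal illustration, not Meilisearch's implementation: the five terms, the three-document count matrix, and the choice of k = 2 are all made-up assumptions, and `cosine` is a helper defined here for the example.

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents).
# The terms and counts are invented for illustration only.
terms = ["physician", "doctor", "hospital", "pitcher", "baseball"]
A = np.array([
    [1, 0, 0],  # physician: appears only in doc 0
    [0, 1, 0],  # doctor:    appears only in doc 1
    [1, 1, 0],  # hospital:  appears in docs 0 and 1
    [0, 0, 1],  # pitcher:   appears only in doc 2
    [0, 0, 1],  # baseball:  appears only in doc 2
], dtype=float)

# Thin SVD: A = U @ diag(s) @ Vt, singular values sorted in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the k largest singular values to form the latent space.
k = 2
term_vecs = U[:, :k] * s[:k]  # one k-dimensional vector per term


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Compare similarity in the raw count space with similarity in the latent space.
print("raw    physician~doctor :", cosine(A[0], A[1]))
print("latent physician~doctor :", cosine(term_vecs[0], term_vecs[1]))
print("latent physician~pitcher:", cosine(term_vecs[0], term_vecs[3]))
```

In this toy corpus "physician" and "doctor" never share a document, so their raw cosine similarity is zero, but both co-occur with "hospital". After truncation to two dimensions their latent vectors point in the same direction, while "pitcher" stays orthogonal to both.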

Introducing the Knowledge Graph: things, not strings blog.google

Google's Knowledge Graph is a structured entity database that maps real-world objects - people, places, organisations, and concepts - to semantically rich attribute sets and inter-entity relationships, replacing string-matched keyword lookup with disambiguated, meaning-based retrieval. The system resolves lexical ambiguity (e.g., "Taj Mahal" as monument vs. musician vs. restaurant) by anchoring queries to canonical entities with unique identifiers, drawing from synthesised sources including Freebase, Wikipedia, and the CIA World Factbook to populate typed properties and relational edges. This shifts ranking and indexing logic from document-to-keyword co-occurrence toward entity-to-entity graph traversal, enabling query expansion, direct answer surfacing, and contextual result clustering without requiring exact-match signals in crawled content.
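
A rough sketch of the entity-anchoring idea, assuming a tiny in-memory store. Everything here is hypothetical: the `ent:` identifiers, the `Entity` fields, the context terms, and the word-overlap scoring are invented for illustration and say nothing about how Google's system is actually built.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    entity_id: str                 # hypothetical canonical identifier
    name: str                      # surface string shared by ambiguous entities
    entity_type: str               # typed entity class
    properties: dict = field(default_factory=dict)
    context_terms: frozenset = frozenset()  # words hinting at this sense


# Illustrative entities only; the identifiers and term lists are assumptions.
ENTITIES = [
    Entity("ent:taj-monument", "Taj Mahal", "Monument",
           {"city": "Agra"}, frozenset({"agra", "marble", "tomb", "visit"})),
    Entity("ent:taj-musician", "Taj Mahal", "Musician",
           {"genre": "blues"}, frozenset({"blues", "album", "guitar"})),
    Entity("ent:taj-restaurant", "Taj Mahal", "Restaurant",
           {}, frozenset({"menu", "curry", "booking"})),
]

# Relational edges as (source, relation, target) triples.
EDGES = [
    ("ent:taj-monument", "located_in", "ent:agra"),
    ("ent:taj-musician", "plays_genre", "ent:blues"),
]


def resolve(mention, context):
    """Pick the candidate entity whose context terms best overlap the query."""
    candidates = [e for e in ENTITIES if e.name.lower() == mention.lower()]
    words = set(context.lower().split())
    return max(candidates, key=lambda e: len(e.context_terms & words), default=None)


def neighbours(entity_id):
    """Follow outgoing edges from an entity (one hop of graph traversal)."""
    return [(rel, dst) for src, rel, dst in EDGES if src == entity_id]


entity = resolve("Taj Mahal", "visit the marble tomb in agra")
print(entity.entity_id, entity.entity_type, neighbours(entity.entity_id))
```

The point of the sketch is the shape of the lookup: the same string resolves to different identifiers depending on context, and once an identifier is chosen, related facts come from edges rather than from keyword matches in documents.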

Could Google passage indexing be leveraging BERT? searchengineland.com

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding arxiv.org

BERT (Bidirectional Encoder Representations from Transformers) pre-trains a deep transformer architecture using masked language modeling (MLM) and next sentence prediction (NSP) on unlabeled text, enabling simultaneous left-and-right context conditioning across all layers rather than the unidirectional or shallow-bidirectional approaches of predecessor models. Fine-tuned BERT established state-of-the-art performance on 11 NLP tasks - including a 7.7-percentage-point absolute improvement on the GLUE score and a 1.5-point F1 gain on SQuAD v1.1 - by learning rich, context-dependent token representations transferable to downstream tasks with minimal task-specific architecture modification. Applied to search, BERT's deep bidirectionality enables query-document semantic matching that captures polysemous terms, long-range syntactic dependencies, and implicit query intent, improving relevance ranking signals beyond keyword co-occurrence and making it deployable as a reranker layer over candidate retrieval sets.
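
To make the masked language modeling objective concrete, here is a minimal sketch of how training inputs could be corrupted. The 15% selection rate and the 80/10/10 split between `[MASK]`, a random token, and the unchanged token follow the BERT paper; the function name `mask_tokens`, whitespace tokenisation, the per-token coin flip, and the tiny vocabulary are simplifying assumptions made for this example.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (inputs, labels) for MLM; labels are None where no prediction is made."""
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < select_prob:
            labels.append(tok)          # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)     # 80%: replace with the mask symbol
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)      # 10%: leave the token unchanged
        else:
            inputs.append(tok)
            labels.append(None)         # position is not part of the loss
    return inputs, labels


tokens = "the doctor saw the patient at the hospital".split()
inputs, labels = mask_tokens(tokens, vocab=tokens, rng=random.Random(0))
print(inputs)
print(labels)
```

Because the prediction target at a masked position can depend on tokens both before and after it, the encoder is free to attend in both directions, which is the bidirectional conditioning the summary refers to.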

What are LSI Keywords and What I Use Instead of Them? seobythesea.com

Bill Slawski’s analysis of "LSI Keywords" identifies them as a persistent SEO industry myth, debunking the notion that Google utilises 1980s-era Latent Semantic Indexing - a method designed for small, static corpora - to rank dynamic web content. The post’s core thesis is that while "LSI" is an obsolete term in modern IR, Google achieves similar semantic goals through Phrase-Based Indexing and Context Vectors, which identify topically related "co-occurring phrases" (e.g., "pitcher’s mound" for a page about "baseball") to verify a document's topical depth. This necessitates a shift from keyword-stuffing synonyms to entity-based content construction, where ranking durability is driven by the presence of predictive, domain-specific terms that mathematically confirm a page's relevance to its primary subject.
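
The co-occurring-phrase idea can be illustrated with a simple count. This is only a sketch under stated assumptions, not the method described in Google's patents: the corpus, the candidate phrases, the `min_share` threshold, and both function names are invented for the example.

```python
from collections import Counter


def related_phrases(corpus, topic, candidates, min_share=0.5):
    """Phrases that appear in at least min_share of the documents mentioning topic."""
    topic_docs = [d.lower() for d in corpus if topic in d.lower()]
    if not topic_docs:
        return set()
    counts = Counter(p for d in topic_docs for p in candidates if p in d)
    return {p for p, c in counts.items() if c / len(topic_docs) >= min_share}


def topical_coverage(page, related):
    """Share of the related phrases that a page actually contains."""
    if not related:
        return 0.0
    text = page.lower()
    return sum(p in text for p in related) / len(related)


# Made-up corpus for illustration.
corpus = [
    "Baseball fans watched a home run sail over the pitcher's mound",
    "The baseball coach raked the pitcher's mound before the game",
    "A home run decided the baseball final",
    "The recipe needs a pitcher of water",
]
candidates = ["pitcher's mound", "home run", "recipe"]

related = related_phrases(corpus, "baseball", candidates)
page = "Our baseball guide explains how the pitcher's mound is maintained"
print(sorted(related), topical_coverage(page, related))
```

A page that mentions "baseball" alongside the phrases that usually accompany it scores higher than one that merely repeats the keyword, which is the intuition behind writing for topical depth rather than for synonyms.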

Efficient Estimation of Word Representations in Vector Space arxiv.org

Google BERT Update - What it Means searchenginejournal.com
