Why BM25 Queries with More Terms Can Be Faster
I discovered something weird about BM25 full-text search: queries with MORE terms can sometimes be FASTER than queries with fewer terms. This broke my mental model of how search works.
What Is BM25?
BM25 (Best Matching 25) is the ranking function behind most modern full-text search engines, including Elasticsearch (via Lucene) and SQLite's FTS5. (Strictly speaking, PostgreSQL's built-in ts_rank uses its own formula rather than BM25, but the index behavior discussed below applies to it all the same.)
It ranks documents by how well they match a query, considering:
- Term frequency: How often does the word appear?
- Inverse document frequency: How rare is the word across all documents?
- Document length: Normalize for long vs short documents
score = Σ IDF(q_i) · (f(q_i, D) · (k_1 + 1)) / (f(q_i, D) + k_1 · (1 - b + b · |D| / avgdl))
Where:
- q_i: Query term
- D: Document
- f(q_i, D): Term frequency in document
- |D|: Document length
- avgdl: Average document length
- k_1, b: Tuning parameters
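The formula above can be sketched directly in Python. This is a minimal illustration, not a production scorer: the smoothed IDF shown is the Lucene-style variant, and all names are invented for this example.

```python
import math

def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avgdl,
               k1=1.2, b=0.75):
    """Score one document against a query with BM25.

    query_terms: list of query terms
    doc_terms:   list of tokens in the document
    doc_freqs:   dict mapping term -> number of documents containing it
    n_docs:      total documents in the corpus
    avgdl:       average document length
    """
    score = 0.0
    dl = len(doc_terms)  # |D|
    for q in query_terms:
        f = doc_terms.count(q)  # f(q_i, D)
        if f == 0:
            continue
        df = doc_freqs.get(q, 0)
        # Smoothed IDF (Lucene-style): rare terms get large weights
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * dl / avgdl))
    return score
```

With k1 = 1.2 and b = 0.75 (common defaults), a single occurrence of a rare term in an average-length document contributes far more than one occurrence of a common term, which is exactly why rare terms make good anchors later on.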
The Counterintuitive Behavior
Test setup: 1M documents, PostgreSQL full-text search.
-- One-term query (common word)
SELECT * FROM documents
WHERE tsv @@ to_tsquery('database')
ORDER BY ts_rank(tsv, to_tsquery('database')) DESC
LIMIT 10;
-- Time: 145ms
-- Three-term query
SELECT * FROM documents
WHERE tsv @@ to_tsquery('database & optimization & indexing')
ORDER BY ts_rank(tsv, to_tsquery('database & optimization & indexing')) DESC
LIMIT 10;
-- Time: 38ms
The three-term query is almost 4x faster than the one-term query. Why?
The Explanation
1. Posting List Intersection
Full-text search uses inverted indexes. Each word maps to a list of documents containing it:
"database" -> [doc1, doc2, doc3, doc4, ..., doc50000] (50k docs)
"optimization" -> [doc1, doc5, doc9, doc12, ..., doc8000] (8k docs)
"indexing" -> [doc2, doc5, doc8, doc9, ..., doc5000] (5k docs)
For a multi-term query with AND:
- Scan the SHORTEST posting list first
- Check if those docs are in other lists
- Only score documents in the intersection
"indexing" (5k docs)
∩ "optimization" (8k docs)
∩ "database" (50k docs)
= ~500 docs to score
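The intersection strategy above can be sketched in a few lines of Python. This is illustrative only: real engines use skip lists or galloping search over compressed postings rather than plain binary search, and all names here are invented.

```python
from bisect import bisect_left

def intersect_postings(postings):
    """Intersect sorted posting lists, shortest first.

    postings: list of sorted doc-ID lists, one per query term.
    Returns the doc IDs present in every list.
    """
    postings = sorted(postings, key=len)  # anchor on the rarest term
    candidates = postings[0]
    for plist in postings[1:]:
        surviving = []
        for doc in candidates:
            i = bisect_left(plist, doc)   # binary search: skip ahead
            if i < len(plist) and plist[i] == doc:
                surviving.append(doc)
        candidates = surviving
        if not candidates:
            break                          # empty intersection: done
    return candidates
```

Because the candidate set can only shrink, the cost is driven by the shortest list, not the longest: adding a rare term makes the whole query cheaper.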
2. Early Termination
When you need only top-N results, you can stop early:
1. Score first batch of candidate docs
2. Keep top 10
3. When next doc's max possible score < 10th place score: STOP
With more specific queries (more terms), you reach this threshold faster.
3. Score Bounds
BM25 scores have upper bounds based on IDF values. If you’re looking for top 10 and you’ve already found 10 docs with score > 5.0, you can skip checking docs where max possible score is < 5.0.
More query terms = higher threshold = more docs can be skipped.
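Early termination and score bounds combine into a loop like this minimal Python sketch. It assumes candidates arrive in descending order of their maximum possible score, as an impact-ordered index would provide; everything here is invented for illustration.

```python
import heapq

def top_n(candidates, n):
    """Collect the top-n scored documents with bound-based early exit.

    candidates: iterable of (doc_id, score, max_possible_score) tuples,
                in DESCENDING max_possible_score order.
    Returns up to n (score, doc_id) pairs, best first.
    """
    heap = []  # min-heap: heap[0] is the weakest doc currently in the top-n
    for doc_id, score, max_possible in candidates:
        # Once the heap is full, no later doc can beat the current
        # n-th place score, because bounds only decrease from here.
        if len(heap) == n and max_possible <= heap[0][0]:
            break  # early termination: stop scoring entirely
        if len(heap) < n:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)
```

More query terms raise the scores of the true top-N, which raises the threshold `heap[0][0]`, which in turn lets the loop stop sooner.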
When This Breaks Down
This speedup requires:
- Terms with different frequencies: One rare term to anchor on
- Top-N queries: Not scanning all results
- Good index: Posting lists sorted by doc ID or score
It breaks when:
- All terms are equally common (no good anchor)
- You need ALL results (can’t early-terminate)
- Terms are very rare (tiny posting lists, overhead dominates)
Practical Applications
1. Query Expansion
Add related terms to make queries MORE specific:
-- Slow
SELECT * FROM docs WHERE tsv @@ 'python';
-- Fast
SELECT * FROM docs WHERE tsv @@ 'python & programming & tutorial';
2. Negative Terms
Exclude common terms to narrow search:
SELECT * FROM docs
WHERE tsv @@ 'python & programming & !introduction'
LIMIT 10;
The !introduction term excludes about 30% of candidate documents, making the intersection smaller.
3. Phrase Queries
Phrases are even better:
-- "web server" as a phrase (websearch_to_tsquery understands quoted phrases;
-- equivalently: to_tsquery('web <-> server'))
SELECT * FROM docs WHERE tsv @@ websearch_to_tsquery('"web server"');
Phrase matching requires position checks, but operates on a tiny candidate set.
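Those position checks can be sketched in Python. This is a hypothetical illustration: `positions_by_term` is an invented structure holding each phrase term's token positions within a single candidate document.

```python
def phrase_match(positions_by_term):
    """Check whether phrase terms occur consecutively in one document.

    positions_by_term: list of sorted position lists, one per phrase
    term, in phrase order (e.g. positions of "web", then of "server").
    Returns True if the terms line up at consecutive positions.
    """
    # Candidate start positions of the phrase = positions of term 0
    starts = set(positions_by_term[0])
    for offset, positions in enumerate(positions_by_term[1:], start=1):
        # Keep only starts where term i also appears at start + i
        starts = {p - offset for p in positions if p - offset in starts}
        if not starts:
            return False
    return True
```

The check is linear in the number of positions, but it only ever runs on the small intersection of documents containing all phrase terms, which is why phrase queries stay fast.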
Benchmarks
Real query from our logs (1M document corpus):
Query: "database"
Matches: 89,432 docs
Time: 167ms
Query: "database optimization"
Matches: 3,421 docs
Time: 54ms
Query: "database optimization postgresql"
Matches: 412 docs
Time: 18ms
Query: "database optimization postgresql vacuum"
Matches: 47 docs
Time: 9ms
Each added term:
- Cuts the candidate set by roughly an order of magnitude
- Cuts query time by roughly 2-3x
Implementation Notes
This works because modern search engines:
- Sorted posting lists: Can skip ahead efficiently
- Score bounds: Calculate max possible scores without full evaluation
- Early termination: Stop when top-N found
- Heap-based ranking: Keep only top-N in memory
Lucene's top-N collector loop looks roughly like this (heavily simplified):
int doc;
while ((doc = scorer.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
    float score = scorer.score();
    if (score <= minCompetitiveScore) {
        continue;                        // can't make the top-N: skip it
    }
    pq.insertWithOverflow(new ScoreDoc(doc, score));
    if (pq.size() == numHits) {
        // The weakest doc in the queue sets the bar for every later doc
        minCompetitiveScore = pq.top().score;
    }
}
Lessons
- More constraints = faster: Counterintuitively true for search
- Selectivity matters: One rare term beats many common terms
- Index structure enables optimization: Sorted posting lists are key
- Top-N is special: Different complexity than “scan all”
Try It Yourself
-- PostgreSQL example
CREATE TABLE docs (id INT, content TEXT);
CREATE INDEX idx_fts ON docs USING gin(to_tsvector('english', content));
-- Benchmark
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM docs
WHERE to_tsvector('english', content) @@ to_tsquery('term1 & term2 & term3')
LIMIT 10;
Watch how adding terms changes the execution plan.
Search is full of counterintuitive optimizations. This is just one of them.