🔎BM25-Based Searching: A Developer’s Comprehensive Guide

📌 Introduction: Why BM25 Matters

Imagine you type “best Python tutorials” into a search engine. Millions of web pages match your query—but how does the engine know which pages are most relevant?

At the core of modern search ranking lies Information Retrieval (IR). One of the most robust and widely-used ranking algorithms in lexical search is BM25 (Best Matching 25), part of the Okapi probabilistic retrieval family.

What you’ll learn in this article:

  • How BM25 ranks documents and handles term frequency and document length.
  • Differences between BM25 and TF-IDF.
  • Practical Python implementation with Rank-BM25.
  • BM25 variants, optimizations, and hybrid search integration.
  • Applications, advantages, and limitations in real-world systems.

By the end, you’ll be ready to implement BM25 in search systems and combine it with modern retrieval methods.

1️⃣ What is BM25?

BM25 is a ranking function estimating how relevant a document D is for a query Q.

The following diagram illustrates the BM25 (Best Match 25) ranking algorithm pipeline, which is used to score documents against a search query.

1.1 Query

The starting point—your search terms that need to be matched against documents in a corpus.

1.2 TF Adjustment (Term Frequency)

This stage calculates how often query terms appear in each document, but with a saturation function to prevent overly long documents from dominating. BM25 uses:

TF_adjusted = (f × (k₁ + 1)) / (f + k₁ × (1 – b + b × (|D| / avgDL)))

Where:

  • f = raw term frequency in the document
  • k₁ = controls term frequency saturation (typically 1.2-2.0)
  • b = controls length normalization influence (typically 0.75)
  • |D| = document length
  • avgDL = average document length in corpus

1.3 IDF Weighting (Inverse Document Frequency)

This assigns importance to terms based on their rarity across the corpus. Common words get lower weights, rare words get higher weights:

IDF = log((N – n + 0.5) / (n + 0.5))

Where:

  • N = total number of documents
  • n = number of documents containing the term

1.4 Length Normalization

This is actually embedded in the TF adjustment (via the b parameter), but conceptually it prevents longer documents from having unfair advantages simply due to containing more words.

1.5 Score

The final BM25 score is computed by summing the contributions of all query terms:

BM25(D,Q) = Σ (IDF(qᵢ) × TF_adjusted(qᵢ, D))

This produces a relevance score for ranking documents against the query.

2️⃣ BM25 vs TF-IDF

BM25 and TF-IDF are both popular algorithms for ranking documents in information retrieval, but they approach relevance differently. TF-IDF scores a document based on how frequently a query term appears (term frequency, TF) and how rare the term is across all documents (inverse document frequency, IDF). However, it treats term frequency linearly and doesn’t account for document length, which can skew results. BM25, on the other hand, builds on TF-IDF by introducing a saturation effect for term frequency—so repeating a word excessively doesn’t overly boost relevance—and normalizes for document length, making it more effective for longer texts. Overall, BM25 is generally considered more robust and accurate in modern search engines compared to classic TF-IDF.

FeatureTF-IDFBM25
Term frequencyLinearSaturated (non-linear)
Document length normalizationOptionalBuilt-in
IDF smoothingRarelySmoothed with 0.5
Tunable parametersNonek1​, b
Practical performanceGood for small datasetsExcellent for large corpora

3️ Practical Implementation in Python

Required library:

pip install nltk rank-bm25

Python code example:

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus.reader.wordnet import NOUN, VERB, ADJ, ADV
from rank_bm25 import BM25Plus

# ----------------------------
# Download NLTK resources
# ----------------------------
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("averaged_perceptron_tagger_eng", quiet=True)

# ----------------------------
# Preprocessing setup
# ----------------------------
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(tag: str):
    if tag.startswith("J"):
        return ADJ
    elif tag.startswith("V"):
        return VERB
    elif tag.startswith("N"):
        return NOUN
    elif tag.startswith("R"):
        return ADV
    else:
        return NOUN

def preprocess(text: str):
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalnum()]
    tokens = [t for t in tokens if t not in stop_words]
    pos_tags = nltk.pos_tag(tokens)
    return [lemmatizer.lemmatize(t, get_wordnet_pos(pos)) for t, pos in pos_tags]

# ----------------------------
# Example corpus
# ----------------------------
corpus = [
    "Python is a popular programming language for data science and AI.",
    "Machine learning and deep learning are subsets of artificial intelligence.",
    "The fox is quick and brown, jumping over a lazy dog.",
    "Developers use Python for natural language processing and search engines.",
    "Dogs are loyal animals, often considered man's best friend."
]

tokenized_corpus = [preprocess(doc) for doc in corpus]

# ----------------------------
# Initialize BM25Plus with parameter tuning
# ----------------------------
k1 = 1.5  # term frequency saturation
b = 0.75  # length normalization
bm25 = BM25Plus(tokenized_corpus, k1=k1, b=b)

# ----------------------------
# Query
# ----------------------------
query = "python search ai"
tokenized_query = preprocess(query)

# ----------------------------
# Compute scores
# ----------------------------
scores = bm25.get_scores(tokenized_query)

# ----------------------------
# Rank documents
# ----------------------------
ranked = sorted(zip(scores, corpus), key=lambda x: x[0], reverse=True)

print(f"Query: {query}\n")
print("Ranked Results with k1 =", k1, "and b =", b)
for score, doc in ranked:
    print(f"{score:.4f} -> {doc}")

Output:

(env) D:\projects\ranjankumar.in\posts\bm25>python bm25.py
Query: python search ai

Ranked Results with k1 = 1.5 and b = 0.75
7.6091 -> Python is a popular programming language for data science and AI.
7.4349 -> Developers use Python for natural language processing and search engines.
4.6821 -> Machine learning and deep learning are subsets of artificial intelligence.
4.6821 -> The fox is quick and brown, jumping over a lazy dog.
4.6821 -> Dogs are loyal animals, often considered man's best friend.

✅ What this script demonstrates:

  1. Preprocessing pipeline:
    • Converts text to lowercase
    • Removes punctuation
    • Removes stopwords
    • Lemmatizes using POS tags
  2. BM25Plus scoring:
    • Assigns higher scores to documents that match query tokens
    • Avoids negative scores (common in small corpora)
  3. Ranking documents:
    • Displays the most relevant documents first

✅ Parameter Tuning

BM25 has two main tunable parameters:

  1. k1 – controls term frequency saturation
    • Higher k1 → repeated terms matter more
    • Typical range: 1.2 – 2.0
  2. b – controls document length normalization
    • b=1 → full normalization (long docs penalized)
    • b=0 → no normalization (like TF only)
    • Typical range: 0.5 – 0.8

We add these as arguments when initializing BM25Plus and make it flexible to tune per corpus.

How to tune these parameters?

  • Short documents (tweets, messages): lower b → reduces length normalization
  • Long documents (articles, reports): higher b → penalizes very long docs
  • k1: adjust depending on whether repeated terms should contribute more

Example experimentation:

k1bObservations
1.20.5Short docs weighted less by term repetition
1.50.75Default, works well for medium-length docs
2.00.8Long documents get penalized less, repeated terms matter more

4️⃣ Integrating BM25 with Embeddings (Hybrid Search)

Problem:
BM25 is purely lexical — it cannot capture semantic similarity. Two documents with different words but same meaning (synonyms, paraphrases) are missed.

Solution:
Combine BM25 with dense vector embeddings (from BERT, SentenceTransformers, etc.):

  • BM25 → captures exact matches
  • Embeddings → captures semantic matches
  • Final score → weighted combination of BM25 + embedding similarity

Benefits:

  • Achieves high recall + semantic understanding
  • Often called hybrid retrieval
  • Works well for question-answering, document search, recommendation systems

Python Sketch:

bm25_scores = bm25.get_scores(tokenized_query)
embedding_scores = compute_embedding_scores(query, corpus)  # cosine similarity
final_scores = 0.7 * bm25_scores + 0.3 * embedding_scores

Other Optimizations:

  1. Query Expansion:
    • Expand queries with synonyms or related terms to increase recall.
    • e.g., "AI" → "artificial intelligence"
  2. Stopword & Lemmatization Optimization:
    • Remove or retain stopwords depending on corpus.
    • Lemmatization reduces word form mismatch.
  3. Weighted BM25:
    • Assign weights to fields (title, body, tags) for more structured search.
    • e.g., score = 2*title_score + body_score

5️⃣ Applications of BM25

  • Search engines: Elasticsearch, Solr, Google-like search
  • QA systems: Ranking candidate documents for neural models
  • Recommendation engines: Find relevant items from textual metadata
  • Legal / academic search: Efficient retrieval from large corpora

Note: E-commerce search often combines BM25 + embeddings to improve ranking of product descriptions and reviews.

6️ Advantages and Limitations

✅ Advantages

  • Simple and interpretable
  • Efficient on large corpora
  • Strong baseline for retrieval tasks
  • Tunable parameters allow corpus-specific optimization

❌ Limitations

  • Only handles exact lexical matches
  • Struggles with synonyms or paraphrased queries
  • Modern neural retrieval can outperform in semantic tasks
  • Hybrid BM25 + embeddings often needed for semantic search

🏁 Conclusion: Roadmap for Developers

  1. Start with BM25 for document ranking.
  2. Experiment with BM25+ or BM25L for large corpora.
  3. Combine with embeddings for hybrid search.
  4. Tune k1 and b based on your corpus.
  5. Explore neural ranking models (BERT-based, etc.) for semantic similarity.

BM25 remains the gold standard for lexical retrieval, balancing interpretability, efficiency, and performance.

📚 References

1 Comment

Leave a Comment

Your email address will not be published. Required fields are marked *