đ Introduction: Why BM25 Matters
Imagine you type âbest Python tutorialsâ into a search engine. Millions of web pages match your queryâbut how does the engine know which pages are most relevant?
At the core of modern search ranking lies Information Retrieval (IR). One of the most robust and widely-used ranking algorithms in lexical search is BM25 (Best Matching 25), part of the Okapi probabilistic retrieval family.
What youâll learn in this article:
- How BM25 ranks documents and handles term frequency and document length.
- Differences between BM25 and TF-IDF.
- Practical Python implementation with Rank-BM25.
- BM25 variants, optimizations, and hybrid search integration.
- Applications, advantages, and limitations in real-world systems.
By the end, youâll be ready to implement BM25 in search systems and combine it with modern retrieval methods.
1ď¸âŁ What is BM25?
BM25 is a ranking function estimating how relevant a document D is for a query Q.
The following diagram illustrates the BM25 (Best Match 25) ranking algorithm pipeline, which is used to score documents against a search query.

1.1 Query
The starting pointâyour search terms that need to be matched against documents in a corpus.
1.2 TF Adjustment (Term Frequency)
This stage calculates how often query terms appear in each document, but with a saturation function to prevent overly long documents from dominating. BM25 uses:
TF_adjusted = (f Ă (kâ + 1)) / (f + kâ Ă (1 – b + b Ă (|D| / avgDL)))
Where:
- f = raw term frequency in the document
- kâ = controls term frequency saturation (typically 1.2-2.0)
- b = controls length normalization influence (typically 0.75)
- |D| = document length
- avgDL = average document length in corpus
1.3 IDF Weighting (Inverse Document Frequency)
This assigns importance to terms based on their rarity across the corpus. Common words get lower weights, rare words get higher weights:
IDF = log((N – n + 0.5) / (n + 0.5))
Where:
- N = total number of documents
- n = number of documents containing the term
1.4 Length Normalization
This is actually embedded in the TF adjustment (via the b parameter), but conceptually it prevents longer documents from having unfair advantages simply due to containing more words.
1.5 Score
The final BM25 score is computed by summing the contributions of all query terms:
BM25(D,Q) = Σ (IDF(qᾢ) à TF_adjusted(qᾢ, D))
This produces a relevance score for ranking documents against the query.
2ď¸âŁ BM25 vs TF-IDF
BM25 and TF-IDF are both popular algorithms for ranking documents in information retrieval, but they approach relevance differently. TF-IDF scores a document based on how frequently a query term appears (term frequency, TF) and how rare the term is across all documents (inverse document frequency, IDF). However, it treats term frequency linearly and doesnât account for document length, which can skew results. BM25, on the other hand, builds on TF-IDF by introducing a saturation effect for term frequencyâso repeating a word excessively doesnât overly boost relevanceâand normalizes for document length, making it more effective for longer texts. Overall, BM25 is generally considered more robust and accurate in modern search engines compared to classic TF-IDF.
Feature | TF-IDF | BM25 |
---|---|---|
Term frequency | Linear | Saturated (non-linear) |
Document length normalization | Optional | Built-in |
IDF smoothing | Rarely | Smoothed with 0.5 |
Tunable parameters | None | k1â, b |
Practical performance | Good for small datasets | Excellent for large corpora |
3ď¸ Practical Implementation in Python
Required library:
pip install nltk rank-bm25
Python code example:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus.reader.wordnet import NOUN, VERB, ADJ, ADV
from rank_bm25 import BM25Plus
# ----------------------------
# Download NLTK resources
# ----------------------------
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("averaged_perceptron_tagger_eng", quiet=True)
# ----------------------------
# Preprocessing setup
# ----------------------------
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
def get_wordnet_pos(tag: str):
if tag.startswith("J"):
return ADJ
elif tag.startswith("V"):
return VERB
elif tag.startswith("N"):
return NOUN
elif tag.startswith("R"):
return ADV
else:
return NOUN
def preprocess(text: str):
tokens = word_tokenize(text.lower())
tokens = [t for t in tokens if t.isalnum()]
tokens = [t for t in tokens if t not in stop_words]
pos_tags = nltk.pos_tag(tokens)
return [lemmatizer.lemmatize(t, get_wordnet_pos(pos)) for t, pos in pos_tags]
# ----------------------------
# Example corpus
# ----------------------------
corpus = [
"Python is a popular programming language for data science and AI.",
"Machine learning and deep learning are subsets of artificial intelligence.",
"The fox is quick and brown, jumping over a lazy dog.",
"Developers use Python for natural language processing and search engines.",
"Dogs are loyal animals, often considered man's best friend."
]
tokenized_corpus = [preprocess(doc) for doc in corpus]
# ----------------------------
# Initialize BM25Plus with parameter tuning
# ----------------------------
k1 = 1.5 # term frequency saturation
b = 0.75 # length normalization
bm25 = BM25Plus(tokenized_corpus, k1=k1, b=b)
# ----------------------------
# Query
# ----------------------------
query = "python search ai"
tokenized_query = preprocess(query)
# ----------------------------
# Compute scores
# ----------------------------
scores = bm25.get_scores(tokenized_query)
# ----------------------------
# Rank documents
# ----------------------------
ranked = sorted(zip(scores, corpus), key=lambda x: x[0], reverse=True)
print(f"Query: {query}\n")
print("Ranked Results with k1 =", k1, "and b =", b)
for score, doc in ranked:
print(f"{score:.4f} -> {doc}")
Output:
(env) D:\projects\ranjankumar.in\posts\bm25>python bm25.py
Query: python search ai
Ranked Results with k1 = 1.5 and b = 0.75
7.6091 -> Python is a popular programming language for data science and AI.
7.4349 -> Developers use Python for natural language processing and search engines.
4.6821 -> Machine learning and deep learning are subsets of artificial intelligence.
4.6821 -> The fox is quick and brown, jumping over a lazy dog.
4.6821 -> Dogs are loyal animals, often considered man's best friend.
â What this script demonstrates:
- Preprocessing pipeline:
- Converts text to lowercase
- Removes punctuation
- Removes stopwords
- Lemmatizes using POS tags
- BM25Plus scoring:
- Assigns higher scores to documents that match query tokens
- Avoids negative scores (common in small corpora)
- Ranking documents:
- Displays the most relevant documents first
â Parameter Tuning
BM25 has two main tunable parameters:
k1
â controls term frequency saturation- Higher
k1
â repeated terms matter more - Typical range:
1.2 â 2.0
- Higher
b
â controls document length normalizationb=1
â full normalization (long docs penalized)b=0
â no normalization (like TF only)- Typical range:
0.5 â 0.8
We add these as arguments when initializing BM25Plus
and make it flexible to tune per corpus.
How to tune these parameters?
- Short documents (tweets, messages): lower
b
â reduces length normalization - Long documents (articles, reports): higher
b
â penalizes very long docs - k1: adjust depending on whether repeated terms should contribute more
Example experimentation:
k1 | b | Observations |
---|---|---|
1.2 | 0.5 | Short docs weighted less by term repetition |
1.5 | 0.75 | Default, works well for medium-length docs |
2.0 | 0.8 | Long documents get penalized less, repeated terms matter more |
4ď¸âŁ Integrating BM25 with Embeddings (Hybrid Search)
Problem:
BM25 is purely lexical â it cannot capture semantic similarity. Two documents with different words but same meaning (synonyms, paraphrases) are missed.
Solution:
Combine BM25 with dense vector embeddings (from BERT, SentenceTransformers, etc.):
- BM25 â captures exact matches
- Embeddings â captures semantic matches
- Final score â weighted combination of BM25 + embedding similarity
Benefits:
- Achieves high recall + semantic understanding
- Often called hybrid retrieval
- Works well for question-answering, document search, recommendation systems
Python Sketch:
bm25_scores = bm25.get_scores(tokenized_query)
embedding_scores = compute_embedding_scores(query, corpus) # cosine similarity
final_scores = 0.7 * bm25_scores + 0.3 * embedding_scores
Other Optimizations:
- Query Expansion:
- Expand queries with synonyms or related terms to increase recall.
- e.g.,
"AI"
â"artificial intelligence"
- Stopword & Lemmatization Optimization:
- Remove or retain stopwords depending on corpus.
- Lemmatization reduces word form mismatch.
- Weighted BM25:
- Assign weights to fields (title, body, tags) for more structured search.
- e.g.,
score = 2*title_score + body_score
5ď¸âŁ Applications of BM25
- Search engines: Elasticsearch, Solr, Google-like search
- QA systems: Ranking candidate documents for neural models
- Recommendation engines: Find relevant items from textual metadata
- Legal / academic search: Efficient retrieval from large corpora
Note: E-commerce search often combines BM25 + embeddings to improve ranking of product descriptions and reviews.
6ď¸ Advantages and Limitations
â Advantages
- Simple and interpretable
- Efficient on large corpora
- Strong baseline for retrieval tasks
- Tunable parameters allow corpus-specific optimization
â Limitations
- Only handles exact lexical matches
- Struggles with synonyms or paraphrased queries
- Modern neural retrieval can outperform in semantic tasks
- Hybrid BM25 + embeddings often needed for semantic search
đ Conclusion: Roadmap for Developers
- Start with BM25 for document ranking.
- Experiment with BM25+ or BM25L for large corpora.
- Combine with embeddings for hybrid search.
- Tune k1 and b based on your corpus.
- Explore neural ranking models (BERT-based, etc.) for semantic similarity.
BM25 remains the gold standard for lexical retrieval, balancing interpretability, efficiency, and performance.
đ References
- Robertson, S., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in IR.
- Manning, C., Raghavan, P., & SchĂźtze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
- Elasticsearch Documentation: https://www.elastic.co/guide/en/elasticsearch
- Rank-BM25 Python Library: https://github.com/dorianbrown/rank_bm25
1 Comment