Building Hybrid Search That Actually Works: BM25 + Dense Retrieval + Cross-Encoders

Most hybrid search tutorials will tell you to "just combine BM25 and dense retrieval" and you'll magically get better results. They'll show you a slider for adjusting weights, some Docker commands, and call it a day.

Here's what they won't tell you: hybrid search is not a silver bullet. It's a set of engineering tradeoffs where every decision—fusion algorithm, weight balance, reranking strategy—depends entirely on your data and queries. Get it wrong and you'll make things worse than using BM25 alone.

I built a legal document search system to figure out what actually works. Not for benchmarks, not for a demo, but for a system where wrong answers have consequences. This article is about what I learned: when hybrid search helps, when it hurts, and how to actually implement it without the marketing fluff.

We'll build a complete system with BM25 (sparse retrieval), dense vector embeddings (bi-encoders), and cross-encoder reranking. But more importantly, we'll talk about why each piece exists, when you'd skip it, and how to debug when things break.

The Problem With Search (And Why Hybrid Matters)

Let me show you a real failure mode. Our legal corpus contains this document:

"Section 420 of the Indian Penal Code: Whoever cheats and thereby dishonestly induces any person to deliver any property..."

User query: "What's the law against cheating?"

BM25-only search:

Finds: Section 420 (exact match on "cheating")
Misses: Article 19 about deception (no word "cheating")
Misses: Fraud Prevention Act (says "fraudulent" not "cheating")

BM25 is precise but brittle. No shared keywords = no results.

Dense retrieval (bi-encoder embeddings):

Finds: Section 420, Article 19, Fraud Prevention Act (all semantically similar)
Also finds: Contract law about breach of trust (semantically similar but wrong context)
Ranks them unpredictably (similarity scores don't always reflect legal relevance)

Dense retrieval casts a wide net but lacks precision. Everything vaguely similar gets retrieved.

Why this matters in legal AI:

Legal practitioners use specific terminology ("Section 420 IPC") that must match exactly. But they also ask natural language questions ("what happens if someone lies to get money?") that require semantic understanding.

You need both. BM25 (sparse retrieval) ensures statutory references aren't dropped. Dense embeddings capture semantic meaning for natural language queries. Neither is sufficient alone.

That's the theory. The practice is messier.

What Hybrid Search Actually Does

Hybrid search runs BM25 (sparse retrieval) and dense retrieval (bi-encoder embeddings) in parallel, then combines their ranked lists. The challenge is that BM25 scores and embedding similarity scores live in completely different numerical spaces.

BM25 scores (sparse vectors):

Unbounded values that vary based on document length and term frequency
I've seen scores from 0.1 to 50+ in the same result set
Based on exact keyword overlap and statistical rarity (IDF)

Dense embedding similarity (cosine similarity):

Typically ranges from 0 to 1 (or -1 to 1 depending on normalization)
Most scores clustered between 0.6 and 0.9
Based on semantic meaning in high-dimensional space

You can't just add these numbers together. You need a fusion algorithm.

Reciprocal Rank Fusion (RRF)

RRF is the most common approach because it's score-agnostic—it only cares about ranking positions, not actual scores.

code

def rrf_score(doc_id, bm25_rank, dense_rank, k=60):    """    k=60 is the standard constant from research papers    Rank positions start at 1 (1st place = rank 1)    """    score = 0    if doc_id in bm25_results:        score += 1 / (k + bm25_rank)    if doc_id in dense_results:        score += 1 / (k + dense_rank)    return score

For a document that ranks 1st in BM25 and 3rd in dense retrieval:

BM25 contribution: 1/(60+1) = 0.0164
Dense retrieval contribution: 1/(60+3) = 0.0159
Total RRF score: 0.0323

Why k=60? Research shows it's a good default. Lower k (like k=10) amplifies top-ranked differences. Higher k (like k=100) flattens the curve. I've tested k values from 20 to 100 and 60 consistently works well across different datasets.

RRF advantages:

No score normalization needed
Robust to outliers (a BM25 score of 50 vs 5 doesn't matter, only rank matters)
Works out of the box without tuning

RRF disadvantages:

Loses information (two documents with BM25 scores of 10.0 and 9.9 are treated identically if both rank 1st)
Can't weight BM25 vs dense retrieval differently
Sometimes you DO want to care about score magnitudes

Weighted Score Fusion

The alternative is normalizing scores and using weighted combination:

code

def normalize_scores(scores):    """Min-max normalization to [0, 1]"""    if not scores:        return {}    min_score = min(scores.values())    max_score = max(scores.values())    if max_score == min_score:        return {k: 1.0 for k in scores}    return {        k: (v - min_score) / (max_score - min_score)         for k, v in scores.items()    }def weighted_fusion(bm25_scores, dense_scores, alpha=0.5):    """    alpha: weight for dense retrieval (0 to 1)    1-alpha: weight for BM25 (sparse retrieval)    alpha=0.5 means equal weighting    """    # Normalize both score sets to [0, 1]    norm_bm25 = normalize_scores(bm25_scores)    norm_dense = normalize_scores(dense_scores)        # Combine    all_docs = set(norm_bm25.keys()) | set(norm_dense.keys())    fused = {}    for doc_id in all_docs:        bm25_score = norm_bm25.get(doc_id, 0)        dense_score = norm_dense.get(doc_id, 0)        fused[doc_id] = alpha * dense_score + (1 - alpha) * bm25_score        return fused

This gives you control: alpha=0.7 emphasizes dense retrieval, alpha=0.3 emphasizes BM25.

When to use each:

Use RRF when:

Starting out (it just works)
Score distributions are unpredictable
You want something that works across different corpora without retuning

Use weighted fusion when:

You know one method should dominate (e.g., alpha=0.3 for BM25-heavy legal search)
Score magnitudes matter (e.g., a BM25 score of 40 really should outweigh a score of 2)
You're willing to tune weights for your specific data

In our legal search, I use weighted fusion with alpha=0.4 (60% BM25, 40% dense) because exact statutory references matter more than semantic similarity. Your domain might want alpha=0.6 or 0.7.

The Code: Hybrid Search Implementation

Here's the actual working system, not pseudocode:

code

from rank_bm25 import BM25Okapifrom sentence_transformers import SentenceTransformerimport numpy as npfrom typing import List, Dict, Tupleclass HybridSearch:    def __init__(self, alpha=0.5, fusion_method='rrf'):        """        alpha: weight for dense retrieval (0=BM25 only, 1=dense only)        fusion_method: 'rrf' or 'weighted'        """        self.alpha = alpha        self.fusion_method = fusion_method        # Bi-encoder for dense embeddings        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')        self.documents = []        self.bm25 = None        self.doc_embeddings = None            def index(self, documents: List[str]):        """Index documents for both sparse (BM25) and dense retrieval"""        self.documents = documents                # BM25 index (sparse vectors)        tokenized_docs = [doc.lower().split() for doc in documents]        self.bm25 = BM25Okapi(tokenized_docs)                # Dense vector index        print("Encoding documents for dense retrieval... (this takes a minute)")        self.doc_embeddings = self.encoder.encode(            documents,             show_progress_bar=True,            convert_to_numpy=True        )            def search(self, query: str, top_k=10) -> List[Tuple[int, float, str]]:        """        Returns: List of (doc_index, score, document_text)        """        # BM25 search (sparse retrieval)        bm25_scores = self.bm25.get_scores(query.lower().split())                # Dense retrieval (cosine similarity with bi-encoder embeddings)        query_embedding = self.encoder.encode(query, convert_to_numpy=True)        dense_scores = np.dot(self.doc_embeddings, query_embedding)                # Fusion        if self.fusion_method == 'rrf':            final_scores = self._rrf_fusion(bm25_scores, dense_scores)        else:            final_scores = self._weighted_fusion(bm25_scores, dense_scores)                # Sort and return top-k        ranked_docs = sorted(            enumerate(final_scores),            key=lambda x: x[1],            reverse=True        )[:top_k]                return [            (idx, score, self.documents[idx])             for idx, score in ranked_docs        ]        def _rrf_fusion(self, bm25_scores, dense_scores, k=60):        """Reciprocal Rank Fusion"""        # Get rankings (index in sorted order)        bm25_ranks = np.argsort(-bm25_scores)  # Descending        dense_ranks = np.argsort(-dense_scores)                n_docs = len(self.documents)        rrf_scores = np.zeros(n_docs)                for rank, doc_idx in enumerate(bm25_ranks, start=1):            rrf_scores[doc_idx] += 1 / (k + rank)                    for rank, doc_idx in enumerate(dense_ranks, start=1):            rrf_scores[doc_idx] += 1 / (k + rank)                    return rrf_scores        def _weighted_fusion(self, bm25_scores, dense_scores):        """Weighted score combination with normalization"""        # Min-max normalization        def normalize(scores):            min_s, max_s = scores.min(), scores.max()            if max_s == min_s:                return np.ones_like(scores)            return (scores - min_s) / (max_s - min_s)                norm_bm25 = normalize(bm25_scores)        norm_dense = normalize(dense_scores)                # alpha weight for dense, (1-alpha) for BM25        return self.alpha * norm_dense + (1 - self.alpha) * norm_bm25

Performance notes from testing:

For a corpus of 1,000 legal documents (~500 words each):

Indexing time: ~30 seconds (25s for dense embeddings, 5s for BM25)
Query latency:
- BM25 only: 15-20ms
- Dense only: 10-15ms (using numpy dot product, not FAISS)
- Hybrid (RRF): 30-40ms
- Hybrid (weighted): 35-45ms

The fusion overhead is minimal. Most latency comes from embedding the query (~8-12ms with the bi-encoder).

Cross-Encoder Reranking: When Hybrid Isn't Enough

Even with hybrid search, you'll get ranking mistakes. Here's a real example from our system:

Query: "Can a company fire someone for religious beliefs?"

Hybrid search top 5:

Article 15: Right against discrimination (score: 0.85)
Industrial Disputes Act: Unfair dismissal (score: 0.82)
Article 25: Right to freedom of religion (score: 0.79)
Labor Code Section 12: Employment termination (score: 0.76)
Contract Act: Breach of employment contract (score: 0.72)

The problem? Article 15 (discrimination) is semantically related but not the best answer. The Industrial Disputes Act is more directly relevant. But hybrid search scored Article 15 higher because it had better keyword overlap.

This is where cross-encoders help. Instead of encoding query and document separately, a cross-encoder processes them together:

code

from sentence_transformers import CrossEncoderclass HybridSearchWithReranking(HybridSearch):    def __init__(self, alpha=0.5, fusion_method='rrf', use_reranking=True):        super().__init__(alpha, fusion_method)        self.use_reranking = use_reranking        if use_reranking:            # 22.7M parameter model, ~90MB            self.reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')        def search(self, query: str, top_k=10, rerank_top_n=20):        """        rerank_top_n: Retrieve more candidates, then rerank down to top_k        """        # Step 1: Hybrid retrieval gets rerank_top_n candidates        candidates = super().search(query, top_k=rerank_top_n)                if not self.use_reranking:            return candidates[:top_k]                # Step 2: Cross-encoder reranking        doc_indices = [idx for idx, _, _ in candidates]        docs = [text for _, _, text in candidates]                # Create query-document pairs        pairs = [(query, doc) for doc in docs]                # Get reranker scores (this is the expensive part)        # On CPU: ~60-100ms for 20 pairs        # On GPU: ~50-80ms for 20 pairs        rerank_scores = self.reranker.predict(pairs)                # Sort by reranker score        reranked = sorted(            zip(doc_indices, rerank_scores, docs),            key=lambda x: x[1],            reverse=True        )[:top_k]                return reranked

After reranking:

Industrial Disputes Act: Unfair dismissal (rerank score: 9.2)
Article 15: Right against discrimination (rerank score: 8.7)
Labor Code Section 12: Employment termination (rerank score: 7.1)
Article 25: Freedom of religion (rerank score: 6.8)
Contract Act: Employment contracts (rerank score: 5.2)

Much better. The cross-encoder correctly identified that "unfair dismissal" is more directly relevant to the query about firing someone.

Cost/benefit of reranking:

Benefit: 10-25% improvement in top-3 precision (based on our manual evaluation) Cost: +60-100ms latency on CPU

Is it worth it? Depends on your use case:

Customer-facing legal search: Yes (accuracy matters more than 100ms)
Internal document exploration: Maybe (users might tolerate lower precision)
High-QPS API: Probably not (latency budget too tight)

Evaluation: Measuring What Actually Works

Here's where most tutorials hand-wave. They show Precision@5 numbers without explaining how they got them.

Evaluation requires ground truth: human-labeled relevant documents for each query. This is expensive and time-consuming. For our legal corpus, I manually labeled relevance for 50 queries × ~10 candidate documents each = 500 judgments. Took about 8 hours.

code

def evaluate(search_engine, test_queries, ground_truth, k=5):    """    test_queries: List of queries    ground_truth: Dict[query_id -> Set[relevant_doc_indices]]    """    precisions = []    recalls = []    ndcgs = []        for query_id, query in enumerate(test_queries):        results = search_engine.search(query, top_k=k)        retrieved_docs = {idx for idx, _, _ in results}        relevant_docs = ground_truth[query_id]                # Precision@k: How many retrieved docs are relevant?        true_positives = len(retrieved_docs & relevant_docs)        precision = true_positives / k if k > 0 else 0        precisions.append(precision)                # Recall@k: How many relevant docs did we retrieve?        recall = true_positives / len(relevant_docs) if relevant_docs else 0        recalls.append(recall)                # NDCG@k: Accounts for ranking quality        # Higher-ranked relevant docs contribute more        ndcg = compute_ndcg(results, relevant_docs, k)        ndcgs.append(ndcg)        return {        'precision@k': np.mean(precisions),        'recall@k': np.mean(recalls),        'ndcg@k': np.mean(ndcgs),        'num_queries': len(test_queries)    }def compute_ndcg(results, relevant_docs, k):    """Normalized Discounted Cumulative Gain"""    dcg = 0    for rank, (doc_idx, _, _) in enumerate(results[:k], start=1):        if doc_idx in relevant_docs:            # Relevance = 1 if relevant, 0 if not (binary relevance)            # More sophisticated: use graded relevance (0-3 scale)            dcg += 1 / np.log2(rank + 1)        # Ideal DCG: if all relevant docs were ranked first    ideal_dcg = sum(1 / np.log2(i + 2) for i in range(min(k, len(relevant_docs))))        return dcg / ideal_dcg if ideal_dcg > 0 else 0

Our results on 50 legal queries:

Method	Precision@5	Recall@5	NDCG@5	Avg Latency
BM25 only	0.62	0.71	0.68	18ms
Dense only	0.54	0.79	0.64	12ms
Hybrid (RRF)	0.71	0.76	0.75	35ms
Hybrid (α=0.4)	0.74	0.73	0.78	38ms
Hybrid + Rerank	0.82	0.72	0.84	115ms

What these numbers mean:

BM25 has high precision but misses paraphrases (lower recall)
Dense retrieval has high recall but retrieves irrelevant similar documents (lower precision)
RRF fusion improves balance without tuning
Weighted fusion (α=0.4) beats RRF because we tuned it for our domain
Reranking boosts precision and NDCG significantly, at cost of latency

Important caveats:

These are results on OUR dataset (1,000 Indian legal documents, 50 test queries)
Your domain will have different numbers
50 queries is not statistically significant for production (need 200+)
We used binary relevance (relevant/not relevant), not graded (somewhat/very/extremely relevant)

Don't trust anyone's benchmark numbers unless they explain their evaluation methodology.

What Actually Breaks in Production

Problem 1: Alpha weights need retuning when data changes

We started with alpha=0.5 (equal weighting). Precision was 0.68. After analyzing failures, we found BM25 was missing on too many paraphrase queries. Increased alpha to 0.6 (more dense weight). Precision dropped to 0.65—now we were getting too many vaguely related documents.

Settled on alpha=0.4 after testing 0.3, 0.4, 0.5, 0.6, 0.7 on our evaluation set.

Then the corpus changed. We added case law summaries that use different language than statutes. Alpha=0.4 stopped working. Had to retune to alpha=0.5.

Lesson: Optimal weights are corpus-specific and drift over time. Plan to reevaluate periodically.

Problem 2: Cross-encoder latency becomes a bottleneck

At 10 queries/second, reranking 20 candidates per query = 200 document pairs/second. On CPU, that's ~10 seconds of compute per second. You need parallel processing or GPU.

On a T4 GPU, we can batch process ~500 pairs/second. That handles 25 QPS comfortably. But T4 costs ₹34,558/month on AWS (on-demand) or ₹10,000-18,000/month (spot instances).

Alternative: Cache reranked results for common queries. 80% of our queries are repeats. Caching reduced GPU usage by 75%.

Problem 3: Short queries break dense retrieval

Query: "420" Dense retrieval returns: Documents about numbers, dates, addresses—anything with "420" in it.

BM25 correctly finds Section 420 IPC.

For queries shorter than 3 words, we increase BM25 weight (alpha=0.2 instead of 0.4). This is a hack but it works.

Problem 4: Evaluation is never done

You fix precision, recall drops. You fix recall, NDCG drops. You optimize NDCG, latency increases. There's no "done" state, only tradeoffs you can live with.

Our current setup (hybrid with reranking, alpha=0.4, cache enabled):

Precision@5: 0.82 (good enough)
Recall@5: 0.72 (acceptable)
NDCG@5: 0.84 (solid)
P95 latency: 140ms (livable)

We could boost precision to 0.87 with a larger reranker (bge-reranker-large), but latency would jump to 250ms. Not worth it for our use case.

Running the System (Docker Setup)

The code is on GitHub if you want to try it:

code

git clone https://github.com/ranjankumar-gh/hybrid-legal-search.gitcd hybrid-legal-searchdocker build -t hybrid-search .docker run -p 8000:8000 hybrid-search

Open http://localhost:8000 for the UI.

What you get:

Search interface with alpha weight slider (0 to 1)
Results showing BM25 score, dense score, and final fused score
Query term highlighting
Option to enable/disable reranking

What's missing (because this is a demo, not production):

No persistent storage (everything in memory)
No query logging or analytics
No A/B testing framework
No authentication or rate limiting
Synthetic dataset (use with caution for real legal research)

To use this for real:

Replace the toy dataset with your actual corpus
Swap in-memory storage with Qdrant/Weaviate/Elasticsearch
Add query logging and evaluation pipeline
Tune alpha weights on your evaluation set
Decide if reranking is worth the latency cost
Set up monitoring (latency, precision, user feedback)

When NOT to Use Hybrid Search

Hybrid search is not always better. Here's when to stick with simpler approaches:

Use BM25 only when:

Your queries are keyword-based (product SKUs, error codes, IDs)
Exact matches matter more than semantic similarity
You need <20ms latency (dense encoding adds 10-15ms)
Your corpus is small (<10k documents) and well-indexed

Use dense retrieval only when:

Queries are natural language questions
You want semantic similarity over keyword precision
Your domain has high synonym/paraphrase variation
You can tolerate some false positives

Use hybrid when:

You have BOTH keyword and natural language queries
Precision and recall both matter
You're willing to spend time tuning fusion weights
Latency budget is >50ms

Skip reranking when:

Hybrid search precision is already >80%
Latency budget is <100ms
Query volume is high (>100 QPS) and you can't cache effectively
You don't have GPU and can't tolerate CPU latency

The Honest Conclusion

Hybrid search isn't magic. It's two search systems glued together with fusion logic that needs tuning. Get the weights wrong and it's worse than using one method alone.

But when done right—correct fusion method, properly tuned weights, evaluation-driven optimization—it genuinely improves search quality. Not 10x better. Maybe 15-30% better on precision and recall. Which, if your use case depends on search quality, is huge.

The hard parts aren't the code (that's straightforward). The hard parts are:

Getting ground truth labels for evaluation
Tuning fusion weights for your specific data
Deciding if reranking latency is worth the precision gain
Monitoring and retuning as your corpus evolves

If you're building search for AI systems (RAG, legal research, document QA), hybrid search is worth implementing. Just don't expect it to work out of the box with default settings. Budget time for evaluation and tuning.

And remember: the best search system is the one that works for your users on your data, not the one with the highest benchmark numbers on someone else's dataset.

Code Repository: https://github.com/ranjankumar-gh/hybrid-legal-search

Information Retrieval