Retrieval-Augmented Generation (RAG) is one of the most effective techniques for making large language models (LLMs) answer accurately using external knowledge.
The idea is straightforward:
- Retrieve relevant documents from your knowledge base.
- Augment your LLM prompt with those documents.
- Generate an answer using the LLM.
Sounds simple, right? The problem is:
Even the best vector search algorithms sometimes return documents that are only loosely related to the query — or miss subtle but highly relevant matches.
This is where Reranking enters the scene — the “quality filter” for your retrieved documents.
What is Reranking in RAG?
Reranking is a second-stage filtering process that reorders retrieved documents by actual relevance to the user query, often using a more sophisticated model than the one used for the initial retrieval.
Think of it as precision tuning:
- Stage 1 (vector retrieval) → Fast and broad: retrieve 30–100 potentially relevant docs.
- Stage 2 (reranking) → Slow but sharp: deeply score these docs for true relevance.
This two-stage approach mirrors real-world search engines like Google, which first retrieve a broad set of results (recall-focused) and then apply a more precise ranking model (precision-focused).
This is especially important because standard retrieval models (like BM25, dense embeddings) often prioritize speed over deep contextual matching. Reranking uses more advanced models (like cross-encoders) that compare the query and each document together for higher precision.
Why Reranking Matters in RAG
Without reranking, your RAG system might answer from a less relevant document simply because the retriever’s default scoring ranked it higher.
Example:
Imagine a customer of the State Bank of India (SBI) asks:
“What is the minimum balance required for an SBI savings account in a metro city?”
Without Reranking:
- Retriever might pull in documents about fixed deposit interest rates, ATM withdrawal limits, and minimum balance rules for rural branches.
- The first retrieved document might mention “minimum balance” but for rural accounts, not metro city accounts.
With Reranking:
- The reranker analyzes the exact query and re-scores documents so that the top-ranked one specifically contains:
  - Metro city rules
  - SBI’s updated minimum balance criteria
  - Correct fee details if the balance falls below the limit
This ensures the generator receives the right context and produces a correct answer.
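To see where reranking slots into the overall flow, here is a minimal end-to-end sketch. The vector_search and generate_answer helpers are hypothetical placeholders for your own retriever and LLM call; only the reranking step uses a real model (the same cross-encoder used in the examples below):
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def answer_with_rag(query, knowledge_base, top_k=50, top_n=3):
    # Stage 1: broad, recall-focused retrieval (hypothetical helper)
    candidates = vector_search(query, knowledge_base, top_k=top_k)
    # Stage 2: precise, precision-focused reranking
    scores = reranker.predict([(query, doc) for doc in candidates])
    reranked = [doc for _, doc in sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)]
    # Augment the prompt with only the best top_n documents (hypothetical helper)
    context = "\n".join(reranked[:top_n])
    return generate_answer(query, context)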
Common Reranking Techniques
Here are the most common approaches used in production RAG systems:
1. Cross-Encoder Models
- Takes the query and document together as input.
- Outputs a single relevance score.
- Pros: Very accurate.
- Cons: Slower, since every query–document pair requires its own forward pass and document representations cannot be precomputed.
Python Example
from sentence_transformers import CrossEncoder
# Load a cross-encoder model
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
# Example query
query = "What is the minimum balance required for an SBI savings account in a metro city?"
# Retrieved documents
documents = [
    "SBI savings account in metro cities requires a minimum balance of Rs. 3,000 to avoid penalties.",
    "SBI fixed deposit interest rates vary between 3% and 6% depending on tenure.",
    "In rural areas, SBI savings accounts require a minimum balance of Rs. 1,000."
]
# Prepare pairs for scoring
pairs = [(query, doc) for doc in documents]
# Score each document for relevance
scores = model.predict(pairs)
# Sort by score (descending)
reranked_docs = [doc for _, doc in sorted(zip(scores, documents), reverse=True)]
print("Reranked Documents:")
for doc in reranked_docs:
    print(doc)
Sample Output:
Reranked Documents:
SBI savings account in metro cities requires a minimum balance of Rs. 3,000 to avoid penalties.
In rural areas, SBI savings accounts require a minimum balance of Rs. 1,000.
SBI fixed deposit interest rates vary between 3% and 6% depending on tenure.
2. Bi-Encoder + Cross-Encoder Hybrid
- First, a fast bi-encoder retrieves candidates.
- Then, a cross-encoder reranks the top results.
- Best of both worlds — speed and accuracy.
Python Example
from sentence_transformers import SentenceTransformer, CrossEncoder, util
# Step 1: Create SBI corpus
corpus = [
    "The minimum balance required for SBI savings account is ₹1000 in metro cities.",
    "SBI provides 7.5% interest rate for senior citizen fixed deposits.",
    "You can link your Aadhaar to your SBI account through the YONO app.",
    "SBI charges ₹20 per transaction for ATM withdrawals beyond the free limit.",
    "The SBI home loan interest rate starts from 8.5% per annum.",
    "SBI credit cards offer reward points on every transaction."
]
# Step 2: Load Bi-Encoder and Cross-Encoder
bi_encoder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1') # For retrieval
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2') # For reranking
# Step 3: Encode corpus for Bi-Encoder retrieval
corpus_embeddings = bi_encoder.encode(corpus, convert_to_tensor=True)
# Step 4: User query
query = "What is the interest rate for senior citizen FD in SBI?"
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
# Step 5: Retrieve top N candidates using Bi-Encoder
top_k = 3
bi_encoder_hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)[0]
# Step 6: Prepare for Cross-Encoder reranking
cross_inp = [(query, corpus[hit['corpus_id']]) for hit in bi_encoder_hits]
cross_scores = cross_encoder.predict(cross_inp)
# Step 7: Combine results and sort by Cross-Encoder score
reranked_results = sorted(
    zip(cross_inp, cross_scores),
    key=lambda x: x[1],
    reverse=True
)
# Step 8: Print results
print(f"Query: {query}\n")
print("Top Results after Reranking:")
for (q, passage), score in reranked_results:
    print(f"Score: {score:.4f} | {passage}")
Sample Output:
Query: What is the interest rate for senior citizen FD in SBI?
Top Results after Reranking:
Score: 8.5123 | SBI provides 7.5% interest rate for senior citizen fixed deposits.
Score: 5.9012 | The SBI home loan interest rate starts from 8.5% per annum.
Score: 3.2710 | SBI credit cards offer reward points on every transaction.
3. LLM-based Reranking
- Uses large language models (e.g., GPT, LLaMA) to rate document relevance.
- Can understand nuanced and multi-step queries.
- Higher cost, but sometimes worth it for complex domains.
Python Example
from transformers import AutoModelForCausalLM, AutoTokenizer
# 1. SBI Corpus
corpus = [
    "The minimum balance required for SBI savings account in metro cities is ₹3000.",
    "SBI offers a 3.5% interest rate for savings accounts up to ₹1 lakh.",
    "SBI home loan interest rate starts from 8.5% per annum.",
    "SBI fixed deposit for senior citizens offers 7.5% per annum interest."
]
# 2. Simulated Retrieval Output
retrieved_docs = [
    corpus[1],  # savings account interest
    corpus[3],  # senior citizen FD
    corpus[0]   # minimum balance
]
query = "What interest rate does SBI offer for fixed deposits for senior citizens?"
# 3. Load Phi-3-Mini-Instruct Model from Hugging Face
# Supports chat-style prompts with system, user, and assistant roles
model_name = "microsoft/Phi-3-mini-128k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True
)
# 4. Build chat-style prompt for reranking (Phi-3 chat format)
prompt = "<|system|>\nYou are an assistant that ranks documents by relevance.<|end|>\n"
prompt += f"<|user|>\nQuery: {query}\nDocuments:\n"
for idx, doc in enumerate(retrieved_docs):
    prompt += f"{idx}: {doc}\n"
prompt += "Provide the ranking as a list of indexes [most relevant first], plus a brief explanation.<|end|>\n<|assistant|>\n"
# 5. Tokenize and generate (greedy decoding for a deterministic ranking)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=False
)
# Decode only the newly generated tokens, not the prompt
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print("=== Reranking Response ===")
print(response)
Sample Output:
=== Reranking Response ===
[1, 2, 0]
The most relevant document is index 1: "SBI fixed deposit for senior citizens offers 7.5% per annum interest."
It directly answers the query about FD interest for senior citizens.
Next is index 2: "The minimum balance required for SBI savings account in metro cities is ₹3000."
While not about fixed deposits, it mentions account-related terms.
Index 0: "SBI offers a 3.5% interest rate for savings accounts up to ₹1 lakh."
This is least relevant because it talks about savings account rates, not fixed deposit rates.
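One practical detail the example above leaves out: the LLM returns free-form text, so the ranking must be parsed before you can actually reorder the documents. Here is a minimal sketch, assuming the response starts with a bracketed list of indexes as in the sample output; the parse_ranking helper is illustrative, not part of any library:
import re

def parse_ranking(response, docs):
    # Extract the first bracketed list of indexes, e.g. "[1, 2, 0]"
    match = re.search(r"\[([\d,\s]+)\]", response)
    if not match:
        return docs  # fall back to the original retrieval order
    order = [int(i) for i in match.group(1).split(",")]
    # Keep only valid indexes, then reorder the documents
    return [docs[i] for i in order if 0 <= i < len(docs)]

reranked_docs = parse_ranking(response, retrieved_docs)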
Best Practices for Reranking in RAG
- Limit the candidate pool — Avoid reranking all retrieved results; rerank only the top N (e.g., 50).
- Use domain-specific fine-tuning — Fine-tune reranker models on your domain data for better accuracy.
- Cache results — For frequent queries, store reranked results to save computation.
- Balance speed vs accuracy — In real-time applications, choose models that meet your latency requirements.
- Continuously evaluate — Track metrics like MRR (Mean Reciprocal Rank) and nDCG to measure impact; a minimal example follows below.
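To make the evaluation point concrete, here is a minimal sketch of computing MRR by hand and nDCG with scikit-learn’s ndcg_score. The relevance labels below are illustrative; in practice you would collect ground-truth judgments for a sample of real queries:
import numpy as np
from sklearn.metrics import ndcg_score

# 1 = relevant, 0 = irrelevant, in the order the reranker returned the docs
ranked_relevance = [
    [1, 0, 0],  # query 1: relevant doc ranked first
    [0, 1, 0],  # query 2: relevant doc ranked second
]

# MRR: average of 1 / (rank of the first relevant document)
def mean_reciprocal_rank(results):
    reciprocal_ranks = []
    for rels in results:
        rr = next((1.0 / (i + 1) for i, r in enumerate(rels) if r), 0.0)
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

print(f"MRR: {mean_reciprocal_rank(ranked_relevance):.3f}")  # (1 + 0.5) / 2 = 0.750

# nDCG: compare the reranker's ordering against the true relevance labels
true_relevance = np.asarray(ranked_relevance)
predicted_scores = np.asarray([[3, 2, 1], [3, 2, 1]])  # higher score = ranked earlier
print(f"nDCG: {ndcg_score(true_relevance, predicted_scores):.3f}")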
Conclusion
Reranking acts as a precision filter for RAG pipelines. By ensuring that the right documents make it to the generation stage, you can drastically reduce irrelevant or partially correct answers.
For any production-grade RAG system — whether it’s for banking FAQs, legal document search, or technical support — reranking can be the key differentiator in delivering high-quality, trustworthy AI answers.