Most RAG systems ship with a straightforward architecture: embed the query, find the top-k nearest chunks in vector space, feed them to the LLM. It's a reasonable starting point - but in production, over domain-specific corpora, it plateaus fast.
This post covers the two upgrades that gave us a 25% lift in retrieval accuracy on a financial document corpus: hybrid retrieval and cross-encoder re-ranking.
Why Dense Retrieval Alone Falls Short
Dense retrieval (embedding-based similarity search) is excellent at capturing semantic meaning. Ask "what is the repayment schedule?" and it finds chunks about loan terms even if those exact words don't appear.
But it has two weaknesses:
- Lexical misses - domain-specific identifiers, product codes, and abbreviations that the embedding model rarely saw during pretraining don't cluster correctly in embedding space, so a query for an exact string can miss the chunk that literally contains it.
- Top-k is a blunt instrument - the top 10 chunks by cosine similarity reach the LLM as a flat list, with no signal about how much better chunk #1 is than chunk #10. Everything in the window gets equal weight in the prompt.
For a general-purpose chatbot, this is fine. For document-heavy domains - financial regulations, insurance policies, medical records - it isn't.
Hybrid Retrieval: Combining BM25 and Dense Search
The fix for the first problem is hybrid retrieval: run BM25 (lexical) and dense search in parallel, then merge results.
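```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS

# Assumes `docs` (a list of LangChain Documents) and `embeddings` (an
# embedding model) are defined earlier; import paths vary by LangChain version.
faiss_vectorstore = FAISS.from_documents(docs, embeddings)

# Dense retriever
dense_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k": 20})

# Sparse retriever (needs the rank_bm25 package)
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 20

# Hybrid: 60% dense, 40% sparse
ensemble = EnsembleRetriever(
    retrievers=[dense_retriever, bm25_retriever],
    weights=[0.6, 0.4],
)
```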
The key insight: BM25 is hard to beat for exact-match recall, while dense retrieval handles synonyms and paraphrases. Combined, you get more of the relevant chunks into the candidate pool before the next stage.
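One common way to merge two ranked lists is weighted Reciprocal Rank Fusion (RRF), which, as far as we know, is also what `EnsembleRetriever` does under the hood. The sketch below is a minimal standalone version for intuition, not the library's code: the function name `weighted_rrf`, the chunk ids, and the constant `c=60` are illustrative assumptions.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of chunk ids with weighted Reciprocal Rank Fusion.

    Each list contributes weight / (c + rank) for every id it contains,
    so an id that ranks well in both lists rises to the top.
    """
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += weight / (c + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical ranked results from each retriever
dense_hits = ["c1", "c2", "c3"]
bm25_hits = ["c4", "c1", "c5"]

fused = weighted_rrf([dense_hits, bm25_hits], weights=[0.6, 0.4])
print(fused)
```

Here `c1` wins because it appears in both lists, even though BM25 ranked it second. Fusing on ranks rather than raw scores sidesteps the fact that BM25 scores and cosine similarities live on different scales.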
The weights (0.6 / 0.4) aren't universal - tune them on a labeled eval set for your domain.
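A rough sketch of what that tuning loop can look like. It assumes recall@k as the metric (this post doesn't prescribe one) and a hypothetical `retrieve(query, dense_weight)` callable that wraps your hybrid retriever and returns ranked chunk ids; `toy_retrieve` and the eval example are stand-ins so the snippet runs on its own.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunk ids that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def tune_dense_weight(retrieve, eval_set, k=10, grid=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Grid-search the dense weight; the sparse weight is 1 - dense_weight."""
    best_weight, best_recall = grid[0], -1.0
    for weight in grid:
        mean_recall = sum(
            recall_at_k(retrieve(query, weight), relevant, k)
            for query, relevant in eval_set
        ) / len(eval_set)
        if mean_recall > best_recall:
            best_weight, best_recall = weight, mean_recall
    return best_weight, best_recall


# Stand-in retriever: the relevant chunk only surfaces when BM25 has enough weight
def toy_retrieve(query, dense_weight):
    return ["c7", "c2"] if dense_weight <= 0.6 else ["c2", "c9"]


# Labeled eval set: (query, set of relevant chunk ids)
eval_set = [("default rate for product LN-204", {"c7"})]

best_weight, best_recall = tune_dense_weight(toy_retrieve, eval_set, k=1)
print(best_weight, best_recall)
```

In practice the eval set needs enough labeled queries that the differences between grid points aren't noise, and it helps to hold out part of it to check the chosen weight.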
Cross-Encoder Re-ranking
Hybrid retrieval solves the recall problem. Re-ranking solves the precision problem.
After hybrid retrieval, you have a pool of ~20–40 candidate chunks. A cross-encoder - a small BERT-style model fine-tuned for passage relevance - scores each (query, chunk) pair directly. Unlike embedding similarity, where the query and the passage are encoded separately and compared afterwards, the cross-encoder reads both together, so it can attend to how specific query terms relate to specific passage terms.
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    # Score every (query, chunk) pair jointly, then keep the top_n chunks.
    pairs = [(query, chunk) for chunk in candidates]
    scores = reranker.predict(pairs)
    # Sort on score only, so ties never fall back to comparing chunk text.
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_n]]
```
This is the step most teams skip, because it adds latency on the order of 200ms. In our case, the accuracy gain was worth it. After re-ranking, we feed only the top 5 chunks to the LLM instead of the whole candidate pool - shorter context, more precise answer.
Metadata Filtering
One more lever: don't let retrieval degrade by searching the entire corpus when you already know the scope.
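If a user asks about "Q3 2023 loan default rates," you can filter by document type and date range so that out-of-scope chunks never compete for the top-k slots. Depending on the vector store, the filter is applied before or after the nearest-neighbor search, so check how yours handles it.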
```python
retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 20,
        "filter": {"doc_type": "financial_report", "year": 2023},
    }
)
```
Metadata schema design matters more than most engineers expect. If you're building a new RAG system, invest time upfront tagging your chunks - it pays off at query time.
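To make "schema design" a bit more concrete, here is one possible shape for chunk metadata. Only `doc_type` and `year` come from the filter example above; `doc_id`, `quarter`, `section`, and the `matches` helper are hypothetical, and your corpus will likely need different fields.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ChunkMetadata:
    doc_id: str
    doc_type: str                  # e.g. "financial_report", "loan_agreement"
    year: int
    quarter: Optional[int] = None  # 1-4 when the document is quarterly
    section: Optional[str] = None  # e.g. "Risk Factors"


def matches(meta: ChunkMetadata, filters: dict) -> bool:
    """True when every filter key equals the corresponding metadata field."""
    record = asdict(meta)
    return all(record.get(key) == value for key, value in filters.items())


# Hypothetical chunks
chunks = [
    ChunkMetadata("ar-2023", "financial_report", 2023, quarter=3),
    ChunkMetadata("la-17", "loan_agreement", 2021),
]

in_scope = [m for m in chunks if matches(m, {"doc_type": "financial_report", "year": 2023})]
print([m.doc_id for m in in_scope])
```

Typing the fields up front (an integer `year` rather than a free-text date string, for instance) is what makes exact-match filters like the one above reliable.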
What This Looks Like End-to-End
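Putting the pieces together, the flow is roughly: scope the search with metadata filters, run hybrid retrieval to build the candidate pool, re-rank with the cross-encoder, and pass only the top chunks to the LLM. The sketch below wires those stages as plain callables; `answer_query` and the `fake_*` stubs are hypothetical stand-ins for the real retriever, re-ranker, and LLM call, there only so the snippet runs on its own.

```python
def answer_query(query, filters, retrieve, rerank, generate, top_n=5):
    """Filter -> hybrid retrieve -> re-rank -> generate."""
    candidates = retrieve(query, filters)     # hybrid retrieval within the filtered scope
    unique = list(dict.fromkeys(candidates))  # drop duplicates, keep first-seen order
    top_chunks = rerank(query, unique, top_n) # cross-encoder keeps the best top_n
    return generate(query, top_chunks)        # LLM sees only the re-ranked chunks


# Stubs standing in for the real components
def fake_retrieve(query, filters):
    return ["chunk a", "chunk b", "chunk a", "chunk c"]


def fake_rerank(query, chunks, top_n):
    return list(reversed(chunks))[:top_n]


def fake_generate(query, chunks):
    return f"answer from {len(chunks)} chunks"


result = answer_query(
    "Q3 2023 loan default rates",
    {"doc_type": "financial_report", "year": 2023},
    fake_retrieve,
    fake_rerank,
    fake_generate,
    top_n=2,
)
print(result)
```

The deduplication step matters because the dense and sparse retrievers often return the same chunk, and scoring it twice in the cross-encoder is wasted latency.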
Results
On our financial document corpus (10k+ pages of loan agreements, annual reports, and regulatory filings):
| Configuration | Retrieval Accuracy |
|---|---|
| Dense-only, top-10 | Baseline |
| Hybrid retrieval, top-10 | +14% |
| Hybrid + re-ranking, top-5 | +25% |
The latency cost of re-ranking was ~180ms p95. For our use case - an internal analyst tool - that was acceptable. If you're building a real-time chat API, you might want to use a lighter cross-encoder or cache re-ranked results for repeated queries.
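For the caching option, a minimal sketch using `functools.lru_cache`. It assumes a re-rank function with the signature `(query, candidates, top_n)` and that an exact-match key on the query and candidate list is good enough; `make_cached_reranker` and `slow_rerank` are hypothetical names.

```python
from functools import lru_cache


def make_cached_reranker(rerank_fn, maxsize=1024):
    """Wrap a rerank function so repeated (query, candidates, top_n) calls hit a cache."""

    @lru_cache(maxsize=maxsize)
    def _cached(query, candidates, top_n):
        # candidates arrives as a tuple because lru_cache needs hashable arguments
        return tuple(rerank_fn(query, list(candidates), top_n))

    def cached_rerank(query, candidates, top_n=5):
        return list(_cached(query, tuple(candidates), top_n))

    return cached_rerank


# Stand-in for the cross-encoder call; records how often it actually runs
calls = []


def slow_rerank(query, candidates, top_n):
    calls.append(query)
    return sorted(candidates)[:top_n]


cached = make_cached_reranker(slow_rerank)
first = cached("q3 default rates", ["b", "a", "c"], top_n=2)
second = cached("q3 default rates", ["b", "a", "c"], top_n=2)
print(first, second, len(calls))
```

The cache only helps when the same query returns the same candidates, so it should be cleared whenever the index is rebuilt or documents change.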
Key Takeaways
- Dense retrieval alone is a good baseline but plateaus on domain-specific corpora
- Hybrid retrieval (BM25 + dense) measurably improves recall for lexically precise queries
- Cross-encoder re-ranking improves precision by scoring (query, passage) pairs holistically
- Metadata filtering is underrated - design your chunk metadata schema early
- Tune retrieval weights on a domain-specific labeled eval set - don't guess