LLM System

RAG-Based Document Intelligence

LangChain · LlamaIndex · FAISS · Azure Doc Intel

Standard RAG systems often fail on complex, domain-specific documents such as insurance policies or financial regulations. This case study details the production-level upgrades that took our system from a prototype that plateaued at 65% accuracy to 90%+ accuracy.

Problem: The Semantic Noise

Initial testing with naive RAG (simple embedding search) plateaued at 65% accuracy. The models struggled with:

  1. Exact Lexical Matches: Embedding search missed specific regulatory codes.
  2. Context Fragmentation: Poor chunking strategies breaking semantic units.
  3. Large Corpora: Similar chunks competing for top-k slots, pushing the "correct" answer out of context.

Solution: The Advanced Retrieval Loop

I implemented a multi-stage retrieval architecture to maximize both recall and precision.

Engineering Insight: Why Hybrid?

Note: Embedding-only retrieval often misses "cold" keywords (specific ID numbers or rare legal terms) that appeared rarely, if at all, in the embedding model's training corpus. Adding BM25 keeps recall high for these exact identifiers.
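
One common way to merge a BM25 ranking with an embedding ranking is reciprocal rank fusion (RRF). The sketch below is illustrative only: this write-up does not state which fusion method the system used, and the function name, the document IDs, and the constant k=60 are assumptions chosen for the example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list, best first.

    Each document scores 1 / (k + rank) per list it appears in, so an item
    ranked highly by either retriever stays near the top of the fused list.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical rankings: BM25 finds the exact code, the dense retriever
# prefers a paraphrased clause.
bm25_ranking = ["clause_7b", "clause_2a", "clause_9c"]
dense_ranking = ["clause_2a", "clause_4d", "clause_7b"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on different scales.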

The "Production" Layer

  • Semantic Chunking: Used LlamaIndex's semantic splitter to ensure chunks are broken based on actual meaning change, not just token counts.
  • Cross-Encoder Re-ranking: Employed BGE-Reranker-v2 to process the top 40 candidates. This adds latency (~150ms) but drastically reduces hallucination by ensuring context purity.
  • Evaluation Framework: Implemented RAGAS (Retrieval-Augmented Generation Assessment) to track Faithfulness, Answer Relevance, and Context Precision across every iteration.
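
The re-ranking stage can be sketched as "score every (query, chunk) pair, keep the best few". In the minimal example below, the scorer is a toy word-overlap function standing in for a cross-encoder so the code runs without downloading a model; `rerank`, `toy_overlap_score`, and the sample chunks are hypothetical names and data, not the production code. In a real pipeline, `score_fn` would presumably wrap a cross-encoder model's pairwise prediction call.

```python
from typing import Callable, List, Tuple

def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Score each candidate chunk against the query and keep the top_n."""
    scored = [(chunk, score_fn(query, chunk)) for chunk in candidates]
    # Python's sort is stable, so equally scored chunks keep retrieval order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

def toy_overlap_score(query: str, chunk: str) -> float:
    """Stand-in scorer (assumption): fraction of query words found in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0

candidates = [
    "Premiums are due on the first of each month.",
    "Flood damage is excluded under clause 7b.",
    "Clause 7b does not apply to commercial policies.",
]
top = rerank("is flood damage excluded", candidates, toy_overlap_score, top_n=2)
for chunk, score in top:
    print(f"{score:.2f}  {chunk}")
```

Keeping the scorer as a plain callable makes it easy to swap the toy function for a real model and to measure the added latency of that one stage in isolation.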

Architectural Trade-offs

  1. FAISS vs. Managed Vector DB: I chose local FAISS for the prototype to ensure strict data privacy for sensitive financial docs, later migrating to a managed instance only after establishing an encryption-at-rest protocol.
  2. Chunk Size: We found 512 tokens to be the sweet spot: large enough for context, small enough to fit 5 candidate chunks into a 4k context window with room for a detailed prompt.
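
The chunk-size budget can be checked with simple arithmetic. The sketch below assumes "4k" means a 4,096-token window and that every chunk uses its full 512 tokens; the variable and function names are illustrative.

```python
CONTEXT_WINDOW = 4096  # assumption: "4k" means 4,096 tokens
CHUNK_TOKENS = 512     # chunk size from the trade-off above
NUM_CHUNKS = 5         # candidate chunks placed in the prompt

def remaining_budget(window: int, chunk_tokens: int, num_chunks: int) -> int:
    """Tokens left for instructions, the question, and the answer."""
    return window - chunk_tokens * num_chunks

retrieved_tokens = CHUNK_TOKENS * NUM_CHUNKS
left_over = remaining_budget(CONTEXT_WINDOW, CHUNK_TOKENS, NUM_CHUNKS)

print(f"Retrieved context: {retrieved_tokens} tokens")
print(f"Left for prompt and answer: {left_over} tokens")
```

Under these assumptions the five chunks take 2,560 tokens, leaving 1,536 for the prompt template, the user question, and the generated answer.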

Impact: Moving the Needle

The results on our internal evaluation set (1,000+ labeled Q&A pairs) were significant:

  • +25 percentage points in overall retrieval accuracy (from the 65% baseline to 90%+).
  • 60% reduction in manual document processing time for the analyst team.
  • Hallucination suppression: By improving chunk relevance, the LLM's "I don't know" rate increased for out-of-document questions, preventing costly false answers.