RAG-Based Document Intelligence

Context

This system was built for domain-specific document intelligence over policy, insurance, and financial documents. The source data included OCR output, long-form PDFs, tables, and metadata that needed to be searchable and explainable.

The goal was to move beyond a prototype embedding search and build a retrieval pipeline that could support production question answering with fewer hallucinations and better traceability.

Problem

The first version used simple embedding search and plateaued around 65% retrieval accuracy. The main issues were:

Exact-match failures: embedding retrieval missed regulatory codes, IDs, clause names, and rare domain terms.
Chunk fragmentation: important context was split across chunks, especially in policy and table-heavy documents.
Noisy top-k results: similar chunks competed for limited context-window space.
Weak evaluation loop: prompt changes were difficult to compare without retrieval-level metrics.

Architecture

I implemented a multi-stage retrieval pipeline that separates recall from precision. Hybrid retrieval first gathers a broad candidate set, then re-ranking narrows it to the most relevant chunks before generation.

Components

OCR ingestion: converts PDFs and scanned documents into structured text and layout-aware metadata.
Chunking layer: creates retrieval units while preserving semantic boundaries and document metadata.
Hybrid retriever: combines lexical search for exact terms with vector search for semantic similarity.
Re-ranker: scores candidate chunks with a cross-encoder before context is passed to the LLM.
Evaluation loop: tracks retrieval quality, answer relevance, and faithfulness across test sets.
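
As an illustration of the chunking layer, here is a minimal sketch that packs whole paragraphs into chunks so that no paragraph is cut in half. It is not the production code: the Chunk class, the chunk_paragraphs function, the blank-line paragraph delimiter, and the 800-character budget are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    doc_id: str
    index: int  # position of the chunk within its document


def chunk_paragraphs(text: str, doc_id: str, max_chars: int = 800) -> list[Chunk]:
    """Pack whole paragraphs into chunks of at most max_chars characters.

    A paragraph longer than max_chars becomes its own oversized chunk
    rather than being cut mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[Chunk] = []
    current: list[str] = []
    size = 0
    for para in paragraphs:
        # Count the two-character separator when the buffer is not empty.
        added = len(para) + (2 if current else 0)
        # Flush the buffer if adding this paragraph would exceed the budget.
        if current and size + added > max_chars:
            chunks.append(Chunk("\n\n".join(current), doc_id, len(chunks)))
            current, size = [], 0
            added = len(para)
        current.append(para)
        size += added
    if current:
        chunks.append(Chunk("\n\n".join(current), doc_id, len(chunks)))
    return chunks


sample = "A" * 10 + "\n\n" + "B" * 10 + "\n\n" + "C" * 10
sample_chunks = chunk_paragraphs(sample, doc_id="policy-001", max_chars=25)
```

With a 25-character budget the first two paragraphs share a chunk and the third starts a new one. Table-heavy documents would likely need a layout-aware splitter instead of a blank-line split.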

Key Decisions

Hybrid retrieval over vector-only search

Vector search works well for semantic similarity, but it often misses rare, exact-match terms such as policy IDs, regulation numbers, and clause names. Adding BM25 lexical search alongside the vector index improved recall for those cases.
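
Lexical and vector results still have to be merged into a single ranking. One common way to do that is reciprocal rank fusion (RRF), sketched below. This is an assumption for illustration, not necessarily the fusion method used in this system; the function name, the chunk IDs, and the constant k = 60 are all hypothetical choices.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each list contributes 1 / (k + rank) per chunk, with rank starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical ranked chunk IDs returned by each retriever for one query.
bm25_hits = ["clause-7", "clause-2", "clause-9"]
vector_hits = ["clause-2", "clause-4", "clause-7"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Because RRF uses only rank positions, it avoids having to normalize BM25 scores and cosine similarities onto a common scale. Chunks found by both retrievers rise to the top of the fused list.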

Re-ranking before generation

Re-ranking adds latency, but it improves context quality before the LLM sees the prompt. This was a better trade-off than letting the generator reason over noisy top-k chunks.
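
The re-ranking step can be sketched as a function that scores each (query, chunk) pair and keeps only the best few. To stay self-contained, the example uses a toy token-overlap scorer as a stand-in for the cross-encoder; rerank, overlap_score, and the sample texts are hypothetical names and data, not the production code.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> list[tuple[str, float]]:
    """Score every (query, chunk) pair and keep the top_n chunks."""
    scored = [(chunk, score_fn(query, chunk)) for chunk in candidates]
    # list.sort is stable, so equally scored chunks keep their retrieval order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def overlap_score(query: str, chunk: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query tokens in the chunk."""
    query_tokens = set(query.lower().split())
    chunk_tokens = set(chunk.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & chunk_tokens) / len(query_tokens)


candidates = [
    "premium payments are due monthly",
    "the deductible for water damage is 500",
    "water damage caused by flooding is excluded",
]
top = rerank("water damage deductible", candidates, overlap_score, top_n=2)
```

In a real pipeline, score_fn would wrap a cross-encoder model that reads the query and the chunk together. The added latency grows with the number of candidates, which is why this step runs on a shortlist rather than the whole index.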

Metadata as a first-class retrieval signal

Document type, page number, section, effective date, and source system metadata were kept with each chunk. This allowed retrieval to be filtered on those fields and let each answer be traced back to the document, section, and page it came from.
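
A minimal sketch of what carrying metadata with each chunk might look like is shown below. The field names mirror the list above, but the ChunkRecord class, the filter_chunks and citation helpers, and the sample records are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ChunkRecord:
    """A retrieval unit that carries its provenance with it."""
    text: str
    doc_type: str
    page: int
    section: str
    effective_date: date
    source_system: str


def filter_chunks(
    chunks: list[ChunkRecord],
    doc_type: Optional[str] = None,
    as_of: Optional[date] = None,
) -> list[ChunkRecord]:
    """Keep chunks of the requested type whose effective date is on or before as_of."""
    kept = []
    for chunk in chunks:
        if doc_type is not None and chunk.doc_type != doc_type:
            continue
        if as_of is not None and chunk.effective_date > as_of:
            continue
        kept.append(chunk)
    return kept


def citation(chunk: ChunkRecord) -> str:
    """Format a human-readable source reference for an answer."""
    return f"{chunk.source_system}: {chunk.section}, p. {chunk.page}"


# Hypothetical records; the values are invented for the example.
chunks = [
    ChunkRecord("Flood damage is excluded.", "policy", 12, "Exclusions",
                date(2023, 1, 1), "policy-admin"),
    ChunkRecord("Flood damage is covered up to a limit.", "policy", 14, "Exclusions",
                date(2025, 1, 1), "policy-admin"),
    ChunkRecord("Q3 revenue summary.", "financial", 3, "Summary",
                date(2023, 10, 1), "finance-dw"),
]
in_force = filter_chunks(chunks, doc_type="policy", as_of=date(2024, 6, 1))
```

Filtering before ranking keeps a later policy version from competing with the version that applied on the date in question, and the citation string gives the reader a page to check.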

Reliability & Evaluation

Used retrieval-level evaluation instead of judging only final generated answers.
Compared chunking strategies using labeled Q&A pairs and context precision.
Tracked faithfulness and answer relevance to catch cases where the LLM answered beyond retrieved evidence.
Improved refusal behavior for out-of-document questions: with cleaner retrieved context, it was easier for the generator to recognize when supporting evidence was missing.
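
To make retrieval-level evaluation concrete, the sketch below computes precision and recall over the top k retrieved chunk IDs for a set of labeled questions. It uses plain precision@k as a simple stand-in for context precision, which some evaluation frameworks define in a rank-weighted way; the function names and the test cases are hypothetical.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are labeled relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(cid in relevant for cid in top) / len(top)


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the labeled relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def evaluate(cases: list[tuple[list[str], set[str]]], k: int) -> dict[str, float]:
    """Average both metrics over (retrieved_ids, relevant_ids) test cases."""
    n = len(cases)
    if n == 0:
        return {"precision": 0.0, "recall": 0.0}
    return {
        "precision": sum(precision_at_k(r, rel, k) for r, rel in cases) / n,
        "recall": sum(recall_at_k(r, rel, k) for r, rel in cases) / n,
    }


# Each case pairs the retriever's ranked output with the labeled relevant chunks.
cases = [
    (["a", "b", "c"], {"a", "c"}),
    (["x", "y", "z"], {"q"}),
]
report = evaluate(cases, k=2)
```

Running the same cases against two chunking or ranking configurations gives a direct comparison that does not depend on how the generator phrases its answer.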

What I Owned

Built the OCR ingestion and document parsing flow for unstructured financial and policy documents.
Implemented hybrid retrieval, metadata-aware filtering, de-duplication, and cross-encoder re-ranking.
Created the evaluation workflow used to compare retrieval quality across chunking and ranking strategies.
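
As one example of the items above, de-duplication can be sketched as dropping any candidate chunk that overlaps too heavily with one already kept. The version below uses Jaccard similarity over lowercase word sets; that similarity measure, the 0.8 threshold, and the function names are assumptions for illustration rather than a description of the actual implementation.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(chunks: list[str], threshold: float = 0.8) -> list[str]:
    """Keep the first chunk of each near-duplicate group, preserving order."""
    kept: list[str] = []
    kept_tokens: list[set[str]] = []
    for chunk in chunks:
        tokens = set(chunk.lower().split())
        # Keep the chunk only if it is sufficiently different from all kept ones.
        if all(jaccard(tokens, seen) < threshold for seen in kept_tokens):
            kept.append(chunk)
            kept_tokens.append(tokens)
    return kept


retrieved = [
    "The policy covers fire damage",
    "the policy covers fire damage",
    "Claims must be filed within 30 days",
]
unique = deduplicate(retrieved)
```

Removing near-duplicates before re-ranking frees context-window space for chunks that add new evidence. The pairwise comparison is quadratic in the number of candidates, which is acceptable for a shortlist but not for a whole corpus.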

Impact

Increased retrieval accuracy by about 25% over the embedding-only baseline of roughly 65%.
Reduced manual document processing time by about 60% for analyst workflows.
Improved hallucination control by making missing evidence more visible to the answer-generation step.
Created a retrieval evaluation loop that made future prompt and ranking changes measurable.