Retrieval Quality · Sudeep Lalka

Retrieval substrate

Retrieval mode

BM25 stays the explainable baseline. Vector, hybrid, and governed re-rank build on the same retriever seam.

Lexical BM25 is the term-overlap baseline. It is strong when queries share important terms with the source text: fast, explainable, and a useful floor. It misses semantically similar evidence when the wording differs.

Query: “How many days do I have to submit a reimbursement request?”
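To make the term-overlap behavior concrete, here is a minimal BM25 sketch in Python. It is an illustration under assumptions, not the demo's actual scorer: the tokenizer, the function names, and the `k1` and `b` defaults are chosen for the example. The documents are the four evidence excerpts shown on this page.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase and split on anything that is not a letter or digit.
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with a standard BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # no exact term overlap, no contribution
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "Reimbursement requests must be submitted within 30 days of the expense date.",
    "Personal credit-card statements attached to claims may contain sensitive cardholder data.",
    "Manager approval is required for any expense over $500 before reimbursement.",
    "Employees may claim mileage at the standard rate for approved business travel.",
]
query = "How many days do I have to submit a reimbursement request?"
scores = bm25_scores(query, docs)
```

In this sketch "submit" does not match "submitted" and "request" does not match "requests", because matching is on exact tokens with no stemming. The second excerpt scores above zero only because it shares the word "to" with the query, which is the kind of weak overlap a lexical baseline cannot tell apart from a meaningful match.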

Ranked evidence

Top evidence for the selected retrieval mode (Lexical BM25).

#1 Expense Policy v3.1

lex 1 · vec 1 · hybrid 1 · final 1

Reimbursement requests must be submitted within 30 days of the expense date.

#2 Raw customer PII export (no citation meta)

lex 0.6 · vec 0.57 · hybrid 0.59 · final 0.6

Personal credit-card statements attached to claims may contain sensitive cardholder data.

#3 Approval Matrix v1.2

lex 0.42 · vec 0.43 · hybrid 0.42 · final 0.42

Manager approval is required for any expense over $500 before reimbursement.

#4 Travel Policy v2.4

lex 0 · vec 0 · hybrid 0 · final 0

Employees may claim mileage at the standard rate for approved business travel.

Rank movement

How each mode re-orders the evidence

Follow a source across the four modes. Rising lines gain authority under governed re-rank; falling lines lose it; blocked sources drop to the exclusion gutter.

Legend: green = gains rank under governed re-rank · amber = loses rank · rose = excluded by the Data handoff.

Side by side

Retrieval mode comparison

Mode	Top evidence	Strength	Risk	Latency	Cost
Lexical BM25	Expense Policy v3.1	Explainable exact matches	Misses semantic matches	120 ms	$0.009
Simulated vector retrieval	Expense Policy v3.1	Handles wording variation	May retrieve vague neighbors	180 ms	$0.011
Hybrid lexical + vector	Expense Policy v3.1	Balanced, strongest general option	Requires weight tuning	210 ms	$0.013
Hybrid + re-rank	Expense Policy v3.1	Governed top evidence (release candidate)	Higher latency	280 ms	$0.015
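The weight tuning listed as the hybrid risk can be pictured as a single blending parameter. The sketch below shows one common fusion approach, a weighted sum of min-max normalized lexical and vector scores. The function names and the `alpha` weight are assumptions for illustration; the page does not state which fusion formula or weights the demo uses. The input scores are the lex and vec values from the ranked evidence list.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so the two retrievers are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_scores(lexical: dict[str, float], vector: dict[str, float], alpha: float = 0.5) -> dict[str, float]:
    """Blend lexical and vector scores; alpha is the lexical weight (assumed)."""
    lex_n, vec_n = min_max(lexical), min_max(vector)
    doc_ids = set(lex_n) | set(vec_n)
    return {
        doc_id: alpha * lex_n.get(doc_id, 0.0) + (1 - alpha) * vec_n.get(doc_id, 0.0)
        for doc_id in doc_ids
    }


lexical = {"expense": 1.0, "pii": 0.6, "approval": 0.42, "travel": 0.0}
vector = {"expense": 1.0, "pii": 0.57, "approval": 0.43, "travel": 0.0}
fused = hybrid_scores(lexical, vector, alpha=0.5)
ranking = sorted(fused, key=fused.get, reverse=True)
```

Raising `alpha` favors exact term matches and lowering it favors wording variation, so the right value depends on the query mix and would need to be tuned against labeled queries.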

Trace comparison by retrieval mode

How mode changes the pipeline

Changing retrieval mode changes which evidence reaches the answer engine — and its citation quality, faithfulness, risk, latency, and cost.

Mode	Citations	Faithfulness	Hallucination	Quality	Latency	Cost
Lexical BM25	82%	84%	12%	78	120 ms	$0.009
Simulated vector retrieval	84%	85%	11%	82	180 ms	$0.011
Hybrid lexical + vector	88%	88%	9%	86	210 ms	$0.013
Hybrid + re-rank	93%	91%	6%	90	280 ms	$0.015

Production readiness

Vector index readiness

Missing

Check	Status
Embedding model selected	Partial
Vector store target	Missing
Similarity metric (cosine)	Ready
Metadata filters available	Missing
Access-control filters	Partial
Stable chunk IDs	Partial
Source versioning	Missing
Re-indexing strategy	Missing
Deletion / update strategy	Missing
Source exclusion rules	Ready
Hybrid search enabled	Ready
ANN index required	Not required

Recommendation: Not ready for vector indexing — complete the Data handoff first

Retrieval simulation boundary

What’s real vs modeled here

This portfolio demo runs locally and needs no hosted vector database. BM25 is a real lexical baseline. Vector and hybrid retrieval use deterministic local representations to demonstrate ranking tradeoffs and lifecycle handoffs. In production, the same retriever seam could be backed by OpenAI, MiniLM, Voyage, or Cohere embeddings over Pinecone, Weaviate, pgvector, Milvus, or Elasticsearch.
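A retriever seam of this kind might look like the following sketch. Everything in it is an assumption for illustration: the `Retriever` protocol, the `LocalVectorRetriever` class, and the hashed bag-of-words `embed` function are hypothetical names, not the demo's code. The hashing embedding is deterministic and runs locally, but it does not capture meaning the way a trained embedding model would; it only stands in for one so the interface can be exercised.

```python
import hashlib
import math
from typing import Protocol


class Retriever(Protocol):
    """The seam: any backend that can return (doc_id, score) pairs."""

    def search(self, query: str, k: int) -> list[tuple[str, float]]: ...


def embed(text: str, dims: int = 64) -> list[float]:
    # Deterministic stand-in for an embedding model: hash each token
    # into one of `dims` buckets and count occurrences.
    vec = [0.0] * dims
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dims] += 1.0
    return vec


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class LocalVectorRetriever:
    """In-memory retriever that satisfies the Retriever protocol."""

    def __init__(self, docs: dict[str, str]) -> None:
        self._vectors = {doc_id: embed(text) for doc_id, text in docs.items()}

    def search(self, query: str, k: int) -> list[tuple[str, float]]:
        q = embed(query)
        ranked = sorted(
            ((doc_id, cosine(q, v)) for doc_id, v in self._vectors.items()),
            key=lambda pair: pair[1],
            reverse=True,
        )
        return ranked[:k]


docs = {
    "expense": "Reimbursement requests must be submitted within 30 days of the expense date.",
    "travel": "Employees may claim mileage at the standard rate for approved business travel.",
}
retriever: Retriever = LocalVectorRetriever(docs)
top = retriever.search(docs["expense"], k=2)
```

A production backend would implement the same `search` signature over a hosted embedding model and vector store, so callers would not need to change when the backend does.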

What this retrieval layer demonstrates

Lexical baseline

BM25 as an explainable floor — and a clear view of where it fails.

Semantic retrieval

Local vector similarity handles wording variation the baseline misses.

Hybrid search

Lexical + vector fusion balances precision and recall.

Governance-aware re-rank

Authority, freshness, metadata, citations, and Data-handoff exclusions reorder evidence.

Traceable quality impact

Every mode shows its effect on citation quality, faithfulness, risk, latency, and cost.
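The governance-aware re-rank described in the list above can be sketched as a filter followed by a score adjustment. This is a minimal illustration, not the demo's implementation: the `Evidence` fields, the weights, and the authority and freshness values are all made up for the example. Only the source names and hybrid scores come from the ranked evidence list.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    source: str
    hybrid: float            # fused lexical + vector score
    authority: float         # 0..1, hypothetical governance signal
    freshness: float         # 0..1, hypothetical governance signal
    has_citation_meta: bool  # whether the source can be cited
    excluded: bool           # blocked by the Data handoff


def governed_rerank(
    items: list[Evidence],
    w_authority: float = 0.2,
    w_freshness: float = 0.1,
    citation_penalty: float = 0.3,
) -> list[tuple[str, float]]:
    """Drop excluded sources, then adjust scores with governance signals."""
    kept = [e for e in items if not e.excluded]

    def final(e: Evidence) -> float:
        score = e.hybrid + w_authority * e.authority + w_freshness * e.freshness
        if not e.has_citation_meta:
            score -= citation_penalty  # uncitable evidence is demoted
        return round(score, 3)

    return sorted(((e.source, final(e)) for e in kept), key=lambda p: p[1], reverse=True)


# Hybrid scores are from this page; the other values are illustrative only.
candidates = [
    Evidence("Expense Policy v3.1", 1.0, 1.0, 1.0, True, False),
    Evidence("Raw customer PII export", 0.59, 0.0, 0.5, False, True),
    Evidence("Approval Matrix v1.2", 0.42, 0.8, 0.6, True, False),
    Evidence("Travel Policy v2.4", 0.0, 0.8, 0.7, True, False),
]
reranked = governed_rerank(candidates)
```

Treating exclusion as a hard filter rather than a large penalty means a blocked source can never reach the answer engine, however well it matches the query.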

Metric	Value	Target
Precision@5	0.83	0.80
Recall@5	0.86	0.85
MRR	0.82	0.78
NDCG	0.85	0.82
Top-K Success Rate	88%	90%
Empty Retrieval Rate	2.4%	3%
Context Utilization	71%	70%
Reranker Lift	7.4%	5%
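For readers less familiar with the ranking metrics above, the following sketch shows the standard textbook definitions with binary relevance. The function names and the toy query are assumptions for illustration, and the page does not state the NDCG cutoff, so the sketch takes `k` as a parameter.

```python
import math


def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents found in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result; MRR is the mean over queries."""
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Discounted gain of the ranking divided by that of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, d in enumerate(ranked[:k], start=1)
        if d in relevant
    )
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


# Toy query: three relevant documents, two of them retrieved in the top 5.
ranked = ["a", "x", "b", "y", "z"]
relevant = {"a", "b", "c"}
p5 = precision_at_k(ranked, relevant, 5)   # 2 of 5 results are relevant
r5 = recall_at_k(ranked, relevant, 5)      # 2 of 3 relevant docs were found
rr = reciprocal_rank(ranked, relevant)     # first relevant result is at rank 1
ndcg5 = ndcg_at_k(ranked, relevant, 5)
```

Dashboard figures such as these are averages of the per-query values over an evaluation set.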

Retrieval Strategy Comparison

Precision, recall, and ranking quality across six retrieval strategies.

Strategy Experiments

Full metric breakdown with latency and cost tradeoffs.

Strategy	P@5	R@5	MRR	NDCG	Retrieval	Faithfulness	Latency	Cost	Recommendation
Semantic search only	0.62	0.68	0.61	0.66	68	72	520 ms	$0.021	Baseline. Misses keyword-heavy policy lookups.
Keyword search only	0.58	0.60	0.55	0.59	60	66	240 ms	$0.012	Fast and cheap but weak on paraphrased questions.
Hybrid search	0.74	0.79	0.73	0.77	81	79	610 ms	$0.029	Strong balance. Adopted as the retrieval baseline.
Hybrid + reranking	0.83	0.86	0.82	0.85	87	84	1320 ms	$0.038	Best quality. Adds ~700 ms; pushes P95 latency over SLA.
Query rewriting + hybrid	0.79	0.84	0.78	0.82	84	82	780 ms	$0.031	Improves ambiguous and multi-hop recall at moderate cost.
Metadata-filtered retrieval	0.85	0.82	0.83	0.84	86	85	700 ms	$0.030	Best for high-risk policy lookups; filters out stale versions.
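The cost of adding the reranker can be read straight off the table. The short calculation below compares the Hybrid search and Hybrid + reranking rows. The helper name is an assumption, and the page does not say how Reranker Lift is computed; the relative gain in the Retrieval column happens to match the 7.4% reported for that metric, which is one plausible reading.

```python
def relative_lift(before: float, after: float) -> float:
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


# Values copied from the Hybrid search and Hybrid + reranking rows.
hybrid = {"retrieval": 81, "latency_ms": 610, "cost_usd": 0.029}
reranked = {"retrieval": 87, "latency_ms": 1320, "cost_usd": 0.038}

retrieval_lift = relative_lift(hybrid["retrieval"], reranked["retrieval"])
added_latency_ms = reranked["latency_ms"] - hybrid["latency_ms"]
cost_increase = relative_lift(hybrid["cost_usd"], reranked["cost_usd"])

print(f"retrieval lift: {retrieval_lift:.1f}%")   # 7.4%
print(f"added latency: {added_latency_ms} ms")    # 710 ms
print(f"cost increase: {cost_increase:.0f}%")
```

The 710 ms difference is the "~700 ms" noted in the recommendation column.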

Chunking Experiments

Chunk size and strategy sweep with hybrid + reranking held constant.

Chunking	Size	Overlap	P@5	R@5	NDCG	Retrieval	Latency	Cost	Recommendation
Fixed	300	50	0.86	0.78	0.83	84	1180 ms	$0.034	High precision but misses full context on multi-part answers.
Fixed	500	100	0.83	0.86	0.85	87	1320 ms	$0.038	Best overall balance. Current production setting.
Fixed	800	150	0.76	0.88	0.83	84	1480 ms	$0.044	Better completeness but lower precision and higher cost.
Section-based	section	n/a	0.85	0.84	0.86	86	1260 ms	$0.037	Strong for structured policy docs; preserves clause boundaries.
Semantic	variable	n/a	0.84	0.85	0.85	86	1390 ms	$0.041	Comparable quality; higher indexing complexity.
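The fixed rows in the table above correspond to sliding-window chunking. The sketch below shows one straightforward way to do it. The function name is an assumption, and so is the unit: the table does not say whether Size and Overlap count tokens, words, or characters, so the sketch works on any pre-split list of tokens.

```python
def chunk_fixed(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# A 1200-token document with the 500 / 100 setting from the table.
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_fixed(tokens, size=500, overlap=100)
sizes = [len(c) for c in chunks]  # [500, 500, 400]
```

Larger windows keep multi-part answers together at the price of more off-topic text per chunk, which is consistent with the precision and recall trend across the three fixed rows.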