Build · RAG

Retrieval substrate

Retrieval mode

BM25 stays the explainable baseline. Vector, hybrid, and governed re-rank build on the same retriever seam.

Lexical BM25 baseline: term-overlap ranking. It is strong when queries share important terms with the source text, which makes it fast, explainable, and a useful floor, but it misses semantically similar evidence when wording differs.

Query: How many days do I have to submit a reimbursement request?
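The term-overlap ranking described above can be sketched in a few lines. This is a minimal sketch of the standard Okapi BM25 formula, not the demo's actual code: the parameters (k1 = 1.5, b = 0.75), the regex tokenizer, and the function names are assumptions, and there is no stemming or stopword removal. The corpus is the four evidence snippets from the ranked evidence list.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split on non-alphanumerics; no stemming (an assumption).
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "Reimbursement requests must be submitted within 30 days of the expense date.",
    "Personal credit-card statements attached to claims may contain sensitive cardholder data.",
    "Manager approval is required for any expense over $500 before reimbursement.",
    "Employees may claim mileage at the standard rate for approved business travel.",
]
query = "How many days do I have to submit a reimbursement request?"

scores = bm25_scores(query, docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
for rank, i in enumerate(ranking, start=1):
    print(rank, round(scores[i], 3), docs[i][:40])
```

On this toy corpus the sketch ranks the snippets in the same order as the displayed list, although its raw scores are unnormalized and are not expected to equal the displayed values. It also shows two lexical weaknesses: "submit" does not match "submitted" without stemming, and the credit-card snippet scores above zero only because it shares the word "to" with the query.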

Lexical BM25

Ranked evidence

Top evidence for the selected retrieval mode.

#1 Expense Policy v3.1
lex 1 · vec 1 · hybrid 1 · final 1

Reimbursement requests must be submitted within 30 days of the expense date.

#2 Raw customer PII export (no citation meta)
lex 0.6 · vec 0.57 · hybrid 0.59 · final 0.6

Personal credit-card statements attached to claims may contain sensitive cardholder data.

#3 Approval Matrix v1.2
lex 0.42 · vec 0.43 · hybrid 0.42 · final 0.42

Manager approval is required for any expense over $500 before reimbursement.

#4 Travel Policy v2.4
lex 0 · vec 0 · hybrid 0 · final 0

Employees may claim mileage at the standard rate for approved business travel.
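The page does not state how the hybrid score is derived from the lexical and vector scores. As an illustration only, the sketch below uses weighted linear fusion with an assumed lexical weight of 0.6; that weight is a guess, chosen because it happens to reproduce the displayed hybrid values after rounding to two decimals. Other schemes, such as reciprocal rank fusion, are equally plausible.

```python
# (lex, vec) scores as displayed in the ranked evidence list.
evidence = {
    "Expense Policy v3.1": (1.0, 1.0),
    "Raw customer PII export": (0.6, 0.57),
    "Approval Matrix v1.2": (0.42, 0.43),
    "Travel Policy v2.4": (0.0, 0.0),
}

def fuse(lex, vec, lex_weight=0.6):
    # Weighted linear fusion; lex_weight is an assumed tuning knob.
    return lex_weight * lex + (1 - lex_weight) * vec

hybrid = {name: round(fuse(lex, vec), 2) for name, (lex, vec) in evidence.items()}
for name, score in hybrid.items():
    print(f"{name}: {score}")
```

Changing `lex_weight` is the "weight tuning" cost of hybrid search: a higher value favors exact term matches, a lower value favors wording variation.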

Rank movement

How each mode re-orders the evidence

Follow a source across the four modes. Rising lines gain authority under governed re-rank; falling lines lose it; blocked sources drop to the exclusion gutter.

[Chart: rank position (#1 to #4, cut, excl.) of each source across the four modes: Lexical BM25, Simulated vector retrieval, Hybrid lexical + vector, Hybrid + re-rank. Sources shown: Expense Policy v3.1, Raw customer PII export, Approval Matrix v1.2, Travel Policy v2.4, Expense Policy v1.0 (archived).]

Legend: green = gains rank under governed re-rank · amber = loses rank · rose = excluded by the Data handoff.
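A governed re-rank of this kind can be sketched as a filter followed by a score adjustment. Everything below is hypothetical: the `Candidate` fields, the multipliers, the metadata values, the hybrid score for the archived policy, and the choice of which source is excluded are assumptions for illustration, not values taken from the demo.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    source: str
    hybrid: float            # fused retrieval score
    authority: float         # 0..1, hypothetical governance signal
    fresh: bool              # False for archived or superseded versions
    has_citation_meta: bool
    excluded: bool           # True when blocked by an upstream exclusion rule

def governed_rerank(candidates):
    kept, gutter = [], []
    for c in candidates:
        if c.excluded:
            # Blocked sources never reach the answer engine.
            gutter.append(c.source)
            continue
        score = c.hybrid * (0.5 + 0.5 * c.authority)
        if not c.fresh:
            score *= 0.5     # assumed penalty for stale versions
        if not c.has_citation_meta:
            score *= 0.8     # assumed penalty for uncitable evidence
        kept.append((round(score, 3), c.source))
    kept.sort(reverse=True)
    return kept, gutter

candidates = [
    Candidate("Expense Policy v3.1", 1.0, 1.0, True, True, False),
    Candidate("Raw customer PII export", 0.59, 0.2, True, False, True),
    Candidate("Approval Matrix v1.2", 0.42, 0.9, True, True, False),
    Candidate("Travel Policy v2.4", 0.0, 0.8, True, True, False),
    Candidate("Expense Policy v1.0 (archived)", 0.8, 0.6, False, True, False),
]

ranked, gutter = governed_rerank(candidates)
for score, source in ranked:
    print(score, source)
print("excluded:", gutter)
```

The design point is the ordering of operations: exclusions are applied as a hard filter before any scoring, so a blocked source cannot re-enter the ranking no matter how relevant it looks.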

Side by side

Retrieval mode comparison

| Mode | Top evidence | Strength | Risk | Latency | Cost |
|---|---|---|---|---|---|
| Lexical BM25 | Expense Policy v3.1 | Explainable exact matches | Misses semantic matches | 120 ms | $0.009 |
| Simulated vector retrieval | Expense Policy v3.1 | Handles wording variation | May retrieve vague neighbors | 180 ms | $0.011 |
| Hybrid lexical + vector | Expense Policy v3.1 | Balanced, strongest general option | Requires weight tuning | 210 ms | $0.013 |
| Hybrid + re-rank | Expense Policy v3.1 | Governed top evidence (release candidate) | Higher latency | 280 ms | $0.015 |

Trace comparison by retrieval mode

How mode changes the pipeline

Changing retrieval mode changes which evidence reaches the answer engine — and its citation quality, faithfulness, risk, latency, and cost.

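| Mode | Citations | Faithfulness | Hallucination | Quality | Latency | Cost |
|---|---|---|---|---|---|---|
| Lexical BM25 | 82% | 84% | 12% | 78 | 120 ms | $0.009 |
| Simulated vector retrieval | 84% | 85% | 11% | 82 | 180 ms | $0.011 |
| Hybrid lexical + vector | 88% | 88% | 9% | 86 | 210 ms | $0.013 |
| Hybrid + re-rank | 93% | 91% | 6% | 90 | 280 ms | $0.015 |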

Production readiness

Vector index readiness

Status: Missing

Embedding model selected: Partial
Vector store target: Missing
Similarity metric (cosine): Ready
Metadata filters available: Missing
Access-control filters: Partial
Stable chunk IDs: Partial
Source versioning: Missing
Re-indexing strategy: Missing
Deletion / update strategy: Missing
Source exclusion rules: Ready
Hybrid search enabled: Ready
ANN index required: Not required

Recommendation: Not ready for vector indexing — complete the Data handoff first

Retrieval simulation boundary

What’s real vs modeled here

This portfolio demo runs locally and needs no hosted vector database. BM25 is a real lexical baseline. Vector and hybrid retrieval use deterministic local representations to demonstrate ranking tradeoffs and lifecycle handoffs. In production, the same retriever seam could be backed by OpenAI, MiniLM, Voyage, or Cohere embeddings over Pinecone, Weaviate, pgvector, Milvus, or Elasticsearch.
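One way to picture the retriever seam is as a small interface that any backend can satisfy. The sketch below is an assumption about the shape of that seam, not the demo's code: `Retriever`, `embed`, `LocalVectorRetriever`, and `answer_context` are hypothetical names, and the "embedding" is a deterministic hashed bag-of-words vector standing in for a real embedding model.

```python
import hashlib
import math
import re
from typing import Protocol

class Retriever(Protocol):
    # The seam: anything with this method can back the pipeline.
    def search(self, query: str, k: int) -> list[tuple[str, float]]: ...

def embed(text: str, dim: int = 64) -> list[float]:
    # Deterministic stand-in for an embedding model: hash each token to a bucket.
    vec = [0.0] * dim
    for tok in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.sha256(tok.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

class LocalVectorRetriever:
    def __init__(self, docs: dict[str, str]):
        self.index = {name: embed(text) for name, text in docs.items()}

    def search(self, query: str, k: int) -> list[tuple[str, float]]:
        q = embed(query)
        # Vectors are unit length, so the dot product is cosine similarity.
        scored = [
            (name, sum(a * b for a, b in zip(q, v)))
            for name, v in self.index.items()
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]

def answer_context(retriever: Retriever, query: str, k: int = 2) -> list[str]:
    # Downstream code depends only on the Retriever protocol.
    return [name for name, _ in retriever.search(query, k)]

docs = {
    "Expense Policy v3.1": "Reimbursement requests must be submitted within 30 days of the expense date.",
    "Approval Matrix v1.2": "Manager approval is required for any expense over $500 before reimbursement.",
    "Travel Policy v2.4": "Employees may claim mileage at the standard rate for approved business travel.",
}
retriever = LocalVectorRetriever(docs)
print(answer_context(retriever, "reimbursement deadline in days"))
```

Swapping in a hosted backend would mean writing another class with the same `search` signature; `answer_context` and everything after it would stay unchanged.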

What this retrieval layer demonstrates

Lexical baseline

BM25 as an explainable floor — and a clear view of where it fails.

Semantic retrieval

Local vector similarity handles wording variation the baseline misses.

Hybrid search

Lexical + vector fusion balances precision and recall.

Governance-aware re-rank

Authority, freshness, metadata, citations, and Data-handoff exclusions reorder evidence.

Traceable quality impact

Every mode shows its effect on citation quality, faithfulness, risk, latency, and cost.

| Metric | Value | Target |
|---|---|---|
| Precision@5 | 0.83 | 0.80 |
| Recall@5 | 0.86 | 0.85 |
| MRR | 0.82 | 0.78 |
| NDCG | 0.85 | 0.82 |
| Top-K Success Rate | 88% | 90% |
| Empty Retrieval Rate | 2.4% | 3% |
| Context Utilization | 71% | 70% |
| Reranker Lift | 7.4% | 5% |
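For readers unfamiliar with the ranking metrics above, the sketch below computes them for a single query using the usual textbook definitions with binary relevance. The page does not say how the demo computes its figures, so the function names, the toy ranked list, and the relevance set are assumptions. MRR is the mean of the reciprocal rank over all evaluation queries.

```python
import math

def precision_at_k(ranked, relevant, k):
    # Fraction of the top k results that are relevant.
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k):
    # Fraction of all relevant documents found in the top k.
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    # 1 / position of the first relevant result, or 0 if none is found.
    for position, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1 / position
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    # Binary-gain DCG divided by the DCG of an ideal ordering.
    dcg = sum(
        1 / math.log2(position + 1)
        for position, d in enumerate(ranked[:k], start=1)
        if d in relevant
    )
    ideal_hits = min(len(relevant), k)
    ideal = sum(1 / math.log2(position + 1) for position in range(1, ideal_hits + 1))
    return dcg / ideal if ideal else 0.0

# Toy example: hypothetical document IDs.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}

p5 = precision_at_k(ranked, relevant, 5)
r5 = recall_at_k(ranked, relevant, 5)
rr = reciprocal_rank(ranked, relevant)
ndcg5 = ndcg_at_k(ranked, relevant, 5)
print(p5, round(r5, 3), rr, round(ndcg5, 3))
```

In this toy case two of the three relevant documents are retrieved, at positions 2 and 4, so precision is 2/5, recall is 2/3, and the reciprocal rank is 1/2.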

Retrieval Strategy Comparison

Precision, recall, and ranking quality across six retrieval strategies.

Strategy Experiments

Full metric breakdown with latency and cost tradeoffs.

| Strategy | P@5 | R@5 | MRR | NDCG | Retrieval | Faithfulness | Latency | Cost | Recommendation |
|---|---|---|---|---|---|---|---|---|---|
| Semantic search only | 0.62 | 0.68 | 0.61 | 0.66 | 68 | 72 | 520 ms | $0.021 | Baseline. Misses keyword-heavy policy lookups. |
| Keyword search only | 0.58 | 0.60 | 0.55 | 0.59 | 60 | 66 | 240 ms | $0.012 | Fast and cheap but weak on paraphrased questions. |
| Hybrid search | 0.74 | 0.79 | 0.73 | 0.77 | 81 | 79 | 610 ms | $0.029 | Strong balance. Adopted as the retrieval baseline. |
| Hybrid + reranking | 0.83 | 0.86 | 0.82 | 0.85 | 87 | 84 | 1320 ms | $0.038 | Best quality. Adds ~700 ms; pushes P95 latency over SLA. |
| Query rewriting + hybrid | 0.79 | 0.84 | 0.78 | 0.82 | 84 | 82 | 780 ms | $0.031 | Improves ambiguous and multi-hop recall at moderate cost. |
| Metadata-filtered retrieval | 0.85 | 0.82 | 0.83 | 0.84 | 86 | 85 | 700 ms | $0.030 | Best for high-risk policy lookups; filters out stale versions. |
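The reranking tradeoff can be checked with simple arithmetic on the Hybrid search and Hybrid + reranking rows. The page does not define Reranker Lift; the sketch assumes it is the relative gain in the Retrieval score, which happens to give 7.4% and so agrees with the Reranker Lift figure reported above. Treat that reading as an inference rather than the demo's stated definition.

```python
# Values copied from the Hybrid search and Hybrid + reranking rows.
hybrid = {"retrieval": 81, "latency_ms": 610, "cost_usd": 0.029}
reranked = {"retrieval": 87, "latency_ms": 1320, "cost_usd": 0.038}

# Assumed definition: relative improvement in the Retrieval score.
lift_pct = (reranked["retrieval"] - hybrid["retrieval"]) / hybrid["retrieval"] * 100
added_latency_ms = reranked["latency_ms"] - hybrid["latency_ms"]
added_cost_usd = reranked["cost_usd"] - hybrid["cost_usd"]

print(f"Retrieval lift: {lift_pct:.1f}%")       # 7.4%
print(f"Added latency: {added_latency_ms} ms")  # 710 ms
print(f"Added cost: ${added_cost_usd:.3f}")     # $0.009
```

The 710 ms difference is the "~700 ms" cited in the recommendation column, and it more than doubles the latency of plain hybrid search.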

Chunking Experiments

Chunk size and strategy sweep with hybrid + reranking held constant.

| Chunking | Size | Overlap | P@5 | R@5 | NDCG | Retrieval | Latency | Cost | Recommendation |
|---|---|---|---|---|---|---|---|---|---|
| Fixed | 300 | 50 | 0.86 | 0.78 | 0.83 | 84 | 1180 ms | $0.034 | High precision but misses full context on multi-part answers. |
| Fixed | 500 | 100 | 0.83 | 0.86 | 0.85 | 87 | 1320 ms | $0.038 | Best overall balance. Current production setting. |
| Fixed | 800 | 150 | 0.76 | 0.88 | 0.83 | 84 | 1480 ms | $0.044 | Better completeness but lower precision and higher cost. |
| Section-based | section | n/a | 0.85 | 0.84 | 0.86 | 86 | 1260 ms | $0.037 | Strong for structured policy docs; preserves clause boundaries. |
| Semantic | variable | n/a | 0.84 | 0.85 | 0.85 | 86 | 1390 ms | $0.041 | Comparable quality; higher indexing complexity. |
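The fixed-size rows can be illustrated with a sliding window. The table does not give the unit for Size and Overlap, so this sketch assumes both are token counts and uses a plain list of placeholder tokens; `fixed_chunks` is a hypothetical helper, not the demo's chunker.

```python
def fixed_chunks(tokens, size=500, overlap=100):
    # Slide a window of `size` tokens, stepping by `size - overlap`.
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

# Placeholder document of 1200 tokens.
tokens = [f"t{i}" for i in range(1200)]
chunks = fixed_chunks(tokens, size=500, overlap=100)
print([len(c) for c in chunks])  # [500, 500, 400]
```

Each chunk repeats the last 100 tokens of the previous one, which is what lets an answer that straddles a boundary still appear whole in at least one chunk. Larger sizes raise recall at the cost of precision and per-query cost, which matches the trend in the three Fixed rows.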