Retrieval substrate
Retrieval mode
BM25 stays the explainable baseline. Vector, hybrid, and governed re-rank build on the same retriever seam.
Term-overlap ranking. Fast and explainable, but misses semantically similar evidence when wording differs.
Lexical BM25
Ranked evidence
Top evidence for the selected retrieval mode.
Reimbursement requests must be submitted within 30 days of the expense date.
Personal credit-card statements attached to claims may contain sensitive cardholder data.
Manager approval is required for any expense over $500 before reimbursement.
Employees may claim mileage at the standard rate for approved business travel.
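To make the term-overlap behavior concrete, here is a minimal sketch of one common BM25 variant scored over the four evidence lines above. The function names, the tokenizer, and the `k1`/`b` defaults are assumptions for illustration, not the demo's actual implementation.

```python
import math
import re
from collections import Counter

# The four evidence lines shown above, used as a tiny corpus.
DOCS = [
    "Reimbursement requests must be submitted within 30 days of the expense date.",
    "Personal credit-card statements attached to claims may contain sensitive cardholder data.",
    "Manager approval is required for any expense over $500 before reimbursement.",
    "Employees may claim mileage at the standard rate for approved business travel.",
]

def tokenize(text):
    # Hypothetical tokenizer: lowercase alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document (k1 and b are assumed defaults)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # no overlap, no contribution
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

exact = bm25_scores("manager approval", DOCS)
paraphrase = bm25_scores("supervisor sign-off", DOCS)
```

With this corpus, the query "manager approval" scores only the manager-approval line above zero, while the paraphrase "supervisor sign-off" shares no terms with any line and scores zero everywhere. That is the wording-mismatch weakness described above.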
Rank movement
How each mode re-orders the evidence
Follow a source across the four modes. Rising lines gain authority under governed re-rank; falling lines lose it; blocked sources drop to the exclusion gutter.
Click a line to isolate it. Green = gains rank under governed re-rank · amber = loses rank · rose = excluded by the Data handoff.
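The color legend can be read as a simple classification of each source's rank change. A minimal sketch follows, assuming rankings are ordered lists of source ids and that blocked sources are supplied as a set; `classify_movement` and the sample ids are hypothetical.

```python
def classify_movement(baseline, reranked, blocked=frozenset()):
    """Label each baseline source by how its rank changes under re-rank.

    Rank 1 is the top position. A source missing from the re-ranked list
    is treated as excluded (an assumption of this sketch).
    """
    base_rank = {s: i for i, s in enumerate(baseline, start=1)}
    new_rank = {s: i for i, s in enumerate(reranked, start=1)}
    movement = {}
    for source, old in base_rank.items():
        if source in blocked or source not in new_rank:
            movement[source] = "excluded"   # rose
        elif new_rank[source] < old:
            movement[source] = "gains"      # green
        elif new_rank[source] > old:
            movement[source] = "loses"      # amber
        else:
            movement[source] = "unchanged"
    return movement

# Hypothetical source ids, for illustration only.
result = classify_movement(
    baseline=["a", "b", "c", "d"],
    reranked=["b", "a", "c"],
    blocked={"d"},
)
```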
Side by side
Retrieval mode comparison
| Mode | Top evidence | Strength | Risk | Latency | Cost |
|---|---|---|---|---|---|
| Lexical BM25 | Expense Policy v3.1 | Explainable exact matches | Misses semantic matches | 120 ms | $0.009 |
| Simulated vector retrieval | Expense Policy v3.1 | Handles wording variation | May retrieve vague neighbors | 180 ms | $0.011 |
| Hybrid lexical + vector | Expense Policy v3.1 | Balanced, strongest general option | Requires weight tuning | 210 ms | $0.013 |
| Hybrid + re-rank | Expense Policy v3.1 | Governed top evidence (release candidate) | Higher latency | 280 ms | $0.015 |
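The "requires weight tuning" risk in the hybrid row can be illustrated with a weighted score fusion. This is a sketch under assumptions: min-max normalization and a single `alpha` weight are one common way to combine lexical and vector scores, not necessarily what this demo does, and all names and scores are hypothetical.

```python
def min_max(scores):
    """Scale a {doc_id: score} mapping into the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def hybrid_scores(lexical, vector, alpha=0.5):
    """Blend normalized lexical and vector scores.

    alpha=1.0 is purely lexical, alpha=0.0 is purely vector.
    A document missing from one side contributes 0.0 for that side.
    """
    lex, vec = min_max(lexical), min_max(vector)
    ids = set(lex) | set(vec)
    return {i: alpha * lex.get(i, 0.0) + (1 - alpha) * vec.get(i, 0.0) for i in ids}

# Made-up scores for three documents.
lexical = {"a": 2.0, "b": 1.0, "c": 0.0}
vector = {"a": 0.1, "b": 0.9, "c": 0.5}

blended = hybrid_scores(lexical, vector, alpha=0.5)
top = max(blended, key=blended.get)
```

In this toy data the top document is "a" when `alpha=1.0` and "b" when `alpha` is 0.5 or 0.0, which shows why the weight has to be tuned against a labeled query set instead of guessed.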
Trace comparison by retrieval mode
How mode changes the pipeline
Changing the retrieval mode changes which evidence reaches the answer engine, and with it citation quality, faithfulness, hallucination risk, latency, and cost.
| Mode | Citations | Faithfulness | Hallucination | Quality | Latency | Cost |
|---|---|---|---|---|---|---|
| Lexical BM25 | 82% | 84% | 12% | 78 | 120 ms | $0.009 |
| Simulated vector retrieval | 84% | 85% | 11% | 82 | 180 ms | $0.011 |
| Hybrid lexical + vector | 88% | 88% | 9% | 86 | 210 ms | $0.013 |
| Hybrid + re-rank | 93% | 91% | 6% | 90 | 280 ms | $0.015 |
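The tradeoff in the table is easier to see as deltas against the BM25 baseline. The sketch below copies the Quality, Latency, and Cost columns from the table and subtracts the baseline row; the dictionary layout and function name are assumptions.

```python
# Quality, latency, and cost per mode, copied from the table above.
ROWS = {
    "Lexical BM25": {"quality": 78, "latency_ms": 120, "cost_usd": 0.009},
    "Simulated vector retrieval": {"quality": 82, "latency_ms": 180, "cost_usd": 0.011},
    "Hybrid lexical + vector": {"quality": 86, "latency_ms": 210, "cost_usd": 0.013},
    "Hybrid + re-rank": {"quality": 90, "latency_ms": 280, "cost_usd": 0.015},
}

def delta_vs_baseline(rows, baseline="Lexical BM25"):
    """Return each mode's difference from the baseline row, per metric."""
    base = rows[baseline]
    return {
        mode: {k: round(values[k] - base[k], 3) for k in base}
        for mode, values in rows.items()
        if mode != baseline
    }

deltas = delta_vs_baseline(ROWS)
```

Read this way, Hybrid + re-rank buys 12 quality points over BM25 for an extra 160 ms and $0.006 per query in this table.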
Production readiness
Vector index readiness
Recommendation: Not ready for vector indexing. Complete the Data handoff first.
Retrieval simulation boundary
What’s real vs modeled here
This portfolio demo runs locally and needs no hosted vector database. BM25 is a real lexical baseline. Vector and hybrid retrieval use deterministic local representations to demonstrate ranking tradeoffs and lifecycle handoffs. In production, the same retriever seam could be backed by OpenAI, MiniLM, Voyage, or Cohere embeddings over Pinecone, Weaviate, pgvector, Milvus, or Elasticsearch.
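One way to picture the retriever seam is a small interface that every mode implements, so a backend can be swapped without touching callers. The sketch below is an assumption about shape only: `Evidence`, `Retriever`, `TermOverlapRetriever`, and `GovernedFilter` are hypothetical names, and the overlap scorer is a stand-in, not BM25 or a real embedding backend.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol

@dataclass(frozen=True)
class Evidence:
    source_id: str
    text: str
    score: float

class Retriever(Protocol):
    """The seam: anything with this method can serve as a retrieval mode."""
    def retrieve(self, query: str, k: int) -> list[Evidence]: ...

class TermOverlapRetriever:
    """Stand-in lexical retriever that counts shared whitespace tokens."""
    def __init__(self, corpus: dict[str, str]) -> None:
        self.corpus = corpus

    def retrieve(self, query: str, k: int) -> list[Evidence]:
        q = set(query.lower().split())
        scored = [
            Evidence(sid, text, float(len(q & set(text.lower().split()))))
            for sid, text in self.corpus.items()
        ]
        scored.sort(key=lambda e: e.score, reverse=True)
        return scored[:k]

class GovernedFilter:
    """Wraps any Retriever and drops sources on a block list."""
    def __init__(self, inner: Retriever, blocked: set[str]) -> None:
        self.inner = inner
        self.blocked = blocked

    def retrieve(self, query: str, k: int) -> list[Evidence]:
        # Over-fetch so that k results can remain after filtering.
        candidates = self.inner.retrieve(query, k + len(self.blocked))
        return [e for e in candidates if e.source_id not in self.blocked][:k]

# Hypothetical three-document corpus.
corpus = {
    "policy": "manager approval required over 500",
    "card": "credit card statements contain sensitive data",
    "mileage": "mileage claims at standard rate",
}
base = TermOverlapRetriever(corpus)
governed = GovernedFilter(base, blocked={"card"})
results = governed.retrieve("credit card approval", k=2)
```

Because `GovernedFilter` depends only on the `Retriever` protocol, the same wrapper would apply to a hosted vector backend in place of the stand-in.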
What this retrieval layer demonstrates
Lexical baseline
Semantic retrieval
Hybrid search
Governance-aware re-rank
Traceable quality impact
0.83 (target 0.80)
0.86 (target 0.85)
0.82 (target 0.78)
0.85 (target 0.82)
88% (target 90%)
2.4% (target 3%)
71% (target 70%)
7.4% (target 5%)
Retrieval Strategy Comparison
Precision, recall, and ranking quality across six retrieval strategies.
Strategy Experiments
Full metric breakdown with latency and cost tradeoffs.
| Strategy | P@5 | R@5 | MRR | NDCG | Retrieval | Faithfulness | Latency | Cost | Recommendation |
|---|---|---|---|---|---|---|---|---|---|
| Semantic search only | 0.62 | 0.68 | 0.61 | 0.66 | 68 | 72 | 520ms | $0.021 | Baseline. Misses keyword-heavy policy lookups. |
| Keyword search only | 0.58 | 0.60 | 0.55 | 0.59 | 60 | 66 | 240ms | $0.012 | Fast and cheap but weak on paraphrased questions. |
| Hybrid search | 0.74 | 0.79 | 0.73 | 0.77 | 81 | 79 | 610ms | $0.029 | Strong balance. Adopted as the retrieval baseline. |
| Hybrid + reranking | 0.83 | 0.86 | 0.82 | 0.85 | 87 | 84 | 1320ms | $0.038 | Best quality. Adds ~700ms; pushes P95 latency over SLA. |
| Query rewriting + hybrid | 0.79 | 0.84 | 0.78 | 0.82 | 84 | 82 | 780ms | $0.031 | Improves ambiguous and multi-hop recall at moderate cost. |
| Metadata-filtered retrieval | 0.85 | 0.82 | 0.83 | 0.84 | 86 | 85 | 700ms | $0.030 | Best for high-risk policy lookups; filters out stale versions. |
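For readers unfamiliar with the column headers, here is a minimal sketch of how P@5, R@5, reciprocal rank, and NDCG are typically computed for one query with binary relevance labels. The table's numbers are presumably averages over many queries (MRR is the mean of the per-query reciprocal ranks); the function names and sample ids are assumptions.

```python
import math

def precision_at_k(ranked, relevant, k=5):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k=5):
    """Fraction of all relevant documents found in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant result, or 0.0 if none appears."""
    for position, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / position
    return 0.0

def ndcg_at_k(ranked, relevant, k=5):
    """Binary-relevance NDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(position + 1)
        for position, d in enumerate(ranked[:k], start=1)
        if d in relevant
    )
    ideal = sum(
        1.0 / math.log2(position + 1)
        for position in range(1, min(len(relevant), k) + 1)
    )
    return dcg / ideal if ideal else 0.0

# Hypothetical single query: five results, three relevant documents overall.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}
```

For this sample, precision at 5 is 2/5, recall at 5 is 2/3, and the reciprocal rank is 1/2 because the first relevant result sits in position 2.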
Chunking Experiments
Chunk size and strategy sweep with hybrid + reranking held constant.
| Chunking | Size | Overlap | P@5 | R@5 | NDCG | Retrieval | Latency | Cost | Recommendation |
|---|---|---|---|---|---|---|---|---|---|
| Fixed | 300 | 50 | 0.86 | 0.78 | 0.83 | 84 | 1180ms | $0.034 | High precision but misses full context on multi-part answers. |
| Fixed | 500 | 100 | 0.83 | 0.86 | 0.85 | 87 | 1320ms | $0.038 | Best overall balance. Current production setting. |
| Fixed | 800 | 150 | 0.76 | 0.88 | 0.83 | 84 | 1480ms | $0.044 | Better completeness but lower precision and higher cost. |
| Section-based | section | n/a | 0.85 | 0.84 | 0.86 | 86 | 1260ms | $0.037 | Strong for structured policy docs; preserves clause boundaries. |
| Semantic | variable | n/a | 0.84 | 0.85 | 0.85 | 86 | 1390ms | $0.041 | Comparable quality; higher indexing complexity. |
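The Fixed rows correspond to a sliding window over the document. A minimal sketch follows, assuming Size and Overlap are counted in tokens (the table does not state the unit) and using a plain list as the token sequence; `fixed_chunks` is a hypothetical name.

```python
def fixed_chunks(tokens, size=500, overlap=100):
    """Split a token list into windows of `size` that share `overlap` tokens.

    Consecutive windows start `size - overlap` tokens apart. The last
    window may be shorter than `size`.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

# A 1,200-token document under the 500/100 setting from the table.
tokens = list(range(1200))
chunks = fixed_chunks(tokens, size=500, overlap=100)
```

Under these assumptions the 1,200-token document yields three chunks starting at tokens 0, 400, and 800, and the last 100 tokens of each chunk repeat as the first 100 of the next. Smaller windows produce more, tighter chunks, which is consistent with the precision and recall pattern in the Fixed rows.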