Evaluation Runs
Select a run to inspect its scorecard and regression analysis.
| Run | Date | Retriever / Reranker | Prompt | Cases | Overall Score | Pass Rate | Critical Fails | Regression | Release |
|---|---|---|---|---|---|---|---|---|---|
| baseline-vector-rag-v1 | 2026-01-12 | Semantic (dense) only | prompt-v1 | 38 | 64 | 66% | 5 | No Regression | Block |
| query-rewrite-v2 | 2026-02-09 | Semantic + query rewriting | prompt-v2 | 44 | 69 | 70% | 4 | No Regression | Hold |
| hybrid-search-v3 | 2026-03-08 | Hybrid (dense + BM25) | prompt-v2 | 48 | 73 | 75% | 3 | No Regression | Hold |
| reranker-enabled-v4 | 2026-04-11 | Hybrid + cross-encoder reranker | prompt-v3 | 48 | 76 | 79% | 3 | Watch | Hold |
| citation-validator-v5 | 2026-05-16 | Hybrid + reranker | prompt-v4 (citation-grounded) | 50 | 76 | 78% | 2 | Watch | Hold |
| compliance-guardrail-v6 | 2026-06-15 | Hybrid + reranker + metadata filter | prompt-v5 (guardrail + escalation) | 50 | 78 | 82% | 1 | Watch | Hold |
Scorecard · compliance-guardrail-v6
Compliance guardrails cut critical failures to one and improved high-risk handling. Citation accuracy and P95 latency still miss their targets.
| Metric | Value | Target | vs prev |
|---|---|---|---|
| Overall Score | 78% | 80% | +2 |
| Retrieval Quality | 86% | 85% | 0 |
| Faithfulness | 84% | 85% | +1 |
| Citation Accuracy | 82% | 85% | +4 |
| Pass Rate | 82% | 85% | +4 |
| High-Risk Pass Rate | 87% | 90% | +4 |
| Hallucination Risk | 11% | | |
| Avg Latency | 2.60s | | |
| P95 Latency | 4.25s | | |
| Cost / Query | $0.042 | | |
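The target comparison in this scorecard can be reproduced with a few lines of Python. The sketch below is illustrative only: the values and targets are copied from the scorecard, while the `SCORECARD` dictionary and the `gaps_to_target` helper are hypothetical names, not part of the evaluation tooling. It assumes higher is better for every metric listed, so it leaves out hallucination risk, latency, and cost, for which no targets are shown.

```python
# Scorecard values for compliance-guardrail-v6, copied from the scorecard.
# Each entry is (value, target), both in percent.
# SCORECARD and gaps_to_target are hypothetical names used for illustration.
SCORECARD = {
    "Overall Score": (78, 80),
    "Retrieval Quality": (86, 85),
    "Faithfulness": (84, 85),
    "Citation Accuracy": (82, 85),
    "Pass Rate": (82, 85),
    "High-Risk Pass Rate": (87, 90),
}


def gaps_to_target(scorecard):
    """Return {metric: value - target} for metrics below target.

    Assumes higher is better for every metric passed in.
    """
    return {
        name: value - target
        for name, (value, target) in scorecard.items()
        if value < target
    }


if __name__ == "__main__":
    for name, gap in gaps_to_target(SCORECARD).items():
        print(f"{name}: {gap:+d} points vs target")
```

On these numbers, only Retrieval Quality meets its target; the other five metrics fall one to three points short.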
Regression Analysis
vs citation-validator-v5
No Regression
- All tracked metrics are within regression tolerances.
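A tolerance-based regression check of this kind might look like the following sketch. The deltas are the "vs prev" values from the scorecard; the one-point tolerance, the `TOLERANCE_POINTS` constant, and the function names are assumptions made for illustration, since the actual tolerances and the rule that produces a "Watch" status are not shown here.

```python
# "vs prev" deltas (in points) for compliance-guardrail-v6 relative to
# citation-validator-v5, copied from the scorecard.
DELTA_VS_PREV = {
    "Overall Score": 2,
    "Retrieval Quality": 0,
    "Faithfulness": 1,
    "Citation Accuracy": 4,
    "Pass Rate": 4,
    "High-Risk Pass Rate": 4,
}

# Hypothetical tolerance: a metric may drop by up to this many points
# before it counts as a regression. The real tolerances are not shown.
TOLERANCE_POINTS = 1


def regressed_metrics(deltas, tolerance):
    """Return the metrics whose drop exceeds the tolerance."""
    return [name for name, delta in deltas.items() if delta < -tolerance]


def regression_verdict(deltas, tolerance):
    """Return 'Regression' if any metric dropped beyond tolerance."""
    return "Regression" if regressed_metrics(deltas, tolerance) else "No Regression"


if __name__ == "__main__":
    print(regression_verdict(DELTA_VS_PREV, TOLERANCE_POINTS))
```

Because none of the six deltas is negative, this check reports no regression for any non-negative tolerance. It covers only the metrics that carry a "vs prev" value; hallucination risk, latency, and cost would need their previous-run values to be included.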
Release Recommendation
Hold
Run Comparison
Overall score and citation accuracy across all runs. Selected run highlighted.