Run the Live RAG Evaluator Lab
Upload a document, ask a question, and watch the evaluator score retrieval, citations, grounding, and hallucination risk in real time. No setup, no API keys — it runs right here.
How to Read This Dashboard
Basic RAG proves that answers can be generated. Production RAG requires proof that answers are retrieved from the right sources, grounded in evidence, cited accurately, monitored continuously, and governed responsibly.
Executive view
The KPI cards, quality trend, and release recommendation summarize whether the system is ready for production and where the risk sits.
Technical view
Trace Explorer, Retrieval, and Answer Quality let engineers inspect chunks, ranking, citations, and claim-level support behind each score.
Governance view
Failure Analysis, Governance, and Quality Gates connect quality signals to root causes, maturity, and the release decision.
Headline Metrics
Current run: compliance-guardrail-v6
Up from 76% last run, but still 2 points below the 80% production baseline. Held back by citation accuracy and high-risk failures.
Hybrid search plus reranking lifted retrieval quality above target. The retriever now surfaces the right evidence in most cases.
Answers are mostly grounded in retrieved evidence, but a few responses still over-generalize beyond the source.
Below the 85% threshold. Most failures are citations that are topically related but do not directly support the specific claim.
Down 4 points after the citation validator, but still above the 8% target. Remaining risk concentrates in policy-exception queries.
One critical compliance query still fails. Critical failures must reach zero before promotion to production.
Improving but below the 90% bar. Finance and compliance queries remain the weakest segment.
Average latency is within target, though reranking pushed P95 latency above the 4s SLA.
Cost rose with reranking and validation steps but remains under the $0.045 target.
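The latency card separates average latency from P95, and the two can diverge. A minimal sketch of how an average can sit inside a 4s SLA while P95 breaches it, assuming a nearest-rank percentile (the dashboard does not state which percentile method it uses). The latency samples and the names `percentile_nearest_rank` and `SLA_SECONDS` are hypothetical and are not taken from the evaluated run.

```python
import math

SLA_SECONDS = 4.0  # P95 SLA quoted on the dashboard


def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical per-query latencies in seconds (illustrative only).
latencies = [1.8, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.8, 4.3]

mean_latency = sum(latencies) / len(latencies)
p95_latency = percentile_nearest_rank(latencies, 95)

print(f"mean={mean_latency:.2f}s within SLA: {mean_latency <= SLA_SECONDS}")
print(f"p95={p95_latency:.2f}s within SLA: {p95_latency <= SLA_SECONDS}")
```

A single slow tail request is enough to push P95 over the SLA here, which is why the two numbers are tracked separately.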
Quality Trend Across Runs
Overall, retrieval, citation, and faithfulness over six evaluation runs.
Risk Distribution
Evaluated traces by business risk level.
Executive Summary
RAG quality improved from 64% to 78% across six evaluation runs as the team added query rewriting, hybrid search, reranking, citation validation, and compliance guardrails. Retrieval quality now exceeds target at 86%, and critical failures fell from five to one.
Two issues still block production: citation accuracy (82%) remains below the 85% threshold, and P95 latency (4.25s) exceeds the 4-second SLA after reranking. One critical compliance query still fails and is held in human review. Current readiness is Level 3: Controlled Pilot, with a release recommendation of Hold until citation and high-risk gaps close.
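The Hold recommendation can be read as a set of threshold checks. A minimal sketch using only the values and targets quoted on this dashboard. The `Gate` class, the `release_recommendation` function, the "Promote" label, and the choice to treat thresholds as inclusive are assumptions for illustration, not the evaluator's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    name: str
    value: float
    threshold: float
    higher_is_better: bool = True

    def passed(self) -> bool:
        # Assumption: a value exactly at the threshold passes.
        if self.higher_is_better:
            return self.value >= self.threshold
        return self.value <= self.threshold


def release_recommendation(gates):
    """Return ("Promote" or "Hold", names of failing gates)."""
    failing = [g.name for g in gates if not g.passed()]
    return ("Promote" if not failing else "Hold"), failing


# Values and targets as reported for the current run.
gates = [
    Gate("overall_quality", 0.78, 0.80),
    Gate("citation_accuracy", 0.82, 0.85),
    Gate("p95_latency_s", 4.25, 4.0, higher_is_better=False),
    Gate("critical_failures", 1, 0, higher_is_better=False),
]

decision, failing = release_recommendation(gates)
print(decision, failing)
```

Keeping each gate as data rather than as hard-coded conditions makes it simple to add the remaining dashboard targets (hallucination rate, high-risk pass rate, cost) once their current values are available.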
Recommended Actions
- Engineering (High): Ship claim-to-evidence overlap validation to lift citation accuracy past 85%.
- Engineering (High): Cache reranker scores and trim candidate count to bring P95 latency under 4s.
- Compliance (Critical): Resolve the open critical failure on external AI tool data-use guidance.
- Data / Content (High): Remove the retired Travel Policy v2.7 and revise the ambiguous sections of AI Governance v1.3.
- Product (Medium): Expand golden dataset coverage for finance and legal high-risk queries.
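The first engineering action, claim-to-evidence overlap validation, could look roughly like the sketch below. It assumes a simple lexical measure: the share of a claim's content words that also appear in the cited chunk. The source does not say which overlap method is planned, so the stopword list, the 0.6 cutoff, the function names, and the example sentences are all hypothetical.

```python
import re

# Small illustrative stopword list (assumption, not from the dashboard).
STOPWORDS = {"a", "an", "the", "of", "to", "in", "and", "or",
             "is", "are", "for", "on", "be", "must"}


def content_tokens(text):
    """Lowercase the text and return its non-stopword tokens as a set."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS}


def claim_support(claim, evidence):
    """Fraction of the claim's content tokens found in the evidence."""
    claim_tokens = content_tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & content_tokens(evidence)) / len(claim_tokens)


def citation_supported(claim, evidence, min_overlap=0.6):
    """Flag a citation as supporting the claim when overlap is high enough."""
    return claim_support(claim, evidence) >= min_overlap


# Hypothetical claim and two candidate citations.
claim = "Employees must not paste customer data into external AI tools."
direct = "Customer data must not be pasted into external AI tools without approval."
topical = "External AI tools are reviewed annually by the governance team."

print(citation_supported(claim, direct))   # directly supporting chunk
print(citation_supported(claim, topical))  # topically related chunk only
```

Lexical overlap is a crude proxy: it misses paraphrases and cannot detect negation, so a production validator would likely pair it with an entailment or semantic-similarity check. It does, however, target the failure pattern described above, where a citation is on topic but does not back the specific claim.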
Production Readiness
Controlled Pilot
Overall quality of 78% places the system at Level 3. Reaching Level 4 requires citation accuracy at target, zero critical failures, P95 within SLA, and continuous monitoring on live traffic.