

Run the Live RAG Evaluator Lab

Upload a document, ask a question, and watch the evaluator score retrieval, citations, grounding, and hallucination risk in real time. No setup, no API keys — it runs right here.


How to Read This Dashboard

Basic RAG proves that answers can be generated. Production RAG requires proof that answers are retrieved from the right sources, grounded in evidence, cited accurately, monitored continuously, and governed responsibly.

Executive view

The KPI cards, quality trend, and release recommendation summarize whether the system is ready for production and where the risk sits.

Technical view

Trace Explorer, Retrieval, and Answer Quality let engineers inspect chunks, ranking, citations, and claim-level support behind each score.

Governance view

Failure Analysis, Governance, and Quality Gates connect quality signals to root causes, maturity, and the release decision.

Headline Metrics

Current run: compliance-guardrail-v6

Overall RAG Quality
Watch
78% (+2%)
Target 80%

Up from 76% last run, but still 2 points below the 80% production baseline. Held back by citation accuracy and high-risk failures.

Retrieval Quality
Healthy
86% (+9%)
Target 85%

Hybrid search plus reranking lifted retrieval quality above target. The retriever now surfaces the right evidence in most cases.

Answer Faithfulness
Watch
84% (+3%)
Target 85%

Answers are mostly grounded in retrieved evidence, but a few responses still over-generalize beyond the source.

Citation Accuracy
At Risk
82% (+4%)
Target 85%

Below the 85% threshold. Most failures are citations that are topically related but do not directly support the specific claim.

Hallucination Risk
Watch
11% (-4%)
Target 8%

Down 4 points since the citation validator was added, but still above the 8% target. The remaining risk is concentrated in policy-exception queries.

Critical Failure Rate
At Risk
2% (-1.3%)
Target 0%

One critical compliance query still fails. Critical failures must reach zero before promotion to production.

High-Risk Pass Rate
At Risk
87% (+5%)
Target 90%

Improving but below the 90% bar. Finance and compliance queries remain the weakest segment.

Average Latency
Healthy
2.6s (+0.4s)
Target 3s

Average latency is within target, though reranking pushed P95 latency above the 4s SLA.
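An average within target can coexist with a P95 above the SLA, because a handful of slow requests barely move the mean but define the tail. The sketch below illustrates this with made-up latencies (not data from this run). It uses the nearest-rank percentile method, which is an assumption: the dashboard does not say how it computes P95.

```python
import math
from statistics import mean

def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile of values using the nearest-rank method."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies in seconds, illustrative only:
# most requests are fast, a few reranked requests are slow.
latencies = [2.1] * 16 + [4.3, 4.4, 4.6, 4.8]

avg = mean(latencies)
p95 = percentile_nearest_rank(latencies, 95)

print(f"mean = {avg:.2f}s, within 3s target: {avg <= 3.0}")
print(f"P95  = {p95:.2f}s, within 4s SLA:    {p95 <= 4.0}")
```

In this toy sample the mean stays under 3s while the P95 lands above 4s, which is the same pattern the card describes.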

Cost per Query
Healthy
$0.042 (+$0.006)
Target $0.045

Cost rose with reranking and validation steps but remains under the $0.045 target.
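The cards above each compare a value to a target, but the direction differs: quality metrics must be at or above target, while risk, latency, and cost must be at or below it. A minimal sketch of that comparison follows, using the values shown in the cards. The `Kpi` class and its field names are hypothetical, and the split between Watch and At Risk is not modeled because the page does not state the rule behind it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Kpi:
    name: str
    value: float
    target: float
    higher_is_better: bool

    def meets_target(self) -> bool:
        # Direction-aware comparison: quality goes up, risk and cost go down.
        if self.higher_is_better:
            return self.value >= self.target
        return self.value <= self.target

# Values and targets copied from the KPI cards for the current run.
KPIS = [
    Kpi("Overall RAG Quality", 78, 80, True),
    Kpi("Retrieval Quality", 86, 85, True),
    Kpi("Answer Faithfulness", 84, 85, True),
    Kpi("Citation Accuracy", 82, 85, True),
    Kpi("Hallucination Risk", 11, 8, False),
    Kpi("Critical Failure Rate", 2, 0, False),
    Kpi("High-Risk Pass Rate", 87, 90, True),
    Kpi("Average Latency (s)", 2.6, 3.0, False),
    Kpi("Cost per Query ($)", 0.042, 0.045, False),
]

below_target = [k.name for k in KPIS if not k.meets_target()]

for k in KPIS:
    status = "meets target" if k.meets_target() else "misses target"
    print(f"{k.name}: {k.value} vs {k.target} -> {status}")
```

Under this comparison, the three cards that meet their targets are the same three the dashboard labels Healthy.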

Quality Trend Across Runs

Overall, retrieval, citation, and faithfulness over six evaluation runs.

Risk Distribution

Evaluated traces by business risk level.

Executive Summary

RAG quality improved from 64% to 78% across six evaluation runs as the team added query rewriting, hybrid search, reranking, citation validation, and compliance guardrails. Retrieval quality now exceeds target at 86%, and critical failures fell from five to one.

Two issues still block production: citation accuracy (82%) remains below the 85% threshold, and P95 latency (4.25s) exceeds the 4-second SLA after reranking. One critical compliance query still fails and is held in human review. Current readiness is Level 3: Controlled Pilot, with a release recommendation of Hold until citation and high-risk gaps close.

Recommended Actions

  • Engineering · High

    Ship claim-to-evidence overlap validation to lift citation accuracy past 85%.

  • Engineering · High

    Cache reranker scores and trim candidate count to bring P95 latency under 4s.

  • Compliance · Critical

    Resolve the open critical failure on external AI tool data-use guidance.

  • Data / Content · High

    Remove the retired Travel Policy v2.7 and revise the ambiguous sections of AI Governance v1.3.

  • Product · Medium

    Expand golden dataset coverage for finance and legal high-risk queries.
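The first engineering action above, claim-to-evidence overlap validation, targets citations that are topically related but do not support the specific claim. One rough way to approximate it is lexical overlap: measure what fraction of a claim's content words appear in the cited passage. The sketch below assumes that approach. The function names, the stopword list, the 0.6 threshold, and the example sentences are all hypothetical, and a production validator would likely need something stronger than token matching.

```python
import re

# Small illustrative stopword list (an assumption, not a standard set).
STOPWORDS = {"a", "an", "the", "of", "to", "in", "on", "for",
             "and", "or", "is", "are", "be"}

def content_tokens(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS}

def claim_support(claim, evidence):
    """Fraction of the claim's content tokens that also occur in the evidence."""
    claim_tokens = content_tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & content_tokens(evidence)) / len(claim_tokens)

def citation_is_supported(claim, evidence, threshold=0.6):
    """Accept a citation only when overlap reaches the (assumed) threshold."""
    return claim_support(claim, evidence) >= threshold

# Hypothetical claim and passages, for illustration only.
claim = "Expense reports must be submitted within 30 days of travel."
direct = "Employees must submit expense reports within 30 days of completing travel."
related = "The travel policy covers booking, lodging, and expense categories."

print(citation_is_supported(claim, direct))   # passage states the rule itself
print(citation_is_supported(claim, related))  # passage is only on-topic
```

The on-topic passage shares only "travel" and "expense" with the claim, so it falls below the threshold. Pure token overlap misses paraphrases and negations, so it is best read as a cheap first filter.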

Production Readiness

Level 3 of 5

Controlled Pilot

1. Basic Demo
2. Measured Prototype
3. Controlled Pilot
4. Production Managed
5. Enterprise Scale

Overall quality of 78% places the system at Level 3. Reaching Level 4 requires citation accuracy at target, zero critical failures, P95 within SLA, and continuous monitoring on live traffic.
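The four Level 4 conditions can be expressed as a simple gate that lists whatever still blocks promotion. This is a minimal sketch, not the dashboard's actual gate logic. `RunSnapshot` and `level4_blockers` are hypothetical names, and the `live_monitoring=False` value for the current run is an assumption, since the page does not state whether monitoring is already in place.

```python
from dataclasses import dataclass

CITATION_TARGET_PCT = 85.0  # citation accuracy target from the KPI card
P95_SLA_SECONDS = 4.0       # P95 latency SLA from the executive summary

@dataclass(frozen=True)
class RunSnapshot:
    citation_accuracy_pct: float
    critical_failures: int
    p95_latency_s: float
    live_monitoring: bool

def level4_blockers(run):
    """Return a list of unmet Level 4 conditions (empty means the gate passes)."""
    blockers = []
    if run.citation_accuracy_pct < CITATION_TARGET_PCT:
        blockers.append("citation accuracy below target")
    if run.critical_failures > 0:
        blockers.append("critical failures not zero")
    if run.p95_latency_s > P95_SLA_SECONDS:
        blockers.append("P95 latency above SLA")
    if not run.live_monitoring:
        blockers.append("no continuous monitoring on live traffic")
    return blockers

# Current run values from this page; live_monitoring is assumed.
current = RunSnapshot(
    citation_accuracy_pct=82.0,
    critical_failures=1,
    p95_latency_s=4.25,
    live_monitoring=False,
)

for blocker in level4_blockers(current):
    print("BLOCKED:", blocker)
```

Returning the full list of blockers rather than a single pass/fail flag keeps the release decision traceable to specific gaps.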