Executive Overview · Sudeep Lalka

Run the Live RAG Evaluator Lab

Upload a document, ask a question, and watch the evaluator score retrieval, citations, grounding, and hallucination risk in real time. No setup and no API keys are required: it runs directly on this page.

How to Read This Dashboard

Basic RAG proves that answers can be generated. Production RAG requires proof that answers are retrieved from the right sources, grounded in evidence, cited accurately, monitored continuously, and governed responsibly.

Executive view

The KPI cards, quality trend, and release recommendation summarize whether the system is ready for production and where the risk sits.

Technical view

Trace Explorer, Retrieval, and Answer Quality let engineers inspect chunks, ranking, citations, and claim-level support behind each score.

Governance view

Failure Analysis, Governance, and Quality Gates connect quality signals to root causes, maturity, and the release decision.

Headline Metrics

Current run: compliance-guardrail-v6

Overall RAG Quality

Watch

78% (+2%)

Target 80%

Up from 76% last run, but still 2 points below the 80% production baseline. Held back by citation accuracy and high-risk failures.

Retrieval Quality

Healthy

86% (+9%)

Target 85%

Hybrid search plus reranking lifted retrieval quality above target. The retriever now surfaces the right evidence in most cases.

Answer Faithfulness

Watch

84% (+3%)

Target 85%

Answers are mostly grounded in retrieved evidence, but a few responses still over-generalize beyond the source.

Citation Accuracy

At Risk

82% (+4%)

Target 85%

Below the 85% threshold. Most failures are citations that are topically related but do not directly support the specific claim.

Hallucination Risk

Watch

11% (-4%)

Target 8%

Down 4 points after the citation validator, but still above the 8% target. Remaining risk concentrates in policy-exception queries.

Critical Failure Rate

At Risk

2% (-1.3%)

Target 0%

One critical compliance query still fails. Critical failures must reach zero before promotion to production.

High-Risk Pass Rate

At Risk

87% (+5%)

Target 90%

Improving but below the 90% bar. Finance and compliance queries remain the weakest segment.

Average Latency

Healthy

2.6s (+0.4s)

Target 3s

Average latency is within target, though reranking pushed P95 latency above the 4s SLA.
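
A healthy average can coexist with a breached P95 because a few slow traces barely move the mean but dominate the tail. The sketch below illustrates this with made-up latency samples (they are not the dashboard's trace data) and a nearest-rank percentile, which is one common definition and may differ from the one the dashboard uses.

```python
import math
from statistics import mean


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical samples in seconds: 18 fast traces and 2 slow reranked ones.
latencies = [2.4] * 18 + [4.25, 4.25]

avg_latency = mean(latencies)
p95_latency = percentile_nearest_rank(latencies, 95)

AVG_TARGET_S = 3.0  # average-latency target from the card above
P95_SLA_S = 4.0     # P95 SLA from the card above

if __name__ == "__main__":
    print(f"average within target: {avg_latency <= AVG_TARGET_S}")
    print(f"P95 within SLA: {p95_latency <= P95_SLA_S}")
```

In this toy sample only two of twenty traces are slow, yet they alone decide the P95 value, which is why the tail is tracked separately from the average.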

Cost per Query

Healthy

$0.042 (+$0.006)

Target $0.045

Cost rose with reranking and validation steps but remains under the $0.045 target.
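
The cards above mix two kinds of metrics: those where higher is better (quality, pass rates) and those where lower is better (risk, failures, latency, cost). A minimal sketch of a target check that respects this direction follows. The `Metric` class and its field names are illustrative assumptions, not the evaluator's actual schema; the values and targets are copied from the cards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    value: float
    target: float
    higher_is_better: bool = True

    def meets_target(self) -> bool:
        if self.higher_is_better:
            return self.value >= self.target
        return self.value <= self.target

    def gap(self) -> float:
        """Distance still to close; 0.0 once the target is met."""
        if self.meets_target():
            return 0.0
        return round(abs(self.target - self.value), 4)


# Values and targets as shown on the cards for the current run.
METRICS = [
    Metric("Overall RAG Quality", 78, 80),
    Metric("Retrieval Quality", 86, 85),
    Metric("Answer Faithfulness", 84, 85),
    Metric("Citation Accuracy", 82, 85),
    Metric("Hallucination Risk", 11, 8, higher_is_better=False),
    Metric("Critical Failure Rate", 2, 0, higher_is_better=False),
    Metric("High-Risk Pass Rate", 87, 90),
    Metric("Average Latency (s)", 2.6, 3.0, higher_is_better=False),
    Metric("Cost per Query ($)", 0.042, 0.045, higher_is_better=False),
]


def below_target(metrics):
    """Return (name, gap) pairs for every metric that misses its target."""
    return [(m.name, m.gap()) for m in metrics if not m.meets_target()]


if __name__ == "__main__":
    for name, gap in below_target(METRICS):
        print(f"{name}: {gap} from target")
```

The sketch reports only whether a target is met and by how much. It does not reproduce the Healthy, Watch, and At Risk labels, because the thresholds behind those labels are not stated on the page.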

Quality Trend Across Runs

Overall, retrieval, citation, and faithfulness over six evaluation runs.

Risk Distribution

Evaluated traces by business risk level.

Executive Summary

RAG quality improved from 64% to 78% across six evaluation runs as the team added query rewriting, hybrid search, reranking, citation validation, and compliance guardrails. Retrieval quality now exceeds target at 86%, and critical failures fell from five to one.

Two issues still block production: citation accuracy (82%) remains below the 85% threshold, and P95 latency (4.25s) exceeds the 4-second SLA after reranking. One critical compliance query still fails and is held in human review. Current readiness is Level 3: Controlled Pilot, with a release recommendation of Hold until citation and high-risk gaps close.

Recommended Actions

Engineering (High): Ship claim-to-evidence overlap validation to lift citation accuracy past 85%.
Engineering (High): Cache reranker scores and trim candidate count to bring P95 latency under 4s.
Compliance (Critical): Resolve the open critical failure on external AI tool data-use guidance.
Data / Content (High): Remove retired Travel Policy v2.7 and revise AI Governance v1.3 ambiguous sections.
Product (Medium): Expand golden dataset coverage for finance and legal high-risk queries.
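
To make the first action concrete, here is a rough sketch of what a claim-to-evidence overlap check could look like. It is a deliberately simple lexical version: the stopword list, the 0.6 threshold, and the example sentences are all assumptions made up for illustration, and a production validator would more likely rely on an entailment or semantic-similarity model.

```python
import re

# Small illustrative stopword list (an assumption, not a standard one).
STOPWORDS = frozenset({
    "a", "an", "the", "of", "to", "in", "on", "for", "and", "or",
    "is", "are", "be", "must", "by", "with",
})


def content_tokens(text):
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def overlap_score(claim, evidence):
    """Fraction of the claim's content tokens that also appear in the evidence."""
    claim_tokens = content_tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & content_tokens(evidence)) / len(claim_tokens)


def citation_supported(claim, evidence, threshold=0.6):
    """Hypothetical gate: accept a citation only if overlap reaches the threshold."""
    return overlap_score(claim, evidence) >= threshold


# Made-up example text, not taken from any real policy document.
CLAIM = "Employees must not paste customer data into external AI tools."
DIRECT = "Customer data must not be pasted into external AI tools by employees or contractors."
RELATED = "External AI tools require security review before procurement."

if __name__ == "__main__":
    print(citation_supported(CLAIM, DIRECT))   # directly supporting chunk
    print(citation_supported(CLAIM, RELATED))  # topically related chunk only
```

The second chunk shares the topic (external AI tools) but not the specific claim, which is the failure pattern the Citation Accuracy card describes. Pure token overlap misses paraphrases and can be fooled by negation, so it is best treated as a cheap first filter.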

Production Readiness

Level 3 of 5

Controlled Pilot

1. Basic Demo

2. Measured Prototype

3. Controlled Pilot

4. Production Managed

5. Enterprise Scale

Overall quality of 78% places the system at Level 3. Reaching Level 4 requires citation accuracy at target, zero critical failures, P95 within SLA, and continuous monitoring on live traffic.
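
The four Level 4 conditions can be expressed as a simple checklist. The sketch below is illustrative only: `RunSnapshot` and `level4_blockers` are hypothetical names, and the `live_monitoring=False` value for the current run is an assumption, since the page does not state whether live-traffic monitoring is already in place. The other figures come from the dashboard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSnapshot:
    citation_accuracy: float  # percent
    critical_failures: int    # count of failing critical queries
    p95_latency_s: float      # seconds
    live_monitoring: bool     # continuous monitoring on live traffic


def level4_blockers(run, citation_target=85.0, p95_sla_s=4.0):
    """Return the Level 4 conditions that the run does not yet satisfy."""
    blockers = []
    if run.citation_accuracy < citation_target:
        blockers.append("citation accuracy below target")
    if run.critical_failures > 0:
        blockers.append("critical failures not at zero")
    if run.p95_latency_s > p95_sla_s:
        blockers.append("P95 latency above SLA")
    if not run.live_monitoring:
        blockers.append("no continuous monitoring on live traffic")
    return blockers


# Current run figures from the dashboard; live_monitoring is assumed False.
current = RunSnapshot(
    citation_accuracy=82.0,
    critical_failures=1,
    p95_latency_s=4.25,
    live_monitoring=False,
)

if __name__ == "__main__":
    for blocker in level4_blockers(current):
        print("BLOCKED:", blocker)
```

An empty list means every stated condition is met. Returning the unmet conditions, rather than a single pass or fail flag, keeps the release decision traceable to specific gaps.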