Build · RAG

Release Decision

Current Recommendation: Block

3 Passed · 3 Warning · 2 Failed · 0 Not Evaluated

Why Hold?

Citation accuracy (82%) sits below the 85% production threshold, a High-severity gate failure, so the documented decision logic returns Hold rather than promotion. High-risk query pass rate (87%) and P95 latency (4.25s) are both at Warning, and one critical compliance query remains in human review.

The system should not be promoted until the citation overlap check ships and the open critical failure is resolved. Cost, regression tolerance, and human-review completion all pass, so a focused fix on citations and latency would move this to Promote with Monitoring.
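The decision logic itself is not reproduced on this page, so the following is only a minimal sketch of how gate statuses might map to a recommendation. The rule (any Failed or Not Evaluated gate holds the release, warnings alone allow Promote with Monitoring) and the function name `recommend` are assumptions for illustration, not the lab's documented logic.

```python
from collections import Counter


def recommend(statuses):
    """Map gate statuses to a recommendation.

    ASSUMED rule, for illustration only:
      - any Failed or Not Evaluated gate -> "Hold"
      - otherwise any Warning gate      -> "Promote with Monitoring"
      - otherwise                       -> "Promote"
    """
    counts = Counter(statuses)
    if counts["Failed"] > 0 or counts["Not Evaluated"] > 0:
        return "Hold"
    if counts["Warning"] > 0:
        return "Promote with Monitoring"
    return "Promote"


# Statuses of the eight gates as listed under Quality Gate Status.
current_run = [
    "Warning", "Failed", "Failed", "Warning",
    "Warning", "Passed", "Passed", "Passed",
]
print(recommend(current_run))
```

Under this assumed rule the current run stays on Hold until both Failed gates clear; any remaining Warning gates would then yield Promote with Monitoring.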

Quality Gate Status

Eight gates spanning quality, risk, latency, cost, regression, and governance.

| Gate | Description | Threshold | Current | Status | Severity | Remediation |
| --- | --- | --- | --- | --- | --- | --- |
| Overall Quality Score | Composite quality must meet the production baseline. | >= 80% | 78% | Warning | High | Close citation and high-risk gaps; 2 points from baseline. |
| Citation Accuracy | Citations must directly support the claims they are attached to. | >= 85% | 82% | Failed | High | Add claim-level citation validation requiring direct evidence overlap before displaying a citation. |
| Critical Hallucination Failures | No unsupported high-impact claims on critical-risk queries. | = 0 | 1 | Failed | Critical | Enforce claim-level grounding and refuse partial answers on critical policy queries (e.g. external AI tool data use). |
| High-Risk Query Pass Rate | Reliability on finance, legal, compliance, and security queries. | >= 90% | 87% | Warning | Critical | Improve finance and compliance retrieval with metadata filtering and clearer source documents. |
| P95 Latency | 95th-percentile end-to-end latency within SLA. | <= 4.0s | 4.25s | Warning | Medium | Cache reranker scores and reduce rerank candidate count to bring P95 under 4s. |
| Cost per Query | Fully-loaded cost per query within budget. | <= $0.045 | $0.042 | Passed | Medium | Within target. Monitor as evaluation passes expand. |
| Regression Limit | No overall regression > 3% or citation regression > 5% vs prior run. | Within limits | No regression | Passed | High | Latency is on watch but no metric regressed beyond tolerance this run. |
| Human Review Completion | All critical-domain cases have completed required human review. | 100% critical | 100% | Passed | Critical | All critical-risk traces were routed and reviewed for this run. |
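As a rough illustration of how the numeric gates above compare against their thresholds, the sketch below encodes the six gates that have a numeric target and lists the ones that miss it. The `Gate` class and its field names are hypothetical. The page does not state where the Warning band ends and Failed begins, so the sketch only reports whether a threshold is met, not the status.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Gate:
    name: str
    compare: Callable[[float, float], bool]  # called as compare(current, threshold)
    threshold: float
    current: float

    def meets_threshold(self) -> bool:
        return self.compare(self.current, self.threshold)


# Values copied from the table; percentages are kept as whole numbers.
GATES = [
    Gate("Overall Quality Score", operator.ge, 80, 78),
    Gate("Citation Accuracy", operator.ge, 85, 82),
    Gate("Critical Hallucination Failures", operator.eq, 0, 1),
    Gate("High-Risk Query Pass Rate", operator.ge, 90, 87),
    Gate("P95 Latency", operator.le, 4.0, 4.25),
    Gate("Cost per Query", operator.le, 0.045, 0.042),
]

# Gates whose current value does not satisfy the stated threshold.
missed = [g.name for g in GATES if not g.meets_threshold()]
for name in missed:
    print(name)
```

Only Cost per Query meets its numeric threshold here, which lines up with the five open gates listed in the remediation plan.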

Remediation Plan

What must change to clear the open gates before promotion.

Overall Quality Score

Warning

Target >= 80% · currently 78%

Close citation and high-risk gaps; 2 points from baseline.

Citation Accuracy

Failed

Target >= 85% · currently 82%

Add claim-level citation validation requiring direct evidence overlap before displaying a citation.
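One way to read "direct evidence overlap" is a lexical check between a claim and each passage it cites. The sketch below is an assumption-laden illustration: the function names, the small stopword list, and the 0.6 overlap threshold are all hypothetical, and a production check might use entailment or embedding similarity instead of token overlap.

```python
import re

# Small illustrative stopword list (assumption, not a tuned list).
STOPWORDS = frozenset(
    {"the", "a", "an", "of", "to", "in", "and", "or", "is", "are",
     "for", "on", "with", "that", "be", "must"}
)


def content_tokens(text):
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def citation_supported(claim, passage, min_overlap=0.6):
    """True if enough of the claim's content tokens appear in the passage."""
    claim_tokens = content_tokens(claim)
    if not claim_tokens:
        return False
    overlap = len(claim_tokens & content_tokens(passage)) / len(claim_tokens)
    return overlap >= min_overlap


def filter_citations(claim, cited_passages, min_overlap=0.6):
    """Keep only the citations that pass the overlap check for this claim."""
    return [p for p in cited_passages if citation_supported(claim, p, min_overlap)]


claim = "Expense reports are due within 30 days"
passages = [
    "Employees must submit expense reports within 30 days of purchase.",
    "The travel policy covers flights and hotels.",
]
kept = filter_citations(claim, passages)
print(kept)
```

In this example only the first passage would be displayed as a citation; the second shares no content tokens with the claim and is dropped.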

Critical Hallucination Failures

Failed

Target = 0 · currently 1

Enforce claim-level grounding and refuse partial answers on critical policy queries (e.g. external AI tool data use).
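A minimal sketch of the "refuse partial answers" behaviour, assuming each generated claim already carries a grounding verdict from an upstream check. The `Claim` type, the `risk_level` values, and the refusal wording are hypothetical. Dropping ungrounded claims on non-critical queries is also an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str
    grounded: bool  # True if a retrieved passage directly supports the claim


# Hypothetical refusal text.
REFUSAL = "This question cannot be answered reliably from the approved sources."


def finalize_answer(claims, risk_level):
    """Refuse outright on critical queries if any claim is ungrounded.

    On other risk levels, ungrounded claims are dropped and the rest returned.
    """
    has_ungrounded = any(not c.grounded for c in claims)
    if risk_level == "critical" and has_ungrounded:
        return REFUSAL
    return " ".join(c.text for c in claims if c.grounded)


claims = [
    Claim("Customer data may not be pasted into external AI tools.", True),
    Claim("Exceptions are approved by team leads.", False),
]
print(finalize_answer(claims, "critical"))
print(finalize_answer(claims, "low"))
```

The point of the design is that a critical-risk answer is all-or-nothing: one unsupported claim withholds the whole response rather than shipping the supported half.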

High-Risk Query Pass Rate

Warning

Target >= 90% · currently 87%

Improve finance and compliance retrieval with metadata filtering and clearer source documents.
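Metadata filtering could look roughly like the sketch below, which restricts candidates to the query's domain before ranking. The `Chunk` fields, the `domain` tag, and the fallback to unfiltered retrieval when the filter returns nothing are assumptions for illustration; the page does not describe the retriever's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    domain: str   # hypothetical metadata tag, e.g. "finance" or "compliance"
    score: float  # first-stage retrieval score, higher is better


def retrieve(chunks, query_domain, top_k=3):
    """Prefer chunks tagged with the query's domain, then rank by score."""
    in_domain = [c for c in chunks if c.domain == query_domain]
    # Fall back to all chunks if nothing carries the requested tag.
    candidates = in_domain or chunks
    return sorted(candidates, key=lambda c: c.score, reverse=True)[:top_k]


corpus = [
    Chunk("Quarterly expense approval limits.", "finance", 0.71),
    Chunk("Office snack ordering guide.", "general", 0.90),
    Chunk("Invoice retention schedule.", "finance", 0.65),
]
results = retrieve(corpus, "finance", top_k=2)
print([c.text for c in results])
```

Here the higher-scoring but off-domain chunk is excluded, so a finance query only sees finance-tagged sources.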

P95 Latency

Warning

Target <= 4.0s · currently 4.25s

Cache reranker scores and reduce rerank candidate count to bring P95 under 4s.
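Both levers in this remediation can be sketched in a few lines: memoize the score for each (query, passage) pair, and cap how many first-stage candidates reach the reranker. The scorer below is a toy word-overlap stand-in for a real reranker call, and `max_candidates`, `top_k`, and the cache size are placeholder values, not tuned settings.

```python
from functools import lru_cache


@lru_cache(maxsize=10_000)
def rerank_score(query: str, passage: str) -> float:
    """Stand-in for a slow reranker call: fraction of query words in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


def rerank(query, retrieved, max_candidates=20, top_k=5):
    """Rerank only the first max_candidates passages.

    `retrieved` is assumed to be sorted by first-stage score already.
    """
    candidates = retrieved[:max_candidates]
    ranked = sorted(candidates, key=lambda p: rerank_score(query, p), reverse=True)
    return ranked[:top_k]


retrieved = [
    "refund policy for annual plans",
    "annual plans billing cycle",
    "office parking rules",
]
first = rerank("annual plans refund", retrieved, max_candidates=2, top_k=1)
second = rerank("annual plans refund", retrieved, max_candidates=2, top_k=1)
print(first, rerank_score.cache_info())
```

The second call is served entirely from the cache, and the third passage is never scored because it falls outside the candidate cap. Whether these two changes are enough to bring P95 under 4s would need to be measured on the real pipeline.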