Content moderation — the throughput-vs-harm curve
Over-automate and a borderline post slips through as harm.
Human-in-the-Loop Approval
Context
A moderation agent processes flagged posts. High-harm content auto-removes and clearly-benign content auto-approves, but the borderline, context-dependent cases are where over-automation lets harm through.
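The three-way split described above can be sketched as a small routing function. This is a minimal illustration, not the agent's actual logic: the `harm_score` field, the two thresholds, and every name below are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    AUTO_REMOVE = "auto_remove"
    AUTO_APPROVE = "auto_approve"
    HUMAN_REVIEW = "human_review"


@dataclass(frozen=True)
class FlaggedPost:
    post_id: str
    harm_score: float  # hypothetical classifier output in [0, 1]


# Illustrative thresholds only; they are not taken from the source.
REMOVE_AT = 0.95
APPROVE_BELOW = 0.05


def route(post: FlaggedPost) -> Action:
    """Automate only the two clear tiers; everything in between goes to a human."""
    if post.harm_score >= REMOVE_AT:
        return Action.AUTO_REMOVE
    if post.harm_score < APPROVE_BELOW:
        return Action.AUTO_APPROVE
    return Action.HUMAN_REVIEW
```

The default branch is the human queue, so a post that matches neither clear tier is never decided by the agent alone.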
The decision
The dial trades throughput against harm. The ceiling is the highest autonomy level that still routes the borderline cases to a human: one step further buys a little more speed at the cost of a harm incident.
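One way to picture the ceiling is to model the dial as an ordered set of autonomy levels and pick the highest one that leaves the borderline tier with a human. The four levels and the tier names below are hypothetical, chosen only to make the idea concrete.

```python
from enum import Enum, IntEnum


class Tier(Enum):
    BENIGN = "clearly_benign"
    BORDERLINE = "borderline"
    HIGH_HARM = "high_harm"


class Autonomy(IntEnum):
    # Hypothetical dial positions, ordered from least to most automated.
    REVIEW_ALL = 0
    AUTO_APPROVE_BENIGN = 1
    AUTO_CLEAR_TIERS = 2  # benign auto-approves, high-harm auto-removes
    FULL_AUTO = 3         # the agent also decides borderline posts


# Tiers the agent decides without a human at each dial position.
AUTOMATED = {
    Autonomy.REVIEW_ALL: set(),
    Autonomy.AUTO_APPROVE_BENIGN: {Tier.BENIGN},
    Autonomy.AUTO_CLEAR_TIERS: {Tier.BENIGN, Tier.HIGH_HARM},
    Autonomy.FULL_AUTO: {Tier.BENIGN, Tier.HIGH_HARM, Tier.BORDERLINE},
}


def ceiling() -> Autonomy:
    """Highest dial position that still sends borderline posts to a human."""
    return max(
        level for level in Autonomy
        if Tier.BORDERLINE not in AUTOMATED[level]
    )
```

Under these assumed levels, `ceiling()` returns `Autonomy.AUTO_CLEAR_TIERS`; the next position up is the first one that automates the borderline tier.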
What most miss
Teams tune moderation autonomy for cost. But the harm is concentrated in the borderline tier, and automating that tier is where a moderation program earns its worst headlines.
Stakes
A single over-automated borderline case that should have been reviewed is a trust-and-safety incident.
Studied · Agent & Protocol · verified 2026-07-03
Sources: Trust-and-safety moderation autonomy patterns; Harm-severity-tiered review policy