Thresholds & Tolerance
Thresholds set minimum pass-rate requirements. When a model's pass rate falls below its threshold (minus any configured tolerance), cli-bench exits with code 1 by default, failing your CI pipeline.
Configuration
thresholds:
  default: 80 # All models must achieve >= 80%
  tolerance: 5 # Allow pass rates up to 5 points below the threshold
  behavior: error # "error" = exit 1, "informational" = warn only
  models: # Per-model overrides
    anthropic/claude-sonnet-4-20250514: 90
    openai/gpt-4o-mini: 70
Fields
| Field | Type | Default | Description |
|---|---|---|---|
| default | number (0-100) | — | Default threshold for all models |
| models | object | — | Per-model threshold overrides |
| tolerance | number (0-100) | 0 | Allowed percentage below threshold |
| behavior | string | error | error or informational |
How Thresholds Work
- After all tasks complete, each model's pass rate is calculated
- The threshold is looked up: per-model key first (e.g., anthropic/claude-sonnet-4-20250514), then default
- The effective minimum is max(0, threshold - tolerance)
- If the pass rate is below the effective minimum, it's a violation
- behavior: error causes exit code 1 (after results upload); behavior: informational prints a warning
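The check described in the steps above can be sketched in a few lines of Python. This is a minimal illustration of the documented rule, not cli-bench's actual implementation; the function names effective_minimum and is_violation are hypothetical.

```python
from typing import Optional


def effective_minimum(threshold: float, tolerance: float = 0.0) -> float:
    """Lowest pass rate (in percent) that still passes: max(0, threshold - tolerance)."""
    return max(0.0, threshold - tolerance)


def is_violation(pass_rate: float, threshold: Optional[float], tolerance: float = 0.0) -> bool:
    """Return True when pass_rate is strictly below the effective minimum."""
    if threshold is None:
        # No threshold configured for this model, so nothing is enforced.
        return False
    return pass_rate < effective_minimum(threshold, tolerance)


# Evaluate a few pass rates against threshold 80 with tolerance 5.
for rate in (82, 78, 75, 70):
    status = "FAIL" if is_violation(rate, threshold=80, tolerance=5) else "PASS"
    print(f"{rate}% -> {status}")
```

Because a violation requires the pass rate to be below the effective minimum, a rate exactly equal to it (75% here) is treated as passing in this sketch.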
Example
With default: 80 and tolerance: 5:
- Model achieves 82% → PASS (above threshold)
- Model achieves 78% → PASS (within tolerance: 80 - 5 = 75)
- Model achieves 70% → FAIL (below threshold minus tolerance)
Per-Model Thresholds
Set different bars for different models:
thresholds:
  default: 70
  models:
    anthropic/claude-sonnet-4-20250514: 90 # Primary model, high bar
    openai/gpt-4o: 85 # Secondary model
    openai/gpt-4o-mini: 60 # Budget model, lower bar
Models not listed use the default value. If no default is set and a model has no specific threshold, no threshold is enforced for that model.
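The lookup order can be illustrated with a short Python sketch. It assumes the thresholds section has already been parsed into a dictionary; the resolve_threshold helper and the model name some/other-model are hypothetical and only serve the example.

```python
from typing import Optional

# Assumed in-memory form of the thresholds section shown above.
thresholds = {
    "default": 70,
    "models": {
        "anthropic/claude-sonnet-4-20250514": 90,
        "openai/gpt-4o": 85,
        "openai/gpt-4o-mini": 60,
    },
}


def resolve_threshold(model: str, cfg: dict) -> Optional[float]:
    """Per-model override first, then the default, else None (not enforced)."""
    models = cfg.get("models") or {}
    if model in models:
        return models[model]
    return cfg.get("default")


print(resolve_threshold("openai/gpt-4o", thresholds))      # listed model: uses its override
print(resolve_threshold("some/other-model", thresholds))   # unlisted model: falls back to default
print(resolve_threshold("some/other-model", {"models": {}}))  # no default: None, nothing enforced
```

Returning None rather than 0 keeps "no threshold configured" distinct from "threshold of 0%", which matches the rule that such models are simply not checked.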
Behavior Modes
error (default)
Violations cause cli-bench to exit with code 1. Use in CI to block merges.
informational
Violations are logged as warnings but don't affect the exit code. Use during development.
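As a rough sketch, the two modes differ only in how violations map to the exit code. The Python function below is hypothetical and models just the threshold contribution to the exit code, not anything else cli-bench may report.

```python
def threshold_exit_code(violations: list, behavior: str = "error") -> int:
    """Exit code contributed by threshold checks under the given behavior mode."""
    if behavior not in ("error", "informational"):
        raise ValueError(f"unknown behavior: {behavior!r}")
    # Only "error" mode turns violations into a failing exit code;
    # "informational" mode leaves the exit code unaffected.
    if violations and behavior == "error":
        return 1
    return 0


violating_models = ["openai/gpt-4o-mini"]
print(threshold_exit_code(violating_models, "error"))
print(threshold_exit_code(violating_models, "informational"))
```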
CI Integration
# cli-bench.yaml
thresholds:
  default: 80
  tolerance: 5
  behavior: error
The CI job exits with code 1 if thresholds are violated, which fails the GitHub Actions step automatically. See GitHub Actions for the full CI setup.
Tips
- Start with behavior: informational until your tasks are stable
- Set tolerance to 5-10% for tasks with non-deterministic outcomes
- Use per-model thresholds when testing with models of different capabilities