Thresholds & Tolerance

Thresholds set minimum pass rate requirements. When a model's pass rate falls below its threshold (less any configured tolerance), cli-bench exits with code 1 by default, failing your CI pipeline.

Configuration

thresholds:
  default: 80              # All models must achieve >= 80%
  tolerance: 5             # Allow pass rates up to 5 points below the threshold
  behavior: error          # "error" = exit 1, "informational" = warn only
  models:                  # Per-model overrides
    anthropic/claude-sonnet-4.6: 90
    google/gemini-3-flash: 70

Fields

Field	Type	Default	Description
`default`	number (0-100)	-	Default threshold for all models
`models`	object	-	Per-model threshold overrides
`tolerance`	number (0-100)	`0`	Percentage points the pass rate may fall below the threshold before it counts as a violation
`behavior`	string	`error`	`error` or `informational`

How Thresholds Work

1. After all tasks complete, each model's pass rate is calculated.
2. The threshold is looked up: per-model key first (e.g., `anthropic/claude-sonnet-4.6`), then `default`.
3. The effective minimum is `max(0, threshold - tolerance)`.
4. If the pass rate is below the effective minimum, it's a violation.
5. `behavior: error` causes exit code 1 (after results upload); `behavior: informational` prints a warning.

Example

With default: 80 and tolerance: 5:

Model achieves 82% → PASS (above threshold)
Model achieves 78% → PASS (within tolerance: 80 - 5 = 75)
Model achieves 70% → FAIL (below the effective minimum of 75)
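
The arithmetic behind these three cases can be sketched in a few lines of Python. This is an illustration of the rule described under How Thresholds Work, not cli-bench's actual source; the function names `effective_minimum` and `is_violation` are hypothetical.

```python
def effective_minimum(threshold: float, tolerance: float = 0.0) -> float:
    """Lowest pass rate that still passes: max(0, threshold - tolerance)."""
    return max(0.0, threshold - tolerance)


def is_violation(pass_rate: float, threshold: float, tolerance: float = 0.0) -> bool:
    """A violation occurs only when the pass rate is strictly below the effective minimum."""
    return pass_rate < effective_minimum(threshold, tolerance)


# The three cases from the example: default: 80, tolerance: 5.
for rate in (82, 78, 70):
    status = "FAIL" if is_violation(rate, threshold=80, tolerance=5) else "PASS"
    print(f"{rate}% -> {status}")
```

Under this reading, a pass rate exactly equal to the effective minimum (75% here) passes, since only rates below it are violations.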

Per-Model Thresholds

Set different bars for different models:

thresholds:
  default: 70
  models:
    anthropic/claude-sonnet-4.6: 90   # Primary model, high bar
    openai/gpt-5.2: 85                # Frontier model
    google/gemini-3-flash: 60         # Budget model, lower bar

Models not listed use the default value. If no default is set and a model has no specific threshold, no threshold is enforced for that model.
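
The lookup order can be illustrated with a short Python sketch. It assumes the `thresholds` block has already been parsed into a dictionary; `resolve_threshold` is a hypothetical helper name, not part of cli-bench.

```python
from typing import Optional


def resolve_threshold(model: str, thresholds: dict) -> Optional[float]:
    """Return the per-model threshold if present, else the default, else None."""
    models = thresholds.get("models") or {}
    if model in models:
        return models[model]
    # None means no threshold is enforced for this model.
    return thresholds.get("default")


# Mirrors the YAML above, parsed into a plain dict.
config = {
    "default": 70,
    "models": {
        "anthropic/claude-sonnet-4.6": 90,
        "openai/gpt-5.2": 85,
        "google/gemini-3-flash": 60,
    },
}

print(resolve_threshold("openai/gpt-5.2", config))      # per-model override
print(resolve_threshold("some/unlisted-model", config)) # falls back to default
```

Returning `None` rather than `0` keeps "no threshold configured" distinct from "threshold of zero", which matches the statement that no threshold is enforced in that case.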

Behavior Modes

`error` (default)

Violations cause cli-bench to exit with code 1. Use in CI to block merges.

`informational`

Violations are logged as warnings but don't affect the exit code. Use during development.
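
As a rough sketch of how the two modes differ, the following Python maps a list of violations to an exit code. The function name `exit_code_for`, the warning format, and the rejection of unknown `behavior` values are assumptions for illustration; cli-bench's real output and validation may differ.

```python
import sys
from typing import Sequence


def exit_code_for(violations: Sequence[str], behavior: str = "error") -> int:
    """Map threshold violations to a process exit code for the given behavior mode."""
    if behavior not in ("error", "informational"):
        # Assumption: unknown modes are rejected rather than silently ignored.
        raise ValueError(f"unknown behavior: {behavior!r}")
    if not violations:
        return 0
    if behavior == "informational":
        # Warn only; the exit code is unaffected.
        for message in violations:
            print(f"warning: {message}", file=sys.stderr)
        return 0
    return 1


violations = ["google/gemini-3-flash: 55% is below the minimum of 65%"]
print(exit_code_for(violations, "error"))
print(exit_code_for(violations, "informational"))
```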

CI Integration

# cli-bench.yaml
thresholds:
  default: 80
  tolerance: 5
  behavior: error

The CI job exits with code 1 if thresholds are violated, which fails the GitHub Actions step automatically. See GitHub Actions for the full CI setup.

Tips

Start with behavior: informational until your tasks are stable
Set tolerance to 5-10% for tasks with non-deterministic outcomes
Use per-model thresholds when testing with models of different capabilities
