Thresholds & Tolerance

Thresholds set minimum pass-rate requirements for each model. When a model's pass rate falls below its threshold (less any configured tolerance), cli-bench exits with code 1 by default, which fails your CI pipeline.

Configuration

thresholds:
  default: 80        # All models must achieve >= 80%
  tolerance: 5       # Allow pass rates up to 5 points below the threshold
  behavior: error    # "error" = exit 1, "informational" = warn only
  models:            # Per-model overrides
    anthropic/claude-sonnet-4-20250514: 90
    openai/gpt-4o-mini: 70

Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| default | number (0-100) | | Default threshold for all models |
| models | object | | Per-model threshold overrides |
| tolerance | number (0-100) | 0 | Allowed percentage below threshold |
| behavior | string | error | error or informational |
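
The field rules above can be sketched as a small validation routine. This is an illustration only, not cli-bench's own code: the function name `normalize_thresholds` is hypothetical, it expects the `thresholds` block already parsed into a dictionary, and it assumes per-model values share the 0-100 range of `default`.

```python
from typing import Any

ALLOWED_BEHAVIORS = {"error", "informational"}


def normalize_thresholds(raw: dict[str, Any]) -> dict[str, Any]:
    """Validate a parsed `thresholds` mapping and fill in the documented defaults."""

    def check_percent(name: str, value: Any) -> None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{name} must be a number, got {value!r}")
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be between 0 and 100, got {value!r}")

    default = raw.get("default")  # optional: None means no global threshold
    if default is not None:
        check_percent("default", default)

    tolerance = raw.get("tolerance", 0)  # documented default: 0
    check_percent("tolerance", tolerance)

    behavior = raw.get("behavior", "error")  # documented default: error
    if behavior not in ALLOWED_BEHAVIORS:
        raise ValueError(f"behavior must be one of {sorted(ALLOWED_BEHAVIORS)}")

    models = raw.get("models") or {}
    if not isinstance(models, dict):
        raise ValueError("models must map model names to thresholds")
    for model, value in models.items():
        # Assumption: per-model thresholds use the same 0-100 range as default.
        check_percent(f"models.{model}", value)

    return {
        "default": default,
        "tolerance": tolerance,
        "behavior": behavior,
        "models": dict(models),
    }
```

Calling `normalize_thresholds({"default": 80})` returns a mapping with `tolerance` set to 0 and `behavior` set to `error`, matching the defaults in the table.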

How Thresholds Work

  1. After all tasks complete, each model's pass rate is calculated
  2. The threshold is looked up: per-model key first (e.g., anthropic/claude-sonnet-4-20250514), then default
  3. The effective minimum is max(0, threshold - tolerance)
  4. If the pass rate is below the effective minimum, it's a violation
  5. behavior: error causes exit code 1 (after results upload); behavior: informational prints a warning
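
Steps 3 and 4 reduce to a few lines of arithmetic. The following is a minimal sketch under the rules stated above; the function names are hypothetical and are not part of cli-bench.

```python
def effective_minimum(threshold: float, tolerance: float = 0) -> float:
    """Step 3: the lowest pass rate that still counts as passing, never below 0."""
    return max(0, threshold - tolerance)


def is_violation(pass_rate: float, threshold: float, tolerance: float = 0) -> bool:
    """Step 4: a violation occurs only when the pass rate is strictly below the minimum."""
    return pass_rate < effective_minimum(threshold, tolerance)
```

Because the comparison is strict, a pass rate exactly equal to the effective minimum is not a violation. The `max(0, ...)` clamp only matters when the tolerance exceeds the threshold, in which case no pass rate can fail.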

Example

With default: 80 and tolerance: 5:

  • Model achieves 82% → PASS (above threshold)
  • Model achieves 78% → PASS (within tolerance: the effective minimum is 80 - 5 = 75)
  • Model achieves 70% → FAIL (below the effective minimum of 75%)

Per-Model Thresholds

Set different bars for different models:

thresholds:
  default: 70
  models:
    anthropic/claude-sonnet-4-20250514: 90   # Primary model, high bar
    openai/gpt-4o: 85                        # Secondary model
    openai/gpt-4o-mini: 60                   # Budget model, lower bar

Models not listed use the default value. If no default is set and a model has no specific threshold, no threshold is enforced for that model.
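
The lookup order described here (per-model key first, then `default`, otherwise nothing) might look like the sketch below. `resolve_threshold` is a hypothetical helper, and `example/unlisted-model` is a made-up name used only to show the fallback.

```python
from typing import Optional


def resolve_threshold(model: str, thresholds: dict) -> Optional[float]:
    """Return the threshold that applies to `model`, or None if none is enforced."""
    models = thresholds.get("models") or {}
    if model in models:
        return models[model]          # per-model override wins
    return thresholds.get("default")  # None when no default is configured


config = {
    "default": 70,
    "models": {
        "anthropic/claude-sonnet-4-20250514": 90,
        "openai/gpt-4o": 85,
        "openai/gpt-4o-mini": 60,
    },
}

print(resolve_threshold("openai/gpt-4o", config))           # 85
print(resolve_threshold("example/unlisted-model", config))  # 70 (falls back to default)
print(resolve_threshold("example/unlisted-model", {}))      # None (nothing enforced)
```

Returning `None` instead of 0 keeps "no threshold enforced" distinct from "a threshold of 0", even though both let every pass rate through.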

Behavior Modes

error (default)

Violations cause cli-bench to exit with code 1. Use in CI to block merges.

informational

Violations are logged as warnings but don't affect the exit code. Use during development.
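
The difference between the two modes can be summarized as a mapping from violations to an exit code. This is a rough sketch, not cli-bench's implementation: the function name, the warning format, and the choice of stderr are all assumptions.

```python
import sys


def threshold_exit_code(violations: list[str], behavior: str = "error") -> int:
    """Return the exit code contributed by threshold checks alone."""
    if not violations:
        return 0
    if behavior == "informational":
        # Warn only; violations do not change the exit code.
        for message in violations:
            print(f"warning: {message}", file=sys.stderr)
        return 0
    # behavior == "error": any violation fails the run.
    return 1
```

For example, `threshold_exit_code(["openai/gpt-4o-mini below threshold"], "informational")` returns 0 and prints a warning, while the same call with `"error"` returns 1.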

CI Integration

# cli-bench.yaml
thresholds:
  default: 80
  tolerance: 5
  behavior: error

With behavior: error, cli-bench exits with code 1 when any threshold is violated, which fails the GitHub Actions step automatically. See GitHub Actions for the full CI setup.

Tips

  • Start with behavior: informational until your tasks are stable
  • Set tolerance to 5-10 percentage points for tasks with non-deterministic outcomes
  • Use per-model thresholds when testing with models of different capabilities