# cli-bench.yaml Reference

Complete configuration reference for your benchmark suite.
## Full Schema

```yaml
# Required: The CLI command to benchmark
cli: string

# Optional: Command to get CLI version (default: "<cli> --version")
version_command: string

# Optional: LLM providers to test with (default: anthropic/claude-sonnet-4-20250514)
# Format: provider/model-id; any Vercel AI Gateway model works
providers:
  - anthropic/claude-sonnet-4-20250514
  - openai/gpt-4o

# Optional: How the LLM learns about your CLI (default: ["injected"])
# See "Help Modes" section below for details
help_modes:
  - injected
  - discoverable

# Optional: Max parallel API calls (default: 3)
concurrency: 3

# Optional: Working directory for tasks (default: temp directory)
workdir: string

# Optional: Upload behavior (default: auto)
# auto   = upload if CLIWATCH_API_KEY is set
# always = always attempt upload (errors are logged, not fatal)
# never  = skip upload
upload: auto | always | never

# Optional: Run all tasks N times (default: 1, range: 1-100)
repeat: 1

# Optional: Pass rate thresholds
thresholds:
  default: 80
  models:
    anthropic/claude-sonnet-4-20250514: 90
    openai/gpt-4o-mini: 70
  tolerance: 5
  behavior: error | informational

# Required: At least one task
tasks:
  - id: string            # Required: unique task identifier
    intent: string        # Required: what the LLM should do
    assert:               # Required: at least one assertion
      - exit_code: 0
    setup:                # Optional: shell commands to run before each task
      - "mkdir -p workspace"
    max_turns: 5          # Optional: max LLM rounds (1-20)
    difficulty: easy      # Optional: easy | medium | hard
    category: string      # Optional: grouping label
    repeat: 1             # Optional: run N times (1-100)

  # External task file reference (supports globs)
  - file://tasks/basics.yaml
  - file://tasks/advanced/*.yaml
```
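The `thresholds` fields interact as follows: each model's pass rate is compared against its own entry under `models` (falling back to `default`), with `tolerance` granting a small grace margin and `behavior` deciding whether a miss fails the run or is merely reported. A minimal sketch of that logic in Python, mirroring the field names above (the function is illustrative, not the tool's actual implementation):

```python
def evaluate_threshold(model, pass_rate, thresholds):
    """Sketch: check a model's pass rate against the thresholds block."""
    # Per-model override falls back to the default threshold
    target = thresholds.get("models", {}).get(model, thresholds["default"])
    tolerance = thresholds.get("tolerance", 0)
    if pass_rate >= target:
        return "pass"
    if pass_rate >= target - tolerance:
        return "pass (within tolerance)"
    # behavior 'error' fails the run; 'informational' only reports
    return "fail" if thresholds.get("behavior", "error") == "error" else "warn"

thresholds = {
    "default": 80,
    "models": {
        "anthropic/claude-sonnet-4-20250514": 90,
        "openai/gpt-4o-mini": 70,
    },
    "tolerance": 5,
    "behavior": "error",
}

print(evaluate_threshold("openai/gpt-4o", 78, thresholds))   # no override, default 80
print(evaluate_threshold("openai/gpt-4o-mini", 60, thresholds))
```

Here `openai/gpt-4o` has no override, so 78 is checked against the default of 80 and lands inside the 5-point tolerance.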
## Assertion Types

See the full Assertions Reference for details on all 10 types.

```yaml
assert:
  - exit_code: 0
  - output_contains: "text"
  - output_equals: "exact text\n"
  - error_contains: "warning"
  - file_exists: "path/to/file"
  - file_contains:
      path: "file.txt"
      text: "expected content"
  - ran: "docker build.*-t"
  - not_ran: "rm -rf"
  - run_count:
      pattern: "curl"
      min: 2
      max: 5
  - verify:
      run: "cat output.txt"
      output_contains: "expected"
```
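The command-matching assertions (`ran`, `not_ran`, `run_count`) take regular-expression patterns. As a sketch of the semantics, assuming each pattern is applied as an unanchored regex search over the commands the agent executed (an assumption about the matching engine, not documented behavior):

```python
import re

# Hypothetical transcript of commands the agent ran during a task
commands = [
    "docker pull nginx:latest",
    "docker build -t myapp .",
]

def ran(pattern, commands):
    """True if any executed command matches the pattern."""
    return any(re.search(pattern, cmd) for cmd in commands)

def run_count(pattern, commands):
    """Number of executed commands matching the pattern."""
    return sum(1 for cmd in commands if re.search(pattern, cmd))

print(ran(r"docker build.*-t", commands))   # True  -> 'ran' passes
print(ran(r"rm -rf", commands))             # False -> 'not_ran' passes
print(run_count(r"docker", commands))       # 2
```

Under this reading, `not_ran` is simply the negation of `ran`, and `run_count` passes when the count falls within the configured `min`/`max` bounds.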
## Minimal Example

```yaml
cli: mycli
tasks:
  - id: show-help
    intent: "Show the help for mycli"
    assert:
      - exit_code: 0
```
## Full Example

```yaml
cli: docker
version_command: "docker --version"

providers:
  - anthropic/claude-sonnet-4-20250514
  - openai/gpt-4o

help_modes:
  - injected
  - discoverable

concurrency: 2
upload: auto

thresholds:
  default: 80
  tolerance: 5
  behavior: error

tasks:
  - id: show-help
    intent: "Show the Docker help information"
    difficulty: easy
    category: discovery
    assert:
      - exit_code: 0
      - output_contains: "Usage"

  - id: pull-image
    intent: "Pull the official nginx image"
    difficulty: easy
    category: images
    assert:
      - exit_code: 0
      - ran: "docker pull.*nginx"

  - id: run-container
    intent: "Run an nginx container in detached mode, mapping port 8080 to 80"
    difficulty: medium
    category: containers
    assert:
      - exit_code: 0
      - ran: "docker run.*-d.*-p.*8080:80.*nginx"

  - file://tasks/advanced-docker.yaml
```
## Help Modes

Help modes control how the LLM learns about your CLI before attempting the task. This is one of the most important benchmarking dimensions: it tests whether an agent can use your CLI with different levels of prior knowledge.

| Mode | What the LLM sees | Agent behavior |
|---|---|---|
| `injected` | `--help` output included in the prompt | Reads the help text, then runs commands. Fastest: no discovery turns needed. |
| `discoverable` | Nothing; it must explore on its own | Runs `<cli> --help` and `<cli> <subcommand> --help` to learn commands and flags before attempting the task. Tests real-world agent behavior. |
| `none` | Nothing; it cannot run `--help` | Relies entirely on training knowledge. Tests whether the model already "knows" your CLI. |

```yaml
# Test all three modes (generates results for each)
help_modes:
  - injected
  - discoverable
  - none
```

Each help mode produces separate results in the benchmark matrix. If you configure 2 models and 3 help modes, you get 6 result rows.

Recommendation: start with `injected` for fast iteration, then add `discoverable` to test real-world agent behavior. Use `none` only if you want to measure baseline model knowledge.

For more details, see Help Modes.
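The result matrix described above is simply the Cartesian product of providers and help modes. A quick illustration of how the row count grows:

```python
from itertools import product

providers = ["anthropic/claude-sonnet-4-20250514", "openai/gpt-4o"]
help_modes = ["injected", "discoverable", "none"]

# One result row per (model, help mode) combination
rows = list(product(providers, help_modes))
print(len(rows))  # 2 models x 3 help modes = 6 result rows
for model, mode in rows:
    print(f"{model} / {mode}")
```

Adding a third help mode or another provider multiplies the matrix accordingly, so trim both lists during early iteration to keep runs fast.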
## External Task Files

Task files referenced with `file://` contain a plain array of tasks. Glob patterns are supported:

```yaml
tasks:
  - file://tasks/basics.yaml        # single file
  - file://tasks/advanced/*.yaml    # glob pattern
```

```yaml
# tasks/advanced-docker.yaml
- id: multi-stage-build
  intent: "Create a Dockerfile with a multi-stage build"
  difficulty: hard
  assert:
    - file_exists: "Dockerfile"
    - file_contains:
        path: "Dockerfile"
        text: "FROM"
```

Duplicate task IDs across files are silently deduplicated (first occurrence wins).
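The first-occurrence-wins dedup can be sketched as follows, assuming task files have already been loaded into lists of dicts in reference order (the task lists here are hypothetical):

```python
def dedup_tasks(*task_lists):
    """Merge task lists; the first occurrence of each task id wins."""
    seen, merged = set(), []
    for tasks in task_lists:
        for task in tasks:
            if task["id"] not in seen:
                seen.add(task["id"])
                merged.append(task)
    return merged

basics = [{"id": "show-help"}, {"id": "pull-image"}]
advanced = [{"id": "pull-image"}, {"id": "multi-stage-build"}]  # duplicate id

merged = dedup_tasks(basics, advanced)
print([t["id"] for t in merged])
```

Because resolution follows reference order, a task defined inline in `cli-bench.yaml` shadows a same-id task in a later `file://` reference.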
## Setup Commands

The `setup` array runs shell commands before each task execution, in the task's working directory:

```yaml
tasks:
  - id: modify-config
    intent: "Update the database URL in config.yaml"
    setup:
      - "mkdir -p /tmp/workspace"
      - "echo 'db_url: localhost' > /tmp/workspace/config.yaml"
    assert:
      - file_contains:
          path: "/tmp/workspace/config.yaml"
          text: "db_url:"
```

Use `setup` to create files, directories, or preconditions the task depends on. Setup failures are non-fatal: a warning is logged but the task still runs.
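The non-fatal behavior can be sketched like this, assuming each setup command runs through the shell in the task's working directory (an illustration of the described semantics, not the tool's code):

```python
import subprocess
import tempfile

def run_setup(commands, workdir):
    """Run setup commands; log warnings on failure instead of raising."""
    failures = []
    for cmd in commands:
        result = subprocess.run(cmd, shell=True, cwd=workdir,
                                capture_output=True, text=True)
        if result.returncode != 0:
            # Non-fatal: warn and keep going so the task still runs
            print(f"warning: setup command failed ({result.returncode}): {cmd}")
            failures.append(cmd)
    return failures

with tempfile.TemporaryDirectory() as wd:
    # 'false' always exits nonzero, so it produces a warning but no exception
    run_setup(["mkdir -p workspace", "false"], wd)
```

Keep setup commands idempotent (e.g. `mkdir -p` rather than `mkdir`) since they re-run before every repetition of the task.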
## CLI Flags

CLI flags override values from `cli-bench.yaml`:

| Flag | Description |
|---|---|
| `--config PATH` | Path to config file (default: auto-discover `cli-bench.yaml`) |
| `--filter CLIS` | Comma-separated CLI names to run (default: all) |
| `--models MODELS` | Comma-separated model IDs (overrides `providers` in config) |
| `--help-modes MODES` | Help modes: `injected,discoverable,none` (default: `injected`) |
| `--concurrency N` | Max parallel API calls (default: 3) |
| `--workdir DIR` | Working directory for commands |
| `--repeat N` | Run each task N times (default: 1) |
| `--output FILE` | Write JSON report to file |
| `--upload` | Force upload results to CLIWatch |
| `--dry-run` | Print the prompt for the first task without calling the LLM |

The config file also accepts the `.yml` extension (`cli-bench.yml`).
## Notes

- All file paths in assertions are relative to the task working directory
- The `cli` value is used as the primary command available to the LLM
- Provider model IDs use the format `provider/model-id`
- Task IDs must be unique across the entire config, including external files
- Use `cliwatch validate` to check your config before running