cli-bench.yaml Reference

Complete configuration reference for your benchmark suite.

Full Schema

# Required: The CLI command to benchmark
cli: string

# Optional: Command to get CLI version (default: "<cli> --version")
version_command: string

# Optional: LLM providers to test with (default: anthropic/claude-sonnet-4-20250514)
# Format: provider/model-id — any Vercel AI Gateway model works
providers:
- anthropic/claude-sonnet-4-20250514
- openai/gpt-4o

# Optional: How the LLM learns about your CLI (default: ["injected"])
# See "Help Modes" section below for details
help_modes:
- injected
- discoverable

# Optional: Max parallel API calls (default: 3)
concurrency: 3

# Optional: Working directory for tasks (default: temp directory)
workdir: string

# Optional: Upload behavior (default: auto)
# auto = upload if CLIWATCH_API_KEY is set
# always = always attempt upload (errors are logged, not fatal)
# never = skip upload
upload: auto | always | never

# Optional: Run all tasks N times (default: 1, range: 1-100)
repeat: 1

# Optional: Pass rate thresholds
thresholds:
  default: 80
  models:
    anthropic/claude-sonnet-4-20250514: 90
    openai/gpt-4o-mini: 70
  tolerance: 5
  behavior: error | informational

# Required: At least one task
tasks:
  - id: string            # Required: unique task identifier
    intent: string        # Required: what the LLM should do
    assert:               # Required: at least one assertion
      - exit_code: 0
    setup:                # Optional: shell commands to run before each task
      - "mkdir -p workspace"
    max_turns: 5          # Optional: max LLM rounds (1-20)
    difficulty: easy      # Optional: easy | medium | hard
    category: string      # Optional: grouping label
    repeat: 1             # Optional: run N times (1-100)

  # External task file references (supports globs)
  - file://tasks/basics.yaml
  - file://tasks/advanced/*.yaml
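As a concrete sketch of the thresholds block, the numbers below are illustrative, and the semantics of tolerance and behavior are assumptions inferred from the schema (a tolerance as a grace margin around the threshold; informational reporting misses without failing the run):

```yaml
thresholds:
  default: 80                                # suite-wide pass-rate floor (%)
  models:
    anthropic/claude-sonnet-4-20250514: 90   # hold the strongest model to a higher bar
    openai/gpt-4o-mini: 70                   # give a cheaper model more slack
  tolerance: 5                               # assumed: grace margin around each threshold
  behavior: informational                    # assumed: log misses instead of erroring
```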

Assertion Types

See the full Assertions Reference for details on all 10 types.

assert:
  - exit_code: 0
  - output_contains: "text"
  - output_equals: "exact text\n"
  - error_contains: "warning"
  - file_exists: "path/to/file"
  - file_contains:
      path: "file.txt"
      text: "expected content"
  - ran: "docker build.*-t"
  - not_ran: "rm -rf"
  - run_count:
      pattern: "curl"
      min: 2
      max: 5
  - verify:
      run: "cat output.txt"
      output_contains: "expected"
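Assertion types compose within a single task; here is a hedged sketch mixing exit-code, file, command-history, and post-hoc checks (the id, intent, and expected strings are illustrative):

```yaml
tasks:
  - id: build-artifact
    intent: "Build the project and write the result to out.txt"
    assert:
      - exit_code: 0              # the final command succeeded
      - file_exists: "out.txt"    # the artifact was created
      - not_ran: "rm -rf"         # no destructive commands along the way
      - verify:
          run: "cat out.txt"      # post-hoc check of the artifact's contents
          output_contains: "ok"
```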

Minimal Example

cli: mycli

tasks:
  - id: show-help
    intent: "Show the help for mycli"
    assert:
      - exit_code: 0

Full Example

cli: docker
version_command: "docker --version"

providers:
  - anthropic/claude-sonnet-4-20250514
  - openai/gpt-4o

help_modes:
  - injected
  - discoverable

concurrency: 2
upload: auto

thresholds:
  default: 80
  tolerance: 5
  behavior: error

tasks:
  - id: show-help
    intent: "Show the Docker help information"
    difficulty: easy
    category: discovery
    assert:
      - exit_code: 0
      - output_contains: "Usage"

  - id: pull-image
    intent: "Pull the official nginx image"
    difficulty: easy
    category: images
    assert:
      - exit_code: 0
      - ran: "docker pull.*nginx"

  - id: run-container
    intent: "Run an nginx container in detached mode, mapping port 8080 to 80"
    difficulty: medium
    category: containers
    assert:
      - exit_code: 0
      - ran: "docker run.*-d.*-p.*8080:80.*nginx"

  - file://tasks/advanced-docker.yaml

Help Modes

Help modes control how the LLM learns about your CLI before attempting the task. This is one of the most important benchmarking dimensions — it tests whether an agent can use your CLI with different levels of prior knowledge.

| Mode | What the LLM sees | Agent behavior |
| --- | --- | --- |
| injected | --help output included in the prompt | Reads the help text, then runs commands. Fastest — no discovery turns needed. |
| discoverable | Nothing — must explore on its own | Runs <cli> --help and <cli> <subcommand> --help to learn commands and flags before attempting the task. Tests real-world agent behavior. |
| none | Nothing — cannot run --help | Relies entirely on training knowledge. Tests whether the model already "knows" your CLI. |

# Test all three modes (generates results for each)
help_modes:
  - injected
  - discoverable
  - none

Each help mode produces separate results in the benchmark matrix. If you configure 2 models and 3 help modes, you get 6 result rows.
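For example, a config that yields that six-row matrix (2 providers x 3 help modes):

```yaml
providers:
  - anthropic/claude-sonnet-4-20250514
  - openai/gpt-4o

help_modes:
  - injected
  - discoverable
  - none

# 2 providers x 3 help modes = 6 result rows per task
```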

Recommendation: Start with injected for fast iteration, then add discoverable to test real-world agent behavior. Use none only if you want to measure baseline model knowledge.

For more details, see Help Modes.

External Task Files

Task files referenced with file:// contain a plain array of tasks. Glob patterns are supported:

tasks:
  - file://tasks/basics.yaml        # single file
  - file://tasks/advanced/*.yaml    # glob pattern

# tasks/advanced-docker.yaml (the referenced file: a bare task array)
- id: multi-stage-build
  intent: "Create a Dockerfile with a multi-stage build"
  difficulty: hard
  assert:
    - file_exists: "Dockerfile"
    - file_contains:
        path: "Dockerfile"
        text: "FROM"

Duplicate task IDs across files are silently deduplicated (first occurrence wins).

Setup Commands

The setup array runs shell commands before each task execution, in the task's working directory:

tasks:
  - id: modify-config
    intent: "Update the database URL in config.yaml"
    setup:
      - "mkdir -p /tmp/workspace"
      - "echo 'db_url: localhost' > /tmp/workspace/config.yaml"
    assert:
      - file_contains:
          path: "/tmp/workspace/config.yaml"
          text: "db_url:"

Use setup to create files, directories, or preconditions the task depends on. Setup failures are non-fatal — a warning is logged but the task still runs.
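For instance, setup can seed a small fixture tree for a search-style task; in this sketch the file names, contents, and expected output are all illustrative:

```yaml
tasks:
  - id: find-todo
    intent: "Find the file under src/ that contains the TODO marker"
    setup:
      - "mkdir -p src"
      - "echo 'fn main() {}' > src/main.rs"
      - "echo '// TODO: fix this' > src/lib.rs"
    assert:
      - exit_code: 0
      - output_contains: "src/lib.rs"
```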

CLI Flags

CLI flags override values from cli-bench.yaml:

| Flag | Description |
| --- | --- |
| --config PATH | Path to config file (default: auto-discover cli-bench.yaml) |
| --filter CLIS | Comma-separated CLI names to run (default: all) |
| --models MODELS | Comma-separated model IDs (overrides providers in config) |
| --help-modes MODES | Help modes: injected,discoverable,none (default: injected) |
| --concurrency N | Max parallel API calls (default: 3) |
| --workdir DIR | Working directory for commands |
| --repeat N | Run each task N times (default: 1) |
| --output FILE | Write JSON report to file |
| --upload | Force upload results to CLIWatch |
| --dry-run | Print prompt for first task without calling the LLM |

The config file also accepts the .yml extension (cli-bench.yml).

Notes

  • All file paths in assertions are relative to the task working directory
  • The cli value is used as the primary command available to the LLM
  • Provider model IDs use the format provider/model-id
  • Task IDs must be unique across the entire config including external files
  • Use cliwatch validate to check your config before running