cli-bench.yaml Reference

Complete configuration reference for your benchmark suite.

Full Schema

# Required: The CLI command to benchmark
cli: string

# Optional: Command to get CLI version (skipped if not set)
version_command: string

# Optional: Human-readable name for the CLI (e.g. "Fly.io")
display_name: string

# Optional: Grouping label (e.g. "Cloud", "Developer Tools")
category: string

# Optional: CLI's website URL
website_url: string

# Optional: CLI's GitHub repo URL
github_url: string

# Optional: LLM providers to test with
# Format: provider/model-id (any Vercel AI Gateway model works)
providers:
  - anthropic/claude-sonnet-4.6
  - google/gemini-3-flash

# Optional: Max parallel API calls (default: 3)
concurrency: 3

# Optional: Working directory for tasks (default: temp directory)
workdir: string

# Optional: Upload behavior (default: auto)
# auto = upload if CLIWATCH_API_KEY is set
# always = always attempt upload (errors are logged, not fatal)
# never = skip upload
upload: auto | always | never

# Optional: Run all tasks N times (default: 1, range: 1-100)
repeat: 1

# Optional: Custom system prompt appended to the default agent prompt
# Use this to give the LLM extra context about your CLI or environment
system_prompt: |
  This CLI requires authentication. Run 'mycli auth login --token test' first.
  All commands output JSON by default.

# Optional: Env var names to redact from uploaded traces
redact_env:
  - SECRET_KEY
  - API_TOKEN

# Optional: Regex patterns to redact from uploaded traces
redact_patterns:
  - "sk_test_[a-zA-Z0-9]+"
  - "Bearer [a-zA-Z0-9._-]+"

# Optional: Environment variables set for every task
# Merged with task-level env (task wins on conflict)
# Supports {{workdir}} template variable
env:
  DATABASE_URL: "postgres://localhost/test"
  DEBUG: "true"

# Optional: Shell commands run before each task's own setup
setup:
  - "docker compose up -d"

# Optional: Directory copied into each task's workdir before setup
# Path is relative to cli-bench.yaml
scaffold: scaffolds/starter

# Optional: Shell commands run after all tasks complete (even on failure)
# Config-level cleanup runs after task-level cleanup
cleanup:
  - "docker compose down"

# Optional: Pass rate thresholds
thresholds:
  default: 80
  models:
    anthropic/claude-sonnet-4.6: 90
    google/gemini-3-flash: 70
  tolerance: 5
  behavior: error | informational

# Required: At least one task
tasks:
  - id: string # Required: unique task identifier
    intent: string # Required: what the LLM should do
    assert: # Required: at least one assertion
      - exit_code: 0
    setup: # Optional: shell commands to run before each task
      - "mkdir -p workspace"
    env: # Optional: env vars for this task (merges with config-level)
      APP_PORT: "3000"
    cleanup: # Optional: commands run after task (even on failure)
      - "rm -rf /tmp/workspace"
    scaffold: scaffolds/alt # Optional: override config-level scaffold (false to disable)
    tags: # Optional: string labels for filtering with --tags
      - smoke
      - core
    max_turns: 5 # Optional: max LLM rounds (1-20)
    difficulty: easy # Optional: easy | medium | hard
    category: string # Optional: grouping label
    repeat: 1 # Optional: run N times (1-100)

  # External task file reference (supports globs)
  - file://tasks/basics.yaml
  - file://tasks/advanced/*.yaml
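
The redact_env and redact_patterns fields in the schema above describe what gets masked in uploaded traces. The following is a rough Python sketch of that idea, not the tool's actual implementation. The function name redact and the "[REDACTED]" placeholder are hypothetical, and the sketch assumes the patterns behave like ordinary unanchored regular expressions.

```python
import re

# Hypothetical marker: the real replacement text is not documented here.
PLACEHOLDER = "[REDACTED]"


def redact(text, env, redact_env, redact_patterns):
    """Mask the values of the named env vars, then any regex matches."""
    for name in redact_env:
        value = env.get(name)
        if value:  # skip variables that are unset or empty
            text = text.replace(value, PLACEHOLDER)
    for pattern in redact_patterns:
        text = re.sub(pattern, PLACEHOLDER, text)
    return text


trace = "key=hunter2 auth=Bearer abc.def-1 tok=sk_test_AbC123"
clean = redact(
    trace,
    env={"SECRET_KEY": "hunter2"},
    redact_env=["SECRET_KEY", "API_TOKEN"],
    redact_patterns=["sk_test_[a-zA-Z0-9]+", "Bearer [a-zA-Z0-9._-]+"],
)
print(clean)
```

Under these assumptions, both the literal value of SECRET_KEY and the two token shapes are masked before anything leaves the machine. API_TOKEN is unset in the example, so it is skipped.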

Assertion Types

See the full Assertions Reference for details on all 10 types.

assert:
  - exit_code: 0
  - output_contains: "text"
  - output_equals: "exact text\n"
  - error_contains: "warning"
  - file_exists: "path/to/file"
  - file_contains:
      path: "file.txt"
      text: "expected content"
  - ran: "docker build.*-t"
  - not_ran: "rm -rf"
  - run_count:
      pattern: "curl"
      min: 2
      max: 5
  - verify:
      run: "cat output.txt"
      output_contains: "expected"
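
To make the command-matching assertions above more concrete, here is a small Python sketch of how ran, not_ran, and run_count could be evaluated against the list of commands an agent executed. It assumes the patterns are unanchored regular expressions and that min and max are inclusive bounds. Both are assumptions, and the Assertions Reference remains the authority on the real behavior.

```python
import re


def ran(commands, pattern):
    """True if at least one executed command matches the pattern."""
    return any(re.search(pattern, cmd) for cmd in commands)


def not_ran(commands, pattern):
    """True if no executed command matches the pattern."""
    return not ran(commands, pattern)


def run_count(commands, pattern, minimum=None, maximum=None):
    """True if the number of matching commands lies within the bounds."""
    count = sum(1 for cmd in commands if re.search(pattern, cmd))
    if minimum is not None and count < minimum:
        return False
    if maximum is not None and count > maximum:
        return False
    return True


# Hypothetical command history for one task run.
history = ["fly status --app my-api", "curl -s a.test", "curl -s b.test"]
print(ran(history, "fly status.*--app.*my-api"))
print(not_ran(history, "rm -rf"))
print(run_count(history, "curl", minimum=2, maximum=5))
```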

Minimal Example

cli: mycli
tasks:
  - id: show-help
    intent: "Show the help for mycli"
    assert:
      - exit_code: 0

Full Example

cli: fly
display_name: "Fly.io"
category: "Cloud"
website_url: "https://fly.io"
github_url: "https://github.com/superfly/flyctl"
version_command: "fly version"

providers:
  - anthropic/claude-sonnet-4.6
  - google/gemini-3-flash

concurrency: 2
upload: auto

system_prompt: |
  The Fly CLI is authenticated. You can run commands directly.
  Use `fly` (not `flyctl`) as the CLI command.

redact_env:
  - FLY_API_TOKEN

thresholds:
  default: 80
  tolerance: 5
  behavior: error

tasks:
  - id: list-apps
    intent: "List all deployed Fly applications"
    difficulty: easy
    category: apps
    assert:
      - exit_code: 0
      - ran: "fly apps list"

  - id: check-status
    intent: "Show the status of the app named 'my-api'"
    difficulty: easy
    category: apps
    assert:
      - exit_code: 0
      - ran: "fly status.*--app.*my-api"

  - id: scale-memory
    intent: "Scale the 'my-api' app to 512MB memory"
    difficulty: medium
    category: scaling
    assert:
      - exit_code: 0
      - ran: "fly scale memory 512.*--app.*my-api"

  - file://tasks/deploy.yaml

Scaffold

The scaffold field copies a directory into each task's working directory before setup commands run. This is useful for tasks that need pre-existing files (config files, source code, project structure).

# Config-level: applies to all tasks
scaffold: scaffolds/starter

tasks:
  - id: modify-config
    intent: "Update the database URL in config.yaml"
    # Uses the config-level scaffold (scaffolds/starter/)
    assert:
      - file_contains:
          path: "config.yaml"
          text: "db_url:"

  - id: init-project
    intent: "Initialize a new project from scratch"
    scaffold: false # Disable scaffolding for this task
    assert:
      - exit_code: 0

  - id: migrate-legacy
    intent: "Migrate the legacy config format"
    scaffold: scaffolds/legacy # Override with a different scaffold
    assert:
      - exit_code: 0

Scaffold paths are relative to the cli-bench.yaml file.
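
The inheritance rules above (inherit the config-level scaffold, override it, or disable it with false) can be sketched in a few lines of Python. This is an illustration of the documented behavior only. The function name resolve_scaffold and the dict-based task shape are hypothetical.

```python
from pathlib import Path
from typing import Optional


def resolve_scaffold(config_dir: Path, config_scaffold: Optional[str], task: dict) -> Optional[Path]:
    """Return the scaffold directory for a task, or None if scaffolding is off."""
    # A task-level value wins; a missing key inherits the config-level value.
    value = task.get("scaffold", config_scaffold)
    if value is False or value is None:
        return None
    # Paths are resolved relative to the directory containing cli-bench.yaml.
    return config_dir / value


base = Path("/repo")  # hypothetical location of cli-bench.yaml
print(resolve_scaffold(base, "scaffolds/starter", {"id": "modify-config"}))
print(resolve_scaffold(base, "scaffolds/starter", {"id": "init-project", "scaffold": False}))
print(resolve_scaffold(base, "scaffolds/starter", {"id": "migrate-legacy", "scaffold": "scaffolds/legacy"}))
```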

Environment Variables

The env field sets environment variables for task execution. Config-level env applies to all tasks; task-level env merges on top (task values win on conflict).

The {{workdir}} template variable resolves to the task's working directory.

env:
  DATABASE_URL: "postgres://localhost/test"
  CONFIG_DIR: "{{workdir}}/config"

tasks:
  - id: custom-port
    intent: "Start the server on a custom port"
    env:
      APP_PORT: "8080"
    assert:
      - exit_code: 0
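
The merge rule described above can be modeled as a plain dictionary merge followed by template substitution. The Python sketch below is illustrative rather than the tool's real code. The function name build_task_env is hypothetical, and applying {{workdir}} substitution to task-level values as well as config-level ones is an assumption.

```python
def build_task_env(config_env, task_env, workdir):
    """Merge config-level and task-level env, then expand {{workdir}}."""
    # Later keys win, so task-level values override config-level ones.
    merged = {**config_env, **task_env}
    return {key: value.replace("{{workdir}}", workdir) for key, value in merged.items()}


env = build_task_env(
    config_env={
        "DATABASE_URL": "postgres://localhost/test",
        "CONFIG_DIR": "{{workdir}}/config",
        "DEBUG": "true",
    },
    task_env={"APP_PORT": "8080", "DEBUG": "false"},
    workdir="/tmp/task-1",  # hypothetical working directory
)
print(env)
```

Here DEBUG ends up as "false" because the task-level value wins, and CONFIG_DIR expands to a path inside the task's working directory.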

Cleanup

The cleanup field runs shell commands after task execution, even if the task fails. Task-level cleanup runs first, then config-level cleanup.

cleanup:
  - "docker compose down"

tasks:
  - id: test-with-db
    intent: "Run queries against the test database"
    setup:
      - "docker compose up -d postgres"
    cleanup:
      - "docker compose stop postgres"
    assert:
      - exit_code: 0

Tags

Tag tasks with string labels, then filter with --tags on the command line.

tasks:
  - id: show-help
    intent: "Show the help output"
    tags: [smoke, core]
    assert:
      - exit_code: 0

  - id: deploy-app
    intent: "Deploy the application"
    tags: [integration]
    assert:
      - exit_code: 0

# Run only smoke-tagged tasks
cli-bench --tags smoke

# Run multiple tag groups
cli-bench --tags smoke,integration

A task matches if it has any of the specified tags. Tasks without tags are skipped when --tags is used.
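
The matching rule can be expressed as a set intersection. The following Python sketch models it under the assumption that tasks are plain dicts. The function name select_tasks is hypothetical and this is not the tool's own code.

```python
def select_tasks(tasks, requested_tags=None):
    """Keep tasks sharing at least one tag with the requested tags."""
    if not requested_tags:
        return list(tasks)  # no --tags flag: run everything
    wanted = set(requested_tags)
    # Untagged tasks have an empty tag set, so they never intersect.
    return [task for task in tasks if wanted & set(task.get("tags", []))]


tasks = [
    {"id": "show-help", "tags": ["smoke", "core"]},
    {"id": "deploy-app", "tags": ["integration"]},
    {"id": "untagged-task"},  # hypothetical task with no tags
]
print([t["id"] for t in select_tasks(tasks, ["smoke"])])
print([t["id"] for t in select_tasks(tasks, ["smoke", "integration"])])
```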

External Task Files

Task files referenced with file:// contain a plain array of tasks. Glob patterns are supported:

tasks:
  - file://tasks/basics.yaml # single file
  - file://tasks/advanced/*.yaml # glob pattern

# tasks/deploy.yaml
- id: deploy-app
  intent: "Deploy the current directory as a Fly app"
  difficulty: hard
  assert:
    - exit_code: 0
    - ran: "fly deploy"

Duplicate task IDs across files are silently deduplicated (first occurrence wins).
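
First-occurrence-wins deduplication can be sketched as follows. The function name dedupe_tasks and the source field are hypothetical, and the sketch assumes tasks arrive already flattened in load order. The order in which glob matches are loaded is not specified here.

```python
def dedupe_tasks(tasks):
    """Drop tasks whose id was already seen, keeping the first occurrence."""
    seen = set()
    unique = []
    for task in tasks:
        if task["id"] in seen:
            continue  # a later duplicate is silently skipped
        seen.add(task["id"])
        unique.append(task)
    return unique


loaded = [
    {"id": "deploy-app", "source": "tasks/basics.yaml"},
    {"id": "list-apps", "source": "tasks/basics.yaml"},
    {"id": "deploy-app", "source": "tasks/deploy.yaml"},  # duplicate id
]
print(dedupe_tasks(loaded))
```

Because the duplicate is dropped without an error, a task defined in a later file can be shadowed by an earlier one without any visible sign.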

Setup Commands

The setup array runs shell commands before each task execution, in the task's working directory:

tasks:
  - id: modify-config
    intent: "Update the database URL in config.yaml"
    setup:
      - "mkdir -p /tmp/workspace"
      - "echo 'db_url: localhost' > /tmp/workspace/config.yaml"
    assert:
      - file_contains:
          path: "/tmp/workspace/config.yaml"
          text: "db_url:"

Use setup to create files, directories, or preconditions the task depends on. Setup failures are non-fatal; a warning is logged but the task still runs.

CLI Flags

CLI flags override values from cli-bench.yaml:

Flag                   Description
--config PATH          Path to config file (default: auto-discover cli-bench.yaml)
--filter IDS           Comma-separated task IDs to run (default: all)
--models MODELS        Comma-separated model IDs (overrides providers in config)
--tags TAGS            Comma-separated tags to filter tasks (default: all tasks)
--concurrency N        Max parallel API calls (default: 3)
--workdir DIR          Working directory for commands
--repeat N             Run each task N times (default: 1)
--output FILE          Write JSON report to file
--upload               Force upload results to CLIWatch
--github-comment FILE  Write PR comment markdown to file (for CI)
--dry-run              Print prompt for first task without calling the LLM

The config file also accepts the .yml extension (cli-bench.yml).
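
The override rule for CLI flags can be read as a simple precedence chain: a flag value if given, otherwise the config value, otherwise the documented default. Below is a minimal Python sketch of that reading. The function name effective_value is hypothetical, and None stands in for "not provided".

```python
def effective_value(flag_value, config_value, default):
    """Pick the CLI flag, then the config value, then the default."""
    for candidate in (flag_value, config_value):
        if candidate is not None:
            return candidate
    return default


# concurrency: the documented default is 3
print(effective_value(None, None, 3))  # neither flag nor config set
print(effective_value(None, 2, 3))     # config sets concurrency: 2
print(effective_value(8, 2, 3))        # --concurrency 8 on the command line
```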

Notes

  • All file paths in assertions are relative to the task working directory
  • The cli value is used as the primary command available to the LLM
  • Provider model IDs use the format provider/model-id
  • Task IDs should be unique across the entire config, including external files; duplicates are silently dropped (first occurrence wins)
  • Use cliwatch validate to check your config before running