# cli-bench.yaml Reference

Complete configuration reference for your benchmark suite.
## Full Schema

```yaml
# Required: The CLI command to benchmark
cli: string

# Optional: Command to get CLI version (default: "<cli> --version")
version_command: string

# Optional: LLM providers to test with (default: anthropic/claude-sonnet-4-20250514)
# Format: provider/model-id; any Vercel AI Gateway model works
providers:
  - anthropic/claude-sonnet-4-20250514
  - openai/gpt-4o

# Optional: How the LLM learns about your CLI (default: ["injected"])
# See "Help Modes" section below for details
help_modes:
  - injected
  - discoverable

# Optional: Max parallel API calls (default: 3)
concurrency: 3

# Optional: Working directory for tasks (default: temp directory)
workdir: string

# Optional: Upload behavior (default: auto)
# auto   = upload if CLIWATCH_API_KEY is set
# always = always attempt upload (errors are logged, not fatal)
# never  = skip upload
upload: auto | always | never

# Optional: Run all tasks N times (default: 1, range: 1-100)
repeat: 1

# Optional: Pass rate thresholds
thresholds:
  default: 80
  models:
    anthropic/claude-sonnet-4-20250514: 90
    openai/gpt-4o-mini: 70
  tolerance: 5
  behavior: error | informational

# Required: At least one task
tasks:
  - id: string            # Required: unique task identifier
    intent: string        # Required: what the LLM should do
    assert:               # Required: at least one assertion
      - exit_code: 0
    setup:                # Optional: shell commands to run before each task
      - "mkdir -p workspace"
    max_turns: 5          # Optional: max LLM rounds (1-20)
    difficulty: easy      # Optional: easy | medium | hard
    category: string      # Optional: grouping label
    repeat: 1             # Optional: run N times (1-100)

  # External task file reference (supports globs)
  - file://tasks/basics.yaml
  - file://tasks/advanced/*.yaml
```
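The `thresholds` fields interact as follows: each model's pass rate is compared against its own entry under `models` (falling back to `default`), with `tolerance` granting a small grace margin and `behavior` deciding whether a miss fails the run or is merely reported. A minimal sketch of that logic in Python, mirroring the field names above (the function is illustrative, not the tool's actual implementation):

```python
def evaluate_threshold(model, pass_rate, thresholds):
    """Sketch: check a model's pass rate against the thresholds block."""
    # Per-model override falls back to the default threshold
    target = thresholds.get("models", {}).get(model, thresholds["default"])
    tolerance = thresholds.get("tolerance", 0)
    if pass_rate >= target:
        return "pass"
    if pass_rate >= target - tolerance:
        return "pass (within tolerance)"
    # behavior 'error' fails the run; 'informational' only reports
    return "fail" if thresholds.get("behavior", "error") == "error" else "warn"

thresholds = {
    "default": 80,
    "models": {
        "anthropic/claude-sonnet-4-20250514": 90,
        "openai/gpt-4o-mini": 70,
    },
    "tolerance": 5,
    "behavior": "error",
}

print(evaluate_threshold("openai/gpt-4o", 78, thresholds))   # no override, default 80
print(evaluate_threshold("openai/gpt-4o-mini", 60, thresholds))
```

Here `openai/gpt-4o` has no override, so 78 is checked against the default of 80 and lands inside the 5-point tolerance.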
## Assertion Types

See the full Assertions Reference for details on all 10 types.

```yaml
assert:
  - exit_code: 0
  - output_contains: "text"
  - output_equals: "exact text\n"
  - error_contains: "warning"
  - file_exists: "path/to/file"
  - file_contains:
      path: "file.txt"
      text: "expected content"
  - ran: "docker build.*-t"
  - not_ran: "rm -rf"
  - run_count:
      pattern: "curl"
      min: 2
      max: 5
  - verify:
      run: "cat output.txt"
      output_contains: "expected"
```
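The command-matching assertions (`ran`, `not_ran`, `run_count`) take regular-expression patterns. As a sketch of the semantics, assuming each pattern is applied as an unanchored regex search over the commands the agent executed (an assumption about the matching engine, not documented behavior):

```python
import re

# Hypothetical transcript of commands the agent ran during a task
commands = [
    "docker pull nginx:latest",
    "docker build -t myapp .",
]

def ran(pattern, commands):
    """True if any executed command matches the pattern."""
    return any(re.search(pattern, cmd) for cmd in commands)

def run_count(pattern, commands):
    """Number of executed commands matching the pattern."""
    return sum(1 for cmd in commands if re.search(pattern, cmd))

print(ran(r"docker build.*-t", commands))   # True  -> 'ran' passes
print(ran(r"rm -rf", commands))             # False -> 'not_ran' passes
print(run_count(r"docker", commands))       # 2
```

Under this reading, `not_ran` is simply the negation of `ran`, and `run_count` passes when the count falls within the configured `min`/`max` bounds.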
## Minimal Example

```yaml
cli: mycli
tasks:
  - id: show-help
    intent: "Show the help for mycli"
    assert:
      - exit_code: 0
```
## Full Example

```yaml
cli: docker
version_command: "docker --version"

providers:
  - anthropic/claude-sonnet-4-20250514
  - openai/gpt-4o

help_modes:
  - injected
  - discoverable

concurrency: 2
upload: auto

thresholds:
  default: 80
  tolerance: 5
  behavior: error

tasks:
  - id: show-help
    intent: "Show the Docker help information"
    difficulty: easy
    category: discovery
    assert:
      - exit_code: 0
      - output_contains: "Usage"

  - id: pull-image
    intent: "Pull the official nginx image"
    difficulty: easy
    category: images
    assert:
      - exit_code: 0
      - ran: "docker pull.*nginx"

  - id: run-container
    intent: "Run an nginx container in detached mode, mapping port 8080 to 80"
    difficulty: medium
    category: containers
    assert:
      - exit_code: 0
      - ran: "docker run.*-d.*-p.*8080:80.*nginx"

  - file://tasks/advanced-docker.yaml
```
## Help Modes

Help modes control how the LLM learns about your CLI before attempting the task. This is one of the most important benchmarking dimensions: it tests whether an agent can use your CLI with different levels of prior knowledge.

| Mode | What the LLM sees | Agent behavior |
|---|---|---|
| `injected` | `--help` output included in the prompt | Reads the help text, then runs commands. Fastest: no discovery turns needed. |
| `discoverable` | Nothing; it must explore on its own | Runs `<cli> --help` and `<cli> <subcommand> --help` to learn commands and flags before attempting the task. Tests real-world agent behavior. |
| `none` | Nothing; it cannot run `--help` | Relies entirely on training knowledge. Tests whether the model already "knows" your CLI. |

```yaml
# Test all three modes (generates results for each)
help_modes:
  - injected
  - discoverable
  - none
```

Each help mode produces separate results in the benchmark matrix. If you configure 2 models and 3 help modes, you get 6 result rows.

Recommendation: start with `injected` for fast iteration, then add `discoverable` to test real-world agent behavior. Use `none` only if you want to measure baseline model knowledge.

For more details, see Help Modes.
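The result matrix described above is simply the Cartesian product of providers and help modes. A quick illustration of how the row count grows:

```python
from itertools import product

providers = ["anthropic/claude-sonnet-4-20250514", "openai/gpt-4o"]
help_modes = ["injected", "discoverable", "none"]

# One result row per (model, help mode) combination
rows = list(product(providers, help_modes))
print(len(rows))  # 2 models x 3 help modes = 6 result rows
for model, mode in rows:
    print(f"{model} / {mode}")
```

Adding a third help mode or another provider multiplies the matrix accordingly, so trim both lists during early iteration to keep runs fast.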
## External Task Files

Task files referenced with `file://` contain a plain array of tasks. Glob patterns are supported:

```yaml
tasks:
  - file://tasks/basics.yaml        # single file
  - file://tasks/advanced/*.yaml    # glob pattern
```

```yaml
# tasks/advanced-docker.yaml
- id: multi-stage-build
  intent: "Create a Dockerfile with a multi-stage build"
  difficulty: hard
  assert:
    - file_exists: "Dockerfile"
    - file_contains:
        path: "Dockerfile"
        text: "FROM"
```

Duplicate task IDs across files are silently deduplicated (first occurrence wins).
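The first-occurrence-wins dedup can be sketched as follows, assuming task files have already been loaded into lists of dicts in reference order (the task lists here are hypothetical):

```python
def dedup_tasks(*task_lists):
    """Merge task lists; the first occurrence of each task id wins."""
    seen, merged = set(), []
    for tasks in task_lists:
        for task in tasks:
            if task["id"] not in seen:
                seen.add(task["id"])
                merged.append(task)
    return merged

basics = [{"id": "show-help"}, {"id": "pull-image"}]
advanced = [{"id": "pull-image"}, {"id": "multi-stage-build"}]  # duplicate id

merged = dedup_tasks(basics, advanced)
print([t["id"] for t in merged])
```

Because resolution follows reference order, a task defined inline in `cli-bench.yaml` shadows a same-id task in a later `file://` reference.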
## Setup Commands

The `setup` array runs shell commands before each task execution, in the task's working directory:

```yaml
tasks:
  - id: modify-config
    intent: "Update the database URL in config.yaml"
    setup:
      - "mkdir -p /tmp/workspace"
      - "echo 'db_url: localhost' > /tmp/workspace/config.yaml"
    assert:
      - file_contains:
          path: "/tmp/workspace/config.yaml"
          text: "db_url:"
```

Use `setup` to create files, directories, or preconditions the task depends on. Setup failures are non-fatal: a warning is logged but the task still runs.
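The non-fatal behavior can be sketched like this, assuming each setup command runs through the shell in the task's working directory (an illustration of the described semantics, not the tool's code):

```python
import subprocess
import tempfile

def run_setup(commands, workdir):
    """Run setup commands; log warnings on failure instead of raising."""
    failures = []
    for cmd in commands:
        result = subprocess.run(cmd, shell=True, cwd=workdir,
                                capture_output=True, text=True)
        if result.returncode != 0:
            # Non-fatal: warn and keep going so the task still runs
            print(f"warning: setup command failed ({result.returncode}): {cmd}")
            failures.append(cmd)
    return failures

with tempfile.TemporaryDirectory() as wd:
    # 'false' always exits nonzero, so it produces a warning but no exception
    run_setup(["mkdir -p workspace", "false"], wd)
```

Keep setup commands idempotent (e.g. `mkdir -p` rather than `mkdir`) since they re-run before every repetition of the task.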
## CLI Flags

CLI flags override values from `cli-bench.yaml`:

| Flag | Description |
|---|---|
| `--config PATH` | Path to config file (default: auto-discover `cli-bench.yaml`) |
| `--filter CLIS` | Comma-separated CLI names to run (default: all) |
| `--models MODELS` | Comma-separated model IDs (overrides `providers` in config) |
| `--help-modes MODES` | Help modes: `injected,discoverable,none` (default: `injected`) |
| `--concurrency N` | Max parallel API calls (default: 3) |
| `--workdir DIR` | Working directory for commands |
| `--repeat N` | Run each task N times (default: 1) |
| `--output FILE` | Write JSON report to file |
| `--upload` | Force upload results to CLIWatch |
| `--dry-run` | Print the prompt for the first task without calling the LLM |

The config file also accepts the `.yml` extension (`cli-bench.yml`).
## Notes

- All file paths in assertions are relative to the task working directory
- The `cli` value is used as the primary command available to the LLM
- Provider model IDs use the format `provider/model-id`
- Task IDs must be unique across the entire config, including external files
- Use `cliwatch validate` to check your config before running