Example Benchmark Suite

A complete, copy-paste-ready benchmark project. Replace the placeholder values with your CLI's commands and you're ready to run.

File Structure

All CLIWatch config lives in a cliwatch/ folder in your project root, keeping your repo clean.

your-project/
├── cliwatch/
│   ├── cli-bench.yaml         # Main config: CLI metadata + settings
│   ├── install.sh             # Install script (used in CI)
│   └── tasks/
│       ├── 01-discovery.yaml  # "What do I have?" tasks
│       ├── 02-actions.yaml    # "Do something" tasks
│       └── 03-workflows.yaml  # Multi-step workflows
└── .github/
    └── workflows/
        └── cliwatch.yml       # CI workflow

cliwatch/cli-bench.yaml

The main configuration file.

cli: mycli
version_command: "mycli --version"
display_name: "My CLI"
category: "Developer Tools"

providers:
  - anthropic/claude-haiku-4.5
  - openai/gpt-5-nano

concurrency: 3

system_prompt: |
  You are working in a temporary directory.
  The CLI is installed and ready to use.
  Complete each task using CLI commands.

tasks:
  - "file://tasks/*.yaml"

Key fields:

| Field | Required | Description |
| --- | --- | --- |
| cli | Yes | CLI name (used as project identifier) |
| version_command | No | Command to get version (shown in results) |
| display_name | No | Human-readable name |
| providers | No | LLM models to test with (defaults to Claude Haiku) |
| tasks | Yes | Inline tasks or a file:// glob to load from YAML files |
| system_prompt | No | Context given to the agent before each task |

See the full YAML Reference for all options.

cliwatch/install.sh

Used by CI to install the CLI before running benchmarks. It should respect the CLI_VERSION environment variable so that version sweeps can pin a specific release.

#!/usr/bin/env bash
set -euo pipefail

if [ -n "${CLI_VERSION:-}" ]; then
  npm install -g mycli@"$CLI_VERSION"
else
  npm install -g mycli
fi

mycli --version

Make it executable: chmod +x cliwatch/install.sh
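
You can sanity-check the version-pinning branch before wiring it into CI. In this sketch, npm is swapped for echo so the logic dry-runs anywhere without installing anything (the 2.0.0 version number is a made-up example):

```shell
#!/usr/bin/env bash
# Dry-run sketch of install.sh's CLI_VERSION branch: echo stands in
# for npm so you can inspect the command that would be executed.
set -euo pipefail

install_cmd() {
  if [ -n "${CLI_VERSION:-}" ]; then
    echo "npm install -g mycli@${CLI_VERSION}"
  else
    echo "npm install -g mycli"
  fi
}

CLI_VERSION=2.0.0 install_cmd   # pinned: npm install -g mycli@2.0.0
install_cmd                     # default: npm install -g mycli
```

The same pattern works for any package manager; only the two install lines change.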

Task Files

Tasks are the core of your benchmark. Each task gives the agent an intent (what to do) and assertions (how to verify it worked).

cliwatch/tasks/01-discovery.yaml

Discovery tasks test whether the agent can find and list resources. They're usually easy and make good starting points.

# Discovery: can the agent find things?

- id: list-resources
  intent: "List all resources in my account."
  difficulty: easy
  category: discovery
  max_turns: 3
  assert:
    - ran: "mycli"
    - exit_code: 0

- id: list-filtered
  intent: "Show me only the active resources."
  difficulty: medium
  category: discovery
  max_turns: 5
  assert:
    - ran: "mycli"

- id: export-list
  intent: "Export all resources as JSON to resources.json."
  difficulty: medium
  category: discovery
  max_turns: 5
  assert:
    - ran: "mycli"
    - file_exists: "resources.json"

cliwatch/tasks/02-actions.yaml

Action tasks test whether the agent can create, modify, or delete resources.

# Actions: can the agent do things?

- id: create-resource
  intent: "Create a new resource called bench-test."
  difficulty: medium
  category: actions
  max_turns: 5
  assert:
    - ran: "mycli"
    - exit_code: 0

- id: update-resource
  intent: "Update the resource bench-test with the tag environment=staging."
  difficulty: medium
  category: actions
  max_turns: 5
  assert:
    - ran: "mycli"

- id: show-details
  intent: "Show me the full details of the resource bench-test."
  difficulty: easy
  category: actions
  max_turns: 3
  assert:
    - ran: "mycli"
    - exit_code: 0

cliwatch/tasks/03-workflows.yaml

Workflow tasks combine multiple steps. These are harder and test whether the agent can chain commands.

# Workflows: multi-step tasks

- id: create-and-verify
  intent: >
    Create a resource called bench-workflow, then verify it exists
    by listing all resources and confirming bench-workflow appears.
  difficulty: hard
  category: workflows
  max_turns: 8
  assert:
    - ran: "mycli"
    - exit_code: 0

- id: export-and-count
  intent: >
    Export all resources as JSON, save to bench-export.json,
    then count how many resources there are.
  difficulty: hard
  category: workflows
  max_turns: 8
  assert:
    - file_exists: "bench-export.json"

Task Design Tips

  • Start easy. Begin with --help and --version tasks to establish a baseline.
  • Use natural language. Write intents the way a developer would ask, not as CLI commands.
  • Categories help. Group tasks by what they test (discovery, actions, workflows) for better reporting.
  • Difficulty matters. easy (1-2 turns), medium (3-5 turns), hard (5+ turns).
  • Assertions verify. Use ran: regex to check the right command was called, exit_code: for success, file_exists: for output files.
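
Putting these tips together, a task that exercises a stricter `ran:` regex might look like the following. This is a hypothetical example: the `delete|rm|remove` alternation is a guess at plausible subcommand names for your CLI, so adapt it to the commands yours actually ships.

```
# Hypothetical: natural-language intent, category, turn budget, and a
# regex assertion that accepts any plausible delete subcommand.
- id: delete-resource
  intent: "Delete the resource called bench-test."
  difficulty: medium
  category: actions
  max_turns: 5
  assert:
    - ran: "mycli\\s+(delete|rm|remove)"
    - exit_code: 0
```

A regex like this keeps the assertion robust when a CLI has several aliases for the same operation.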

See Writing Effective Tasks and Assertions for more.

GitHub Actions Workflow

.github/workflows/cliwatch.yml

name: CLI Benchmarks

on:
  pull_request:
  push:
    branches: [main]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: 22

      - name: Install CLI
        run: bash cliwatch/install.sh

      - name: Run benchmarks
        run: npx @cliwatch/cli-bench --config cliwatch/cli-bench.yaml --upload
        env:
          AI_GATEWAY_API_KEY: ${{ secrets.AI_GATEWAY_API_KEY }}
          CLIWATCH_API_KEY: ${{ secrets.CLIWATCH_API_KEY }}

Required secrets:

| Secret | Where to get it |
| --- | --- |
| CLIWATCH_API_KEY | app.cliwatch.com > API Keys |
| AI_GATEWAY_API_KEY | Your LLM provider (Anthropic, OpenAI, etc.) |
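
One way to add these repository secrets from the terminal is the GitHub CLI (the key values below are placeholders):

```
# Run from the repository root; prompts for auth if needed.
gh secret set CLIWATCH_API_KEY --body "YOUR_CLIWATCH_KEY"
gh secret set AI_GATEWAY_API_KEY --body "YOUR_PROVIDER_KEY"
```

You can also paste the same values into Settings > Secrets and variables > Actions in the GitHub web UI.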

Running Locally

# Install cli-bench
npm install -g @cliwatch/cli-bench

# Run without uploading (local only)
npx @cliwatch/cli-bench --config cliwatch/cli-bench.yaml

# Run and upload results to dashboard
npx @cliwatch/cli-bench --config cliwatch/cli-bench.yaml --upload

Next Steps