Example Benchmark Suite

A complete, copy-paste-ready benchmark project. Replace the placeholder values with your CLI's commands and you're ready to run.

File Structure

All CLIWatch config lives in a cliwatch/ folder in your project root, keeping your repo clean.

your-project/
├── cliwatch/
│   ├── cli-bench.yaml         # Main config: CLI metadata + settings
│   ├── install.sh             # Install script (used in CI)
│   └── tasks/
│       ├── 01-discovery.yaml  # "What do I have?" tasks
│       ├── 02-actions.yaml    # "Do something" tasks
│       └── 03-workflows.yaml  # Multi-step workflows
└── .github/
    └── workflows/
        └── cliwatch.yml       # CI workflow

cliwatch/cli-bench.yaml

The main configuration file.

cli: mycli
version_command: "mycli --version"
display_name: "My CLI"
category: "Developer Tools"

providers:
  - anthropic/claude-haiku-4.5
  - openai/gpt-5-nano

concurrency: 3

system_prompt: |
  You are working in a temporary directory.
  The CLI is installed and ready to use.
  Complete each task using CLI commands.

tasks:
  - "file://tasks/*.yaml"

Key fields:

| Field | Required | Description |
| --- | --- | --- |
| cli | Yes | CLI name (used as project identifier) |
| version_command | No | Command to get version (shown in results) |
| display_name | No | Human-readable name |
| providers | No | LLM models to test with (defaults to Claude Haiku) |
| tasks | Yes | Inline tasks or a file:// glob to load from YAML files |
| system_prompt | No | Context given to the agent before each task |

See the full YAML Reference for all options.

cliwatch/install.sh

Used by CI to install the CLI before running benchmarks. It should respect the CLI_VERSION environment variable so that version sweeps can pin a specific release.

#!/usr/bin/env bash
set -euo pipefail

if [ -n "${CLI_VERSION:-}" ]; then
  npm install -g mycli@"$CLI_VERSION"
else
  npm install -g mycli
fi

mycli --version

Make it executable: chmod +x cliwatch/install.sh
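
You can sanity-check the version-pinning branch before wiring it into CI. In this sketch, npm is swapped for echo so the logic dry-runs anywhere without installing anything (the 2.0.0 version number is a made-up example):

```shell
#!/usr/bin/env bash
# Dry-run sketch of install.sh's CLI_VERSION branch: echo stands in
# for npm so you can inspect the command that would be executed.
set -euo pipefail

install_cmd() {
  if [ -n "${CLI_VERSION:-}" ]; then
    echo "npm install -g mycli@${CLI_VERSION}"
  else
    echo "npm install -g mycli"
  fi
}

CLI_VERSION=2.0.0 install_cmd   # pinned: npm install -g mycli@2.0.0
install_cmd                     # default: npm install -g mycli
```

The same pattern works for any package manager; only the two install lines change.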

Task Files

Tasks are the core of your benchmark. Each task gives the agent an intent (what to do) and assertions (how to verify it worked).

cliwatch/tasks/01-discovery.yaml

Discovery tasks test whether the agent can find and list resources. They're usually easy and make good starting points.

# Discovery: can the agent find things?

- id: list-resources
  intent: "List all resources in my account."
  difficulty: easy
  category: discovery
  max_turns: 3
  assert:
    - ran: "mycli"
    - exit_code: 0

- id: list-filtered
  intent: "Show me only the active resources."
  difficulty: medium
  category: discovery
  max_turns: 5
  assert:
    - ran: "mycli"

- id: export-list
  intent: "Export all resources as JSON to resources.json."
  difficulty: medium
  category: discovery
  max_turns: 5
  assert:
    - ran: "mycli"
    - file_exists: "resources.json"

cliwatch/tasks/02-actions.yaml

Action tasks test whether the agent can create, modify, or delete resources.

# Actions: can the agent do things?

- id: create-resource
  intent: "Create a new resource called bench-test."
  difficulty: medium
  category: actions
  max_turns: 5
  assert:
    - ran: "mycli"
    - exit_code: 0

- id: update-resource
  intent: "Update the resource bench-test with the tag environment=staging."
  difficulty: medium
  category: actions
  max_turns: 5
  assert:
    - ran: "mycli"

- id: show-details
  intent: "Show me the full details of the resource bench-test."
  difficulty: easy
  category: actions
  max_turns: 3
  assert:
    - ran: "mycli"
    - exit_code: 0

cliwatch/tasks/03-workflows.yaml

Workflow tasks combine multiple steps. These are harder and test whether the agent can chain commands.

# Workflows: multi-step tasks

- id: create-and-verify
  intent: >
    Create a resource called bench-workflow, then verify it exists
    by listing all resources and confirming bench-workflow appears.
  difficulty: hard
  category: workflows
  max_turns: 8
  assert:
    - ran: "mycli"
    - exit_code: 0

- id: export-and-count
  intent: >
    Export all resources as JSON, save to bench-export.json,
    then count how many resources there are.
  difficulty: hard
  category: workflows
  max_turns: 8
  assert:
    - file_exists: "bench-export.json"

Task Design Tips

  • Start easy. Begin with --help and --version tasks to establish a baseline.
  • Use natural language. Write intents the way a developer would ask, not as CLI commands.
  • Categories help. Group tasks by what they test (discovery, actions, workflows) for better reporting.
  • Difficulty matters. easy (1-2 turns), medium (3-5 turns), hard (5+ turns).
  • Assertions verify. Use ran: regex to check the right command was called, exit_code: for success, file_exists: for output files.
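
Putting these tips together, a task that exercises a stricter `ran:` regex might look like the following. This is a hypothetical example: the `delete|rm|remove` alternation is a guess at plausible subcommand names for your CLI, so adapt it to the commands yours actually ships.

```
# Hypothetical: natural-language intent, category, turn budget, and a
# regex assertion that accepts any plausible delete subcommand.
- id: delete-resource
  intent: "Delete the resource called bench-test."
  difficulty: medium
  category: actions
  max_turns: 5
  assert:
    - ran: "mycli\\s+(delete|rm|remove)"
    - exit_code: 0
```

A regex like this keeps the assertion robust when a CLI has several aliases for the same operation.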

See Writing Effective Tasks and Assertions for more.

GitHub Actions Workflow

.github/workflows/cliwatch.yml

name: CLI Benchmarks

on:
  pull_request:
  push:
    branches: [main]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: 22

      - name: Install CLI
        run: bash cliwatch/install.sh

      - name: Run benchmarks
        run: npx @cliwatch/cli-bench --config cliwatch/cli-bench.yaml --upload
        env:
          AI_GATEWAY_API_KEY: ${{ secrets.AI_GATEWAY_API_KEY }}
          CLIWATCH_API_KEY: ${{ secrets.CLIWATCH_API_KEY }}

Required secrets:

| Secret | Where to get it |
| --- | --- |
| CLIWATCH_API_KEY | app.cliwatch.com > API Keys |
| AI_GATEWAY_API_KEY | Your LLM provider (Anthropic, OpenAI, etc.) |
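
One way to add these repository secrets from the terminal is the GitHub CLI (the key values below are placeholders):

```
# Run from the repository root; prompts for auth if needed.
gh secret set CLIWATCH_API_KEY --body "YOUR_CLIWATCH_KEY"
gh secret set AI_GATEWAY_API_KEY --body "YOUR_PROVIDER_KEY"
```

You can also paste the same values into Settings > Secrets and variables > Actions in the GitHub web UI.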

Running Locally

# Install cli-bench
npm install -g @cliwatch/cli-bench

# Run without uploading (local only)
npx @cliwatch/cli-bench --config cliwatch/cli-bench.yaml

# Run and upload results to dashboard
npx @cliwatch/cli-bench --config cliwatch/cli-bench.yaml --upload

Next Steps