# Example Benchmark Suite
A complete, copy-paste-ready benchmark project. Replace the placeholder values with your CLI's commands and you're ready to run.
## File Structure
All CLIWatch config lives in a cliwatch/ folder in your project root, keeping your repo clean.
```text
your-project/
├── cliwatch/
│   ├── cli-bench.yaml        # Main config: CLI metadata + settings
│   ├── install.sh            # Install script (used in CI)
│   └── tasks/
│       ├── 01-discovery.yaml # "What do I have?" tasks
│       ├── 02-actions.yaml   # "Do something" tasks
│       └── 03-workflows.yaml # Multi-step workflows
└── .github/
    └── workflows/
        └── cliwatch.yml      # CI workflow
```
## cliwatch/cli-bench.yaml
The main configuration file.
```yaml
cli: mycli
version_command: "mycli --version"
display_name: "My CLI"
category: "Developer Tools"

providers:
  - anthropic/claude-haiku-4.5
  - openai/gpt-5-nano

concurrency: 3

system_prompt: |
  You are working in a temporary directory.
  The CLI is installed and ready to use.
  Complete each task using CLI commands.

tasks:
  - "file://tasks/*.yaml"
```
Key fields:

| Field | Required | Description |
|---|---|---|
| `cli` | Yes | CLI name (used as the project identifier) |
| `version_command` | No | Command to get the version (shown in results) |
| `display_name` | No | Human-readable name |
| `providers` | No | LLM models to test with (defaults to Claude Haiku) |
| `tasks` | Yes | Inline tasks or a `file://` glob to load from YAML files |
| `system_prompt` | No | Context given to the agent before each task |
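For a small suite, tasks can also live inline in the main config instead of behind a `file://` glob. A minimal sketch, reusing the task schema from the task files below (the task itself is illustrative):

```yaml
cli: mycli
tasks:
  - id: show-version
    intent: "Tell me which version of the CLI is installed."
    difficulty: easy
    max_turns: 2
    assert:
      - ran: "mycli"
      - exit_code: 0
```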
See the full YAML Reference for all options.
## cliwatch/install.sh

CI uses this script to install the CLI before running benchmarks. It should respect the `CLI_VERSION` environment variable so version sweeps can pin a specific release.
```bash
#!/usr/bin/env bash
set -euo pipefail

if [ -n "${CLI_VERSION:-}" ]; then
  npm install -g mycli@"$CLI_VERSION"
else
  npm install -g mycli
fi

mycli --version
```
Make it executable: `chmod +x cliwatch/install.sh`
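Because the script branches on `CLI_VERSION`, you can exercise both paths locally (the version number below is a placeholder):

```bash
# Install the latest release (CLI_VERSION unset)
bash cliwatch/install.sh

# Pin a specific release, as a version sweep would (placeholder version)
CLI_VERSION=1.2.3 bash cliwatch/install.sh
```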
## Task Files
Tasks are the core of your benchmark. Each task gives the agent an intent (what to do) and assertions (how to verify it worked).
### cliwatch/tasks/01-discovery.yaml

Discovery tasks test whether the agent can find and list resources. These are usually easy and make good starting points.
```yaml
# Discovery: can the agent find things?
- id: list-resources
  intent: "List all resources in my account."
  difficulty: easy
  category: discovery
  max_turns: 3
  assert:
    - ran: "mycli"
    - exit_code: 0

- id: list-filtered
  intent: "Show me only the active resources."
  difficulty: medium
  category: discovery
  max_turns: 5
  assert:
    - ran: "mycli"

- id: export-list
  intent: "Export all resources as JSON to resources.json."
  difficulty: medium
  category: discovery
  max_turns: 5
  assert:
    - ran: "mycli"
    - file_exists: "resources.json"
```
### cliwatch/tasks/02-actions.yaml
Action tasks test whether the agent can create, modify, or delete resources.
```yaml
# Actions: can the agent do things?
- id: create-resource
  intent: "Create a new resource called bench-test."
  difficulty: medium
  category: actions
  max_turns: 5
  assert:
    - ran: "mycli"
    - exit_code: 0

- id: update-resource
  intent: "Update the resource bench-test with the tag environment=staging."
  difficulty: medium
  category: actions
  max_turns: 5
  assert:
    - ran: "mycli"

- id: show-details
  intent: "Show me the full details of the resource bench-test."
  difficulty: easy
  category: actions
  max_turns: 3
  assert:
    - ran: "mycli"
    - exit_code: 0
```
### cliwatch/tasks/03-workflows.yaml
Workflow tasks combine multiple steps. These are harder and test whether the agent can chain commands.
```yaml
# Workflows: multi-step tasks
- id: create-and-verify
  intent: >
    Create a resource called bench-workflow, then verify it exists
    by listing all resources and confirming bench-workflow appears.
  difficulty: hard
  category: workflows
  max_turns: 8
  assert:
    - ran: "mycli"
    - exit_code: 0

- id: export-and-count
  intent: >
    Export all resources as JSON, save to bench-export.json,
    then count how many resources there are.
  difficulty: hard
  category: workflows
  max_turns: 8
  assert:
    - file_exists: "bench-export.json"
```
## Task Design Tips
- **Start easy.** Begin with `--help` and `--version` tasks to establish a baseline.
- **Use natural language.** Write intents the way a developer would ask, not as CLI commands.
- **Categories help.** Group tasks by what they test (discovery, actions, workflows) for better reporting.
- **Difficulty matters.** `easy` (1-2 turns), `medium` (3-5 turns), `hard` (5+ turns).
- **Assertions verify.** Use a `ran:` regex to check the right command was called, `exit_code:` for success, and `file_exists:` for output files (see the sketch after this list).
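For example, a stricter `ran:` pattern can check that the agent used the right subcommand and flags rather than merely invoking the CLI at all. A sketch, where the `list` subcommand and `--format` flag are illustrative stand-ins for your CLI's real interface:

```yaml
- id: list-json-strict
  intent: "List all resources as JSON."
  difficulty: medium
  category: discovery
  max_turns: 5
  assert:
    - ran: "mycli list.*--format[= ]json"  # regex: right subcommand AND flag (names are illustrative)
    - exit_code: 0
```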
See Writing Effective Tasks and Assertions for more.
## GitHub Actions Workflow

### .github/workflows/cliwatch.yml
```yaml
name: CLI Benchmarks

on:
  pull_request:
  push:
    branches: [main]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
      - name: Install CLI
        run: bash cliwatch/install.sh
      - name: Run benchmarks
        run: npx @cliwatch/cli-bench --config cliwatch/cli-bench.yaml --upload
        env:
          AI_GATEWAY_API_KEY: ${{ secrets.AI_GATEWAY_API_KEY }}
          CLIWATCH_API_KEY: ${{ secrets.CLIWATCH_API_KEY }}
```
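The install script's `CLI_VERSION` support also enables version sweeps in CI. One way to wire that up — a sketch, not an official CLIWatch feature, with placeholder version numbers — is a job matrix that feeds `CLI_VERSION` to the install step:

```yaml
jobs:
  benchmark:
    strategy:
      matrix:
        cli_version: ["1.2.0", "1.3.0"]  # placeholder versions to sweep
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install CLI
        run: bash cliwatch/install.sh
        env:
          CLI_VERSION: ${{ matrix.cli_version }}
```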
Required secrets:

| Secret | Where to get it |
|---|---|
| `CLIWATCH_API_KEY` | app.cliwatch.com > API Keys |
| `AI_GATEWAY_API_KEY` | Your LLM provider (Anthropic, OpenAI, etc.) |
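If you use the GitHub CLI, both secrets can be added from the terminal rather than the repository settings page (this assumes `gh` is installed and authenticated for the repo; you'll be prompted to paste each value):

```bash
gh secret set CLIWATCH_API_KEY
gh secret set AI_GATEWAY_API_KEY
```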
## Running Locally
```bash
# Install cli-bench
npm install -g @cliwatch/cli-bench

# Run without uploading (local only)
npx @cliwatch/cli-bench --config cliwatch/cli-bench.yaml

# Run and upload results to the dashboard
npx @cliwatch/cli-bench --config cliwatch/cli-bench.yaml --upload
```
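Local runs talk to the same LLM providers as CI, so the gateway key presumably needs to be in your shell environment as well — a sketch, assuming the same `AI_GATEWAY_API_KEY` variable the CI workflow passes (the key value is a placeholder):

```bash
export AI_GATEWAY_API_KEY="your-key-here"  # same key CI passes via secrets
npx @cliwatch/cli-bench --config cliwatch/cli-bench.yaml
```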
## Next Steps
- YAML Reference for all config options
- Assertions for all 10 assertion types
- Providers & Models for available LLM models
- GitHub Actions for CI setup details