Skip to main content

CLIWatch Docs

Agent-readiness testing for CLIs. Benchmark how well AI coding agents can use your command-line tool, catch regressions in CI, and get PR comments with pass rates.

Quick Start

Option 1: Let your AI assistant set it up

Paste this into Claude Code, Cursor, or Codex:

Install @cliwatch/cli globally, then run cliwatch skills setup to read the setup guide. Follow the steps to scaffold the benchmark config and a GitHub Actions workflow. Make sure CLIWATCH_API_KEY and AI_GATEWAY_API_KEY are set as GitHub secrets so results upload and I get PR comments.

Option 2: Set up manually

1. Create a task suite

Create a cliwatch/ folder in your project with a cli-bench.yaml config and task files:

your-project/
├── cliwatch/
│ ├── cli-bench.yaml
│ └── tasks/
│ └── 01-basics.yaml
cliwatch/cli-bench.yaml
cli: mycli
version_command: "mycli --version"

providers:
- anthropic/claude-haiku-4.5

tasks:
- "file://tasks/*.yaml"
cliwatch/tasks/01-basics.yaml
- id: show-help
intent: "Show the help information for mycli"
difficulty: easy
category: basics
max_turns: 2
assert:
- exit_code: 0
- output_contains: "usage"

See the Example Benchmark Suite for a complete, copy-paste-ready project, or the full YAML Reference for all options.

2. Run the benchmark

npm install -g @cliwatch/cli-bench

export AI_GATEWAY_API_KEY="vck_..."
export CLIWATCH_API_KEY="cw_..."

cli-bench --config cliwatch/cli-bench.yaml --upload

All model calls go through the Vercel AI Gateway, so you only need one key for all providers. Create a CLIWatch API key at app.cliwatch.com.

3. Add to CI

Create .github/workflows/cliwatch.yml:

name: CLIWatch Benchmarks
on:
pull_request:
push:
branches: [main]

jobs:
benchmark:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: 22
- run: npm install -g @cliwatch/cli-bench
- run: cli-bench --config cliwatch/cli-bench.yaml --upload
env:
AI_GATEWAY_API_KEY: ${{ secrets.AI_GATEWAY_API_KEY }}
CLIWATCH_API_KEY: ${{ secrets.CLIWATCH_API_KEY }}

See the full GitHub Actions guide for caching, thresholds, and PR comments.

4. View results

Results appear at app.cliwatch.com with:

  • Pass rate matrix: which tasks pass on which models
  • Trend charts: track pass rates across releases
  • PR comments: benchmark results posted on every pull request

What's Next?