CLIWatch Docs

Agent-readiness testing for CLIs — benchmark how well AI coding agents can use your command-line tool, catch regressions in CI, and get PR comments with pass rates.

Quick Start

Option 1: Let your AI assistant set it up

Paste this into Claude Code, Cursor, or Codex:

Install @cliwatch/cli globally, then run cliwatch skills to read the setup docs. Use cliwatch init --ci to scaffold the benchmark config and a GitHub Actions workflow. Make sure CLIWATCH_API_KEY and AI_GATEWAY_API_KEY are set as GitHub secrets so results upload and I get PR comments.

Option 2: Set up manually

1. Create a task suite

Create a cli-bench.yaml in your project root (next to package.json); cli-bench looks for this file in the directory you run it from:

cli: mycli
version_command: "mycli --version"

tasks:
  - id: show-help
    intent: "Show the help information for mycli"
    difficulty: easy
    assert:
      - exit_code: 0
      - output_contains: "usage"

  - id: create-project
    intent: "Create a new project called my-app"
    difficulty: medium
    assert:
      - exit_code: 0
      - file_exists: "my-app/package.json"

See the full cli-bench.yaml Reference for all options.
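Assertions compose, so a single task can check the exit code, the command output, and the files the agent leaves behind. The sketch below is another entry for the tasks list using only the keys shown above; the task itself, the output string, and the file name are illustrative, not taken from the reference:

  # Illustrative task -- the command behavior, output text, and file name are assumptions.
  - id: init-config
    intent: "Generate a default config file for mycli"
    difficulty: medium
    assert:
      - exit_code: 0                        # the command must succeed
      - output_contains: "created"          # and report what it did
      - file_exists: "mycli.config.json"    # and actually write the file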

2. Run the benchmark

npm install -g @cliwatch/cli-bench

export AI_GATEWAY_API_KEY="vck_..."
export CLIWATCH_API_KEY="cw_..."

# Run from the directory containing cli-bench.yaml
cli-bench --upload

All model calls go through the Vercel AI Gateway — one key for all providers. Create a CLIWatch API key at app.cliwatch.com.
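Because the gateway addresses models with provider/model style IDs, switching or adding models is a config change rather than a new credential. This page does not show how the model list is declared; the models key below is an assumption for illustration, so check the cli-bench.yaml Reference for the real option:

# Hypothetical: the models key is an assumption, not confirmed by this page.
# Gateway model IDs follow a provider/model pattern.
models:
  - anthropic/claude-sonnet-4
  - openai/gpt-4o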

3. Add to CI

Create .github/workflows/cliwatch.yml:

name: CLIWatch Benchmarks
on:
pull_request:
push:
branches: [main]

jobs:
benchmark:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: 22
- run: npm install -g @cliwatch/cli-bench
- run: cli-bench
env:
AI_GATEWAY_API_KEY: ${{ secrets.AI_GATEWAY_API_KEY }}
CLIWATCH_API_KEY: ${{ secrets.CLIWATCH_API_KEY }}

See the full GitHub Actions guide for caching, thresholds, and PR comments.
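As a starting point, here is a sketch of two of those additions: npm download caching via setup-node's built-in cache option (a real setup-node feature, but it needs a lockfile committed to the repo), and a pass-rate gate on the cli-bench run. The --fail-under flag is an assumption for illustration; the GitHub Actions guide documents the actual threshold option.

      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: npm   # caches npm's download cache between runs (requires a package-lock.json)
      - run: npm install -g @cliwatch/cli-bench
      - run: cli-bench --fail-under 80   # hypothetical flag: fail the job if the pass rate drops below 80%
        env:
          AI_GATEWAY_API_KEY: ${{ secrets.AI_GATEWAY_API_KEY }}
          CLIWATCH_API_KEY: ${{ secrets.CLIWATCH_API_KEY }}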

4. View results

Results appear at app.cliwatch.com with:

  • Pass rate matrix — which tasks pass on which models
  • Trend charts — track pass rates across releases
  • PR comments — benchmark results posted on every pull request

What's Next?