CLIWatch Docs
Agent-readiness testing for CLIs. Benchmark how well AI coding agents can use your command-line tool, catch regressions in CI, and get PR comments with pass rates.
Quick Start
Option 1: Let your AI assistant set it up
Paste this into Claude Code, Cursor, or Codex:
```text
Install @cliwatch/cli globally, then run `cliwatch skills setup` to read the setup guide. Follow the steps to scaffold the benchmark config and a GitHub Actions workflow. Make sure CLIWATCH_API_KEY and AI_GATEWAY_API_KEY are set as GitHub secrets so results upload and I get PR comments.
```
Option 2: Set up manually
1. Create a task suite
Create a cliwatch/ folder in your project with a cli-bench.yaml config and task files:
```text
your-project/
├── cliwatch/
│   ├── cli-bench.yaml
│   └── tasks/
│       └── 01-basics.yaml
```
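The layout above can be scaffolded with standard shell commands before filling in the files:

```shell
# Create the cliwatch/ folder and empty config and task files
mkdir -p cliwatch/tasks
touch cliwatch/cli-bench.yaml cliwatch/tasks/01-basics.yaml
```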
```yaml
# cliwatch/cli-bench.yaml
cli: mycli
version_command: "mycli --version"
providers:
  - anthropic/claude-haiku-4.5
tasks:
  - "file://tasks/*.yaml"
```
```yaml
# cliwatch/tasks/01-basics.yaml
- id: show-help
  intent: "Show the help information for mycli"
  difficulty: easy
  category: basics
  max_turns: 2
  assert:
    - exit_code: 0
    - output_contains: "usage"
```
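Additional tasks follow the same shape. As a sketch, here is a second task reusing only the fields and assertion types shown above; the task `id` and `intent` are made up for illustration:

```yaml
# cliwatch/tasks/01-basics.yaml (hypothetical second entry)
- id: show-version
  intent: "Print the installed version of mycli"
  difficulty: easy
  category: basics
  max_turns: 2
  assert:
    - exit_code: 0
    - output_contains: "mycli"
```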
See the Example Benchmark Suite for a complete, copy-paste-ready project, or the full YAML Reference for all options.
2. Run the benchmark
```shell
npm install -g @cliwatch/cli-bench
export AI_GATEWAY_API_KEY="vck_..."
export CLIWATCH_API_KEY="cw_..."
cli-bench --config cliwatch/cli-bench.yaml --upload
```
All model calls go through the Vercel AI Gateway, so you only need one key for all providers. Create a CLIWatch API key at app.cliwatch.com.
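For local runs, a small guard in plain POSIX sh (not a cli-bench feature) can catch a missing or empty key before the benchmark starts; the key values below are placeholders for illustration:

```shell
# Placeholder keys for illustration only -- substitute your real values.
export AI_GATEWAY_API_KEY="vck_example"
export CLIWATCH_API_KEY="cw_example"

# ${VAR:?msg} aborts with msg if VAR is unset or empty,
# so a forgotten secret fails fast instead of mid-benchmark.
: "${AI_GATEWAY_API_KEY:?AI_GATEWAY_API_KEY is not set}"
: "${CLIWATCH_API_KEY:?CLIWATCH_API_KEY is not set}"
echo "both API keys are set"
```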
3. Add to CI
Create .github/workflows/cliwatch.yml:
```yaml
name: CLIWatch Benchmarks

on:
  pull_request:
  push:
    branches: [main]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
      - run: npm install -g @cliwatch/cli-bench
      - run: cli-bench --config cliwatch/cli-bench.yaml --upload
        env:
          AI_GATEWAY_API_KEY: ${{ secrets.AI_GATEWAY_API_KEY }}
          CLIWATCH_API_KEY: ${{ secrets.CLIWATCH_API_KEY }}
```
See the full GitHub Actions guide for caching, thresholds, and PR comments.
4. View results
Results appear at app.cliwatch.com with:
- Pass rate matrix: which tasks pass on which models
- Trend charts: track pass rates across releases
- PR comments: benchmark results posted on every pull request
What's Next?
- Account Setup: get your API keys and configure CI secrets
- CLI Reference: all commands and flags for `cliwatch` and `cli-bench`
- cli-bench.yaml Reference: full config schema
- Assertions: all 10 assertion types
- Providers & Models: supported LLMs and model IDs
- GitHub Actions: CI setup with thresholds and PR comments
- Writing Effective Tasks: tips for good intents and test strategy
- Reading the Dashboard: interpret pass rates, grades, and conversation traces
- Debugging Benchmarks: reproduce failures, read assertions, fix flaky tasks
- Troubleshooting: common issues and debugging