# CLI Reference

CLIWatch ships two npm packages:

- `@cliwatch/cli`: dashboard companion. Scaffolds configs, validates task suites, views runs and trends, and embeds skills documentation for AI assistants.
- `@cliwatch/cli-bench`: benchmark runner. Executes your task suite against LLMs and uploads results.

Install them globally:

```shell
npm install -g @cliwatch/cli @cliwatch/cli-bench
```

`cliwatch bench` delegates to `cli-bench`; you can use either.
## cliwatch Commands

### cliwatch init

Scaffold a `cli-bench.yaml` config and optional GitHub Actions workflow.

```shell
cliwatch init [--cli NAME] [--ci] [--dir PATH]
```

| Flag | Description |
|---|---|
| `--cli <name>` | CLI name (skips interactive prompt, useful for AI assistants) |
| `--ci` | Also generate `.github/workflows/cliwatch.yml` (requires `--cli`) |
| `--dir <path>` | Target directory (default: current directory) |

Auto-detects the CLI name from `package.json`, `Cargo.toml`, `go.mod`, or `pyproject.toml`.
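The scaffolded config might look roughly like the sketch below. The field names are assumptions pieced together from the features documented on this page, not the confirmed schema — run `cliwatch skills yaml-reference` for the authoritative reference:

```yaml
# Hypothetical cli-bench.yaml sketch — field names are assumed, not confirmed schema.
cli: mytool                                    # the CLI under test
models:
  - anthropic/claude-sonnet-4-5-20250929       # model IDs, as in --models
tasks:
  - id: show-help                              # unique task ID
    prompt: "Show the help text for mytool"
    tags: [smoke]                              # filterable via --tags
    assertions:
      - ran: "mytool --help"                   # regex matched against executed commands
```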
### cliwatch validate

Validate a `cli-bench.yaml` config file without running benchmarks.

```shell
cliwatch validate [--file PATH]
```

| Flag | Description |
|---|---|
| `--file <path>` | Path to config file (default: `cli-bench.yaml` in current directory) |

Checks:

- YAML syntax and schema validity
- No duplicate task IDs
- Valid regex patterns in `ran`, `not_ran`, and `run_count` assertions
- CLI found on PATH (warning if missing)
- `AI_GATEWAY_API_KEY` set (warning if missing)

No API key needed. Exit code 1 on errors, 0 on warnings-only.
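As an illustration, a fragment like the following (task and assertion field shapes are assumptions, not the confirmed schema) would fail validation on two of the checks above — a duplicate task ID and an invalid regex in a `ran` assertion:

```yaml
# Hypothetical failing fragment — field shapes are assumed for illustration.
tasks:
  - id: show-help
    prompt: "Print the help text"
  - id: show-help                  # duplicate task ID -> error, exit code 1
    prompt: "Print usage"
    assertions:
      - ran: "mytool --help ("    # unbalanced "(" -> invalid regex, also an error
```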
### cliwatch bench

Run benchmarks by delegating to `@cliwatch/cli-bench`.

```shell
cliwatch bench [--dry-run] [--upload] [--models ...] [...]
```

All flags are passed through to `cli-bench`. See cli-bench Flags below.
### cliwatch runs

List recent runs or view a specific run.

```shell
# List recent runs
cliwatch runs [--cli NAME] [--json]

# View run detail
cliwatch runs <RUN_ID> [--json]
```

| Flag | Description |
|---|---|
| `<RUN_ID>` | Show detailed results for a specific run |
| `--cli <name>` | Filter runs by CLI name |
| `--json` | Output raw JSON |

Requires `CLIWATCH_API_KEY`.
### cliwatch trend

Show pass rate trends across recent runs.

```shell
cliwatch trend --cli NAME [--json]
```

| Flag | Description |
|---|---|
| `--cli <name>` | CLI name (required) |
| `--json` | Output raw JSON |

Displays a table of pass rates per model across runs, with trend arrows (▲ improving, ▼ regressing, ─ stable).

Requires `CLIWATCH_API_KEY`.
### cliwatch skills

Browse embedded documentation topics designed for AI coding assistants.

```shell
# List all topics
cliwatch skills [--json]

# View a topic
cliwatch skills <TOPIC> [--json]
```

| Topic | Description |
|---|---|
| `setup-tests` | How to write effective CLI benchmark tests |
| `write-tasks` | Task structure, intents, and difficulty guidelines |
| `assertions` | All 10 assertion types with examples |
| `ci-integration` | GitHub Actions and GitLab CI setup |
| `yaml-reference` | Complete `cli-bench.yaml` config reference |
| `providers` | Supported LLM providers and model IDs |
| `thresholds` | Pass rate thresholds and tolerance |
| `help-modes` | How context modes work |
| `troubleshooting` | Common issues and debugging |

Paste `cliwatch skills` output directly into your AI assistant's context for informed help with config and task writing.
## cli-bench Flags

These flags work with both `cli-bench` and `cliwatch bench`:

| Flag | Description | Default |
|---|---|---|
| `--config <path>` | Path to config file | Auto-discovers `cli-bench.yaml` |
| `--filter <ids>` | Comma-separated task IDs to run | All tasks |
| `--models <models>` | Comma-separated model IDs to use | All in config |
| `--tags <tags>` | Comma-separated tags to filter tasks | All tasks |
| `--concurrency <n>` | Max concurrent API calls | `3` |
| `--workdir <dir>` | Working directory for commands | Temporary directory |
| `--repeat <n>` | Run each task N times (1–100) for statistical confidence | `1` |
| `--output <file>` | Write JSON GridReport to file | - |
| `--upload` | Upload results to CLIWatch dashboard | Off |
| `--github-comment <file>` | Write PR comment markdown to file | - |
| `--dry-run` | Print the prompt for the first task without calling any API | Off |
| `--help` | Show help message | - |
### Examples

```shell
# Run only specific tasks with one model
cli-bench --filter show-help,create-project --models anthropic/claude-sonnet-4-5-20250929

# Run only smoke-tagged tasks
cli-bench --tags smoke --upload

# Preview the prompt without spending API credits
cli-bench --dry-run

# Run each task 5 times for statistical confidence
cli-bench --repeat 5 --upload

# Write PR comment markdown for CI
cli-bench --upload --github-comment pr-comment.md
```

Tasks with non-idempotent setup commands may collide when using `--repeat`. Each repeat gets a fresh temp directory, but shared resources (databases, network ports) are not isolated.
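For example, a task whose setup binds a fixed port would collide with its own repeats: the temp directories are isolated, but port 8080 is not. The `setup` field name and shape below are assumptions for illustration, not the confirmed schema:

```yaml
# Hypothetical task — `setup` field is assumed, not confirmed schema.
tasks:
  - id: serve-status
    prompt: "Start the dev server and report its status"
    setup: "mytool serve --port 8080 &"   # fixed port: concurrent repeats collide here
    assertions:
      - ran: "mytool serve"
```

Prefer setup commands that derive unique resource names (random ports, per-run database names) when you plan to use `--repeat`.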
## Global Flags

These flags work with any `cliwatch` command:

| Flag | Description | Default |
|---|---|---|
| `--api-key <key>` | CLIWatch API key | `CLIWATCH_API_KEY` env var |
| `--cli <name>` | CLI name (for filtering) | - |
| `--json` | Output JSON instead of formatted text | Off |
| `--help`, `-h` | Show help | - |
| `--version`, `-v` | Show version | - |
## Environment Variables

| Variable | Used by | Description |
|---|---|---|
| `CLIWATCH_API_KEY` | Both | Authenticates with the CLIWatch API (`cw_` prefix) |
| `AI_GATEWAY_API_KEY` | cli-bench | Authenticates LLM calls through Vercel AI Gateway (`vck_` prefix) |
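In CI, these are typically supplied as repository secrets. A minimal GitHub Actions sketch follows; `cliwatch init --ci` generates the real workflow, which may differ, and the job and secret names here are assumptions:

```yaml
# Hypothetical .github/workflows/cliwatch.yml — use `cliwatch init --ci` for the real one.
name: cliwatch
on: [pull_request]
jobs:
  bench:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install -g @cliwatch/cli @cliwatch/cli-bench
      - run: cliwatch bench --upload
        env:
          CLIWATCH_API_KEY: ${{ secrets.CLIWATCH_API_KEY }}
          AI_GATEWAY_API_KEY: ${{ secrets.AI_GATEWAY_API_KEY }}
```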