# CLI Reference

CLIWatch ships two npm packages:

- `@cliwatch/cli`: dashboard companion. Scaffolds configs, validates task suites, views runs and trends, and embeds skills documentation for AI assistants.
- `@cliwatch/cli-bench`: benchmark runner. Executes your task suite against LLMs and uploads results.

Install them globally:

```shell
npm install -g @cliwatch/cli @cliwatch/cli-bench
```

`cliwatch bench` delegates to `cli-bench`; you can use either.
## cliwatch Commands

### cliwatch init

Scaffold a `cli-bench.yaml` config and optional GitHub Actions workflow.

```shell
cliwatch init [--cli NAME] [--ci] [--dir PATH]
```

| Flag | Description |
|---|---|
| `--cli <name>` | CLI name (skips interactive prompt, useful for AI assistants) |
| `--ci` | Also generate `.github/workflows/cliwatch.yml` (requires `--cli`) |
| `--dir <path>` | Target directory (default: current directory) |

Auto-detects the CLI name from `package.json`, `Cargo.toml`, `go.mod`, or `pyproject.toml`.
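The scaffolded config might look roughly like the sketch below. The field names are assumptions pieced together from the features documented on this page, not the confirmed schema — run `cliwatch skills yaml-reference` for the authoritative reference:

```yaml
# Hypothetical cli-bench.yaml sketch — field names are assumed, not confirmed schema.
cli: mytool                                    # the CLI under test
models:
  - anthropic/claude-sonnet-4-5-20250929       # model IDs, as in --models
tasks:
  - id: show-help                              # unique task ID
    prompt: "Show the help text for mytool"
    tags: [smoke]                              # filterable via --tags
    assertions:
      - ran: "mytool --help"                   # regex matched against executed commands
```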
### cliwatch validate

Validate a `cli-bench.yaml` config file without running benchmarks.

```shell
cliwatch validate [--file PATH]
```

| Flag | Description |
|---|---|
| `--file <path>` | Path to config file (default: `cli-bench.yaml` in current directory) |

Checks:

- YAML syntax and schema validity
- No duplicate task IDs
- Valid regex patterns in `ran`, `not_ran`, and `run_count` assertions
- CLI found on PATH (warning if missing)
- `AI_GATEWAY_API_KEY` set (warning if missing)

No API key needed. Exit code 1 on errors, 0 on warnings-only.
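As an illustration, a fragment like the following (task and assertion field shapes are assumptions, not the confirmed schema) would fail validation on two of the checks above — a duplicate task ID and an invalid regex in a `ran` assertion:

```yaml
# Hypothetical failing fragment — field shapes are assumed for illustration.
tasks:
  - id: show-help
    prompt: "Print the help text"
  - id: show-help                  # duplicate task ID -> error, exit code 1
    prompt: "Print usage"
    assertions:
      - ran: "mytool --help ("    # unbalanced "(" -> invalid regex, also an error
```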
### cliwatch bench

Run benchmarks by delegating to `@cliwatch/cli-bench`.

```shell
cliwatch bench [--dry-run] [--upload] [--models ...] [...]
```

All flags are passed through to `cli-bench`. See cli-bench Flags below.
### cliwatch runs

List recent runs or view a specific run.

```shell
# List recent runs
cliwatch runs [--cli NAME] [--json]

# View run detail
cliwatch runs <RUN_ID> [--json]
```

| Flag | Description |
|---|---|
| `<RUN_ID>` | Show detailed results for a specific run |
| `--cli <name>` | Filter runs by CLI name |
| `--json` | Output raw JSON |

Requires `CLIWATCH_API_KEY`.
### cliwatch trend

Show pass rate trends across recent runs.

```shell
cliwatch trend --cli NAME [--json]
```

| Flag | Description |
|---|---|
| `--cli <name>` | CLI name (required) |
| `--json` | Output raw JSON |

Displays a table of pass rates per model across runs, with trend arrows (▲ improving, ▼ regressing, ─ stable).

Requires `CLIWATCH_API_KEY`.
### cliwatch skills

Browse embedded documentation topics designed for AI coding assistants.

```shell
# List all topics
cliwatch skills [--json]

# View a topic
cliwatch skills <TOPIC> [--json]
```

| Topic | Description |
|---|---|
| `setup-tests` | How to write effective CLI benchmark tests |
| `write-tasks` | Task structure, intents, and difficulty guidelines |
| `assertions` | All 10 assertion types with examples |
| `ci-integration` | GitHub Actions and GitLab CI setup |
| `yaml-reference` | Complete `cli-bench.yaml` config reference |
| `providers` | Supported LLM providers and model IDs |
| `thresholds` | Pass rate thresholds and tolerance |
| `help-modes` | How context modes work |
| `troubleshooting` | Common issues and debugging |

Paste `cliwatch skills` output directly into your AI assistant's context for informed help with config and task writing.
## cli-bench Flags

These flags work with both `cli-bench` and `cliwatch bench`:

| Flag | Description | Default |
|---|---|---|
| `--config <path>` | Path to config file | Auto-discovers `cli-bench.yaml` |
| `--filter <ids>` | Comma-separated task IDs to run | All tasks |
| `--models <models>` | Comma-separated model IDs to use | All in config |
| `--tags <tags>` | Comma-separated tags to filter tasks | All tasks |
| `--concurrency <n>` | Max concurrent API calls | `3` |
| `--workdir <dir>` | Working directory for commands | Temporary directory |
| `--repeat <n>` | Run each task N times (1–100) for statistical confidence | `1` |
| `--output <file>` | Write JSON GridReport to file | - |
| `--upload` | Upload results to CLIWatch dashboard | Off |
| `--github-comment <file>` | Write PR comment markdown to file | - |
| `--dry-run` | Print the prompt for the first task without calling any API | Off |
| `--help` | Show help message | - |
### Examples

```shell
# Run only specific tasks with one model
cli-bench --filter show-help,create-project --models anthropic/claude-sonnet-4-5-20250929

# Run only smoke-tagged tasks
cli-bench --tags smoke --upload

# Preview the prompt without spending API credits
cli-bench --dry-run

# Run each task 5 times for statistical confidence
cli-bench --repeat 5 --upload

# Write PR comment markdown for CI
cli-bench --upload --github-comment pr-comment.md
```

Tasks with non-idempotent setup commands may collide when using `--repeat`. Each repeat gets a fresh temp directory, but shared resources (databases, network ports) are not isolated.
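For example, a task whose setup binds a fixed port would collide with its own repeats: the temp directories are isolated, but port 8080 is not. The `setup` field name and shape below are assumptions for illustration, not the confirmed schema:

```yaml
# Hypothetical task — `setup` field is assumed, not confirmed schema.
tasks:
  - id: serve-status
    prompt: "Start the dev server and report its status"
    setup: "mytool serve --port 8080 &"   # fixed port: concurrent repeats collide here
    assertions:
      - ran: "mytool serve"
```

Prefer setup commands that derive unique resource names (random ports, per-run database names) when you plan to use `--repeat`.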
## Global Flags

These flags work with any `cliwatch` command:

| Flag | Description | Default |
|---|---|---|
| `--api-key <key>` | CLIWatch API key | `CLIWATCH_API_KEY` env var |
| `--cli <name>` | CLI name (for filtering) | - |
| `--json` | Output JSON instead of formatted text | Off |
| `--help`, `-h` | Show help | - |
| `--version`, `-v` | Show version | - |
## Environment Variables

| Variable | Used by | Description |
|---|---|---|
| `CLIWATCH_API_KEY` | Both | Authenticates with the CLIWatch API (`cw_` prefix) |
| `AI_GATEWAY_API_KEY` | cli-bench | Authenticates LLM calls through Vercel AI Gateway (`vck_` prefix) |
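In CI, these are typically supplied as repository secrets. A minimal GitHub Actions sketch follows; `cliwatch init --ci` generates the real workflow, which may differ, and the job and secret names here are assumptions:

```yaml
# Hypothetical .github/workflows/cliwatch.yml — use `cliwatch init --ci` for the real one.
name: cliwatch
on: [pull_request]
jobs:
  bench:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install -g @cliwatch/cli @cliwatch/cli-bench
      - run: cliwatch bench --upload
        env:
          CLIWATCH_API_KEY: ${{ secrets.CLIWATCH_API_KEY }}
          AI_GATEWAY_API_KEY: ${{ secrets.AI_GATEWAY_API_KEY }}
```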