Skip to main content

CLI Reference

CLIWatch ships two npm packages:

  • @cliwatch/cli: dashboard companion. Scaffolds configs, validates task suites, views runs and trends, and embeds skills documentation for AI assistants.
  • @cliwatch/cli-bench: benchmark runner. Executes your task suite against LLMs and uploads results.

Install them globally:

npm install -g @cliwatch/cli @cliwatch/cli-bench

cliwatch bench delegates to cli-bench; you can use either.


cliwatch Commands

cliwatch init

Scaffold a cli-bench.yaml config and optional GitHub Actions workflow.

cliwatch init [--cli NAME] [--ci] [--dir PATH]
FlagDescription
--cli <name>CLI name (skips interactive prompt, useful for AI assistants)
--ciAlso generate .github/workflows/cliwatch.yml (requires --cli)
--dir <path>Target directory (default: current directory)

Auto-detects CLI name from package.json, Cargo.toml, go.mod, or pyproject.toml.

cliwatch validate

Validate a cli-bench.yaml config file without running benchmarks.

cliwatch validate [--file PATH]
FlagDescription
--file <path>Path to config file (default: cli-bench.yaml in current directory)

Checks:

  • YAML syntax and schema validity
  • No duplicate task IDs
  • Valid regex patterns in ran, not_ran, and run_count assertions
  • CLI found on PATH (warning if missing)
  • AI_GATEWAY_API_KEY set (warning if missing)

No API key needed. Exit code 1 on errors, 0 on warnings-only.

cliwatch bench

Run benchmarks by delegating to @cliwatch/cli-bench.

cliwatch bench [--dry-run] [--upload] [--models ...] [...]

All flags are passed through to cli-bench. See cli-bench Flags below.

cliwatch runs

List recent runs or view a specific run.

# List recent runs
cliwatch runs [--cli NAME] [--json]

# View run detail
cliwatch runs <RUN_ID> [--json]
FlagDescription
<RUN_ID>Show detailed results for a specific run
--cli <name>Filter runs by CLI name
--jsonOutput raw JSON

Requires CLIWATCH_API_KEY.

cliwatch trend

Show pass rate trends across recent runs.

cliwatch trend --cli NAME [--json]
FlagDescription
--cli <name>CLI name (required)
--jsonOutput raw JSON

Displays a table of pass rates per model across runs, with trend arrows (▲ improving, ▼ regressing, ─ stable).

Requires CLIWATCH_API_KEY.

cliwatch skills

Browse embedded documentation topics designed for AI coding assistants.

# List all topics
cliwatch skills [--json]

# View a topic
cliwatch skills <TOPIC> [--json]
TopicDescription
setup-testsHow to write effective CLI benchmark tests
write-tasksTask structure, intents, and difficulty guidelines
assertionsAll 10 assertion types with examples
ci-integrationGitHub Actions and GitLab CI setup
yaml-referenceComplete cli-bench.yaml config reference
providersSupported LLM providers and model IDs
thresholdsPass rate thresholds and tolerance
help-modesHow context modes work
troubleshootingCommon issues and debugging
tip

Paste cliwatch skills output directly into your AI assistant's context for informed help with config and task writing.


cli-bench Flags

These flags work with both cli-bench and cliwatch bench:

FlagDescriptionDefault
--config <path>Path to config fileAuto-discovers cli-bench.yaml
--filter <ids>Comma-separated task IDs to runAll tasks
--models <models>Comma-separated model IDs to useAll in config
--tags <tags>Comma-separated tags to filter tasksAll tasks
--concurrency <n>Max concurrent API calls3
--workdir <dir>Working directory for commandsTemporary directory
--repeat <n>Run each task N times (1-100) for statistical confidence1
--output <file>Write JSON GridReport to file-
--uploadUpload results to CLIWatch dashboardOff
--github-comment <file>Write PR comment markdown to file-
--dry-runPrint prompt for first task without calling any APIOff
--helpShow help message-

Examples

# Run only specific tasks with one model
cli-bench --filter show-help,create-project --models anthropic/claude-sonnet-4-5-20250929

# Run only smoke-tagged tasks
cli-bench --tags smoke --upload

# Preview the prompt without spending API credits
cli-bench --dry-run

# Run each task 5 times for statistical confidence
cli-bench --repeat 5 --upload

# Write PR comment markdown for CI
cli-bench --upload --github-comment pr-comment.md
caution

Tasks with non-idempotent setup commands may collide when using --repeat. Each repeat gets a fresh temp directory, but shared resources (databases, network ports) are not isolated.


Global Flags

These flags work with any cliwatch command:

FlagDescriptionDefault
--api-key <key>CLIWatch API keyCLIWATCH_API_KEY env var
--cli <name>CLI name (for filtering)-
--jsonOutput JSON instead of formatted textOff
--help, -hShow help-
--version, -vShow version-

Environment Variables

VariableUsed byDescription
CLIWATCH_API_KEYBothAuthenticates with the CLIWatch API (cw_ prefix)
AI_GATEWAY_API_KEYcli-benchAuthenticates LLM calls through Vercel AI Gateway (vck_ prefix)