# GitHub Actions

Run CLIWatch benchmarks on every PR and push to main.
## Basic Workflow

Create `.github/workflows/cliwatch.yml`:
```yaml
name: CLIWatch Benchmarks

on:
  pull_request:
  push:
    branches: [main]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
      - run: npm install -g @cliwatch/cli-bench
      - run: cli-bench
        env:
          AI_GATEWAY_API_KEY: ${{ secrets.AI_GATEWAY_API_KEY }}
          CLIWATCH_API_KEY: ${{ secrets.CLIWATCH_API_KEY }}
```
## Environment Variables

Add these as repository secrets in **Settings → Secrets and variables → Actions**:

| Variable | Required | Description |
|---|---|---|
| `CLIWATCH_API_KEY` | Yes | Your CLIWatch API key for uploading results |
| `AI_GATEWAY_API_KEY` | Yes | Vercel AI Gateway key (provides access to all models) |
All model calls go through the Vercel AI Gateway, so you only need one API key for all providers.
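Before wiring the secrets into CI, it can help to verify the same two variables locally. A minimal sketch (the key values below are hypothetical placeholders, not real formats):

```shell
# Hypothetical placeholder values -- substitute your real keys.
export AI_GATEWAY_API_KEY="vck-example-key"
export CLIWATCH_API_KEY="cw-example-key"

# Fail fast with a clear message if either key is unset or empty,
# before spending any time invoking cli-bench.
: "${AI_GATEWAY_API_KEY:?AI_GATEWAY_API_KEY is not set}"
: "${CLIWATCH_API_KEY:?CLIWATCH_API_KEY is not set}"
echo "keys configured"
```

The `: "${VAR:?message}"` idiom aborts the script with the message when the variable is missing, which surfaces misconfigured secrets immediately rather than partway through a run.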
## PR Comments via GitHub App

CLIWatch can automatically post eval results as comments on your pull requests.
### Setup

1. Go to app.cliwatch.com, then **Settings → General**
2. Click **Connect GitHub** and authorize the CLIWatch GitHub App
3. Select the repositories you want to enable

Once connected, the app posts a comment on every PR that triggers an eval run.
### What PR Comments Show

- **Pass rate summary**: overall pass rate for the run, with comparison to the main branch
- **Regressions**: tasks that passed on main but fail on the PR branch
- **Per-model breakdown**: pass rates for each model tested
- **Link to dashboard**: click through to the full run detail with conversation traces

PR comments update automatically when you push new commits to the same PR.
## Threshold-Based CI Gating

Use thresholds to fail CI when pass rates drop:
```yaml
# cli-bench.yaml
thresholds:
  default: 80
  tolerance: 5
  behavior: error
  models:
    anthropic/claude-sonnet-4.6: 90
    google/gemini-3-flash: 70
```
With this config, `cli-bench` exits with code 1 when any threshold is violated, which fails the CI job.
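As an illustration of how a gate like this might evaluate (a sketch only; the exact interaction of `tolerance` with per-model thresholds is assumed here, not confirmed by the config reference):

```shell
# Assumed semantics: a pass rate violates its threshold when it falls
# more than `tolerance` points below it.
threshold=80
tolerance=5
pass_rate=74

if [ "$pass_rate" -lt "$((threshold - tolerance))" ]; then
  result="violated"   # 74 < 80 - 5 = 75 -> CI fails
else
  result="ok"
fi
echo "$result"
```

Under this reading, a run at 76% would still pass against `default: 80` with `tolerance: 5`, while 74% would fail the job.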
## With Build Step

If your CLI needs to be built before benchmarking:
```yaml
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: npm
      - run: npm ci
      - run: npm run build
      - run: npm install -g @cliwatch/cli-bench
      - run: cli-bench
        env:
          AI_GATEWAY_API_KEY: ${{ secrets.AI_GATEWAY_API_KEY }}
          CLIWATCH_API_KEY: ${{ secrets.CLIWATCH_API_KEY }}
```
## PR Comments via `--github-comment`

If you prefer a lightweight approach without installing the GitHub App, use `--github-comment` to write a markdown summary to a file, then post it with `gh`:
```yaml
jobs:
  benchmark:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
      - run: npm install -g @cliwatch/cli-bench
      - run: cli-bench --github-comment pr-comment.md
        env:
          AI_GATEWAY_API_KEY: ${{ secrets.AI_GATEWAY_API_KEY }}
          CLIWATCH_API_KEY: ${{ secrets.CLIWATCH_API_KEY }}
      - name: Post PR comment
        if: github.event_name == 'pull_request' && hashFiles('pr-comment.md') != ''
        run: gh pr comment ${{ github.event.pull_request.number }} --body-file pr-comment.md
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```
The `--github-comment` flag is skipped in `--dry-run` mode.
## Tips

- Run benchmarks on PRs to catch regressions before merge
- Use `upload: auto` (the default): results upload when `CLIWATCH_API_KEY` is set
- Set `concurrency: 1` in CI if your CLI modifies shared state
- Use `--dry-run` to test your config without running the LLM
- Use `--tags smoke` to run a fast subset of tasks in CI
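Pulling these tips together, a CI-oriented config might look like the following sketch (assuming `upload` and `concurrency` are top-level keys alongside `thresholds`, as the tips above suggest):

```yaml
# cli-bench.yaml
upload: auto      # default: upload results when CLIWATCH_API_KEY is set
concurrency: 1    # serialize tasks if the CLI mutates shared state
thresholds:
  default: 80
  behavior: error
```

With this in place, a fast PR gate could invoke `cli-bench --tags smoke`, reserving the full suite for pushes to main.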