
Providers & Models

CLIWatch uses the Vercel AI Gateway to route all model calls, so a single API key (AI_GATEWAY_API_KEY) gives you access to models from every provider; no per-provider keys are required.

Setup

Set your AI Gateway API key:

export AI_GATEWAY_API_KEY="vck_..."

In CI, add AI_GATEWAY_API_KEY as a repository secret. One key for all models.
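In GitHub Actions, for example, the repository secret can be passed to the job as an environment variable. A minimal sketch (the workflow layout, step names, and run command are illustrative, not prescribed by CLIWatch):

```yaml
# .github/workflows/cli-bench.yml (illustrative)
jobs:
  bench:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run benchmarks
        env:
          AI_GATEWAY_API_KEY: ${{ secrets.AI_GATEWAY_API_KEY }}
        run: cliwatch run  # hypothetical subcommand; check your CLI's help output
```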

Model Format

Models use the provider/model-id format in your cli-bench.yaml:

providers:
- anthropic/claude-sonnet-4.6
- openai/gpt-5.2
- google/gemini-3-flash

These are commonly used models, but any model supported by the Vercel AI Gateway works; just pass the full provider/model-id.
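The format splits on the first slash: everything before it is the provider, everything after is the model ID. A minimal Python sketch of that parsing (the function name is illustrative, not part of CLIWatch):

```python
def parse_model(spec: str) -> tuple[str, str]:
    """Split a 'provider/model-id' spec into (provider, model_id).

    Splits on the first '/' only, so model IDs that themselves
    contain slashes are preserved intact.
    """
    provider, _, model_id = spec.partition("/")
    if not provider or not model_id:
        raise ValueError(f"expected 'provider/model-id', got {spec!r}")
    return provider, model_id

print(parse_model("anthropic/claude-sonnet-4.6"))
# → ('anthropic', 'claude-sonnet-4.6')
```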

Anthropic

Model ID                      Description
anthropic/claude-opus-4.6     Claude Opus 4.6, frontier performance
anthropic/claude-sonnet-4.6   Claude Sonnet 4.6, balanced performance and cost
anthropic/claude-haiku-4.5    Claude Haiku 4.5, fast and cost-effective

OpenAI

Model ID        Description
openai/gpt-5.2  GPT-5.2, frontier model

Google

Model ID                 Description
google/gemini-3-pro      Gemini 3 Pro
google/gemini-3-flash    Gemini 3 Flash, fast and capable
google/gemini-2.5-flash  Gemini 2.5 Flash

Using Any Gateway Model

You are not limited to the models above. Any model available through the Vercel AI Gateway works:

providers:
- anthropic/claude-opus-4.6
- openai/gpt-5.2
- google/gemini-3-flash
- meta/llama-3.1-8b

Unknown model IDs are passed through to the gateway as-is. If the gateway supports it, cli-bench will use it.

Comparing Multiple Models

Test with multiple providers to compare LLM performance on your CLI:

providers:
- anthropic/claude-sonnet-4.6
- openai/gpt-5.2
- google/gemini-3-flash

Each model runs all tasks independently. Results are grouped by model in the dashboard.
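That grouping reduces to bucketing task results by model and computing a pass rate per bucket. A sketch in Python (the flat `(model, passed)` record shape is an assumption for illustration; CLIWatch's actual result schema may differ):

```python
from collections import defaultdict

def pass_rates_by_model(results):
    """Compute per-model pass rates (percent) from (model, passed) records."""
    totals = defaultdict(lambda: [0, 0])  # model -> [passed, total]
    for model, passed in results:
        totals[model][0] += int(passed)
        totals[model][1] += 1
    return {m: 100 * p / t for m, (p, t) in totals.items()}

results = [
    ("anthropic/claude-sonnet-4.6", True),
    ("anthropic/claude-sonnet-4.6", False),
    ("openai/gpt-5.2", True),
    ("openai/gpt-5.2", True),
]
print(pass_rates_by_model(results))
# → {'anthropic/claude-sonnet-4.6': 50.0, 'openai/gpt-5.2': 100.0}
```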

Per-Model Thresholds

Set different pass rate requirements per model:

providers:
- anthropic/claude-sonnet-4.6
- openai/gpt-5.2

thresholds:
  default: 80
  models:
    anthropic/claude-sonnet-4.6: 90
    openai/gpt-5.2: 70

See Thresholds & Tolerance for details.
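The lookup rule is simply "use the model's own entry if one exists, otherwise fall back to the default". A sketch of that resolution logic in Python (the function name and dict shape mirror the YAML above but are illustrative):

```python
def threshold_for(model: str, thresholds: dict) -> float:
    """Return the pass-rate threshold for a model: the per-model
    override if one is configured, otherwise the default."""
    return thresholds.get("models", {}).get(model, thresholds["default"])

thresholds = {
    "default": 80,
    "models": {
        "anthropic/claude-sonnet-4.6": 90,
        "openai/gpt-5.2": 70,
    },
}
print(threshold_for("anthropic/claude-sonnet-4.6", thresholds))  # → 90
print(threshold_for("google/gemini-3-flash", thresholds))        # → 80 (default)
```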

Tips

  • Start with one model, add more once your tasks are stable
  • Use cliwatch validate to check your config before running
  • Use --dry-run to test your prompt without an API key
  • Different models may need different difficulty calibrations
  • Compare models on the dashboard at app.cliwatch.com