# Providers & Models
CLIWatch uses the Vercel AI Gateway to route all model calls. A single API key (`AI_GATEWAY_API_KEY`) gives you access to models from every provider; no per-provider keys are required.
## Setup
Set your AI Gateway API key:
```bash
export AI_GATEWAY_API_KEY="vck_..."
```
In CI, add `AI_GATEWAY_API_KEY` as a repository secret. One key for all models.
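For example, in GitHub Actions the secret can be exposed to the benchmark step as an environment variable. This is a sketch: the workflow layout and the final `run` command are illustrative, not a prescribed CLIWatch setup.

```yaml
# .github/workflows/cli-bench.yml (illustrative)
name: cli-bench
on: [push]

jobs:
  bench:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run benchmarks
        env:
          # Maps the repository secret to the env var CLIWatch reads
          AI_GATEWAY_API_KEY: ${{ secrets.AI_GATEWAY_API_KEY }}
        run: cliwatch run  # assumed invocation; adjust to your setup
```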
## Model Format
Models use the `provider/model-id` format in your `cli-bench.yaml`:
```yaml
providers:
  - anthropic/claude-sonnet-4.6
  - openai/gpt-5.2
  - google/gemini-3-flash
```
## Popular Models
These are some commonly used models, but any model supported by the Vercel AI Gateway can be used. Just pass the full `provider/model-id`.
### Anthropic
| Model ID | Description |
|---|---|
| `anthropic/claude-opus-4.6` | Claude Opus 4.6, frontier performance |
| `anthropic/claude-sonnet-4.6` | Claude Sonnet 4.6, balanced performance and cost |
| `anthropic/claude-haiku-4.5` | Claude Haiku 4.5, fast and cost-effective |
### OpenAI
| Model ID | Description |
|---|---|
| `openai/gpt-5.2` | GPT-5.2, frontier model |
### Google
| Model ID | Description |
|---|---|
| `google/gemini-3-pro` | Gemini 3 Pro |
| `google/gemini-3-flash` | Gemini 3 Flash, fast and capable |
| `google/gemini-2.5-flash` | Gemini 2.5 Flash |
## Using Any Gateway Model
You are not limited to the models above. Any model available through the Vercel AI Gateway works:
```yaml
providers:
  - anthropic/claude-opus-4.6
  - openai/gpt-5.2
  - google/gemini-3-flash
  - meta/llama-3.1-8b
```
Unknown model IDs are passed through to the gateway as-is. If the gateway supports it, `cli-bench` will use it.
## Comparing Multiple Models
Test with multiple providers to compare LLM performance on your CLI:
```yaml
providers:
  - anthropic/claude-sonnet-4.6
  - openai/gpt-5.2
  - google/gemini-3-flash
```
Each model runs all tasks independently. Results are grouped by model in the dashboard.
## Per-Model Thresholds
Set different pass rate requirements per model:
```yaml
providers:
  - anthropic/claude-sonnet-4.6
  - openai/gpt-5.2

thresholds:
  default: 80
  models:
    anthropic/claude-sonnet-4.6: 90
    openai/gpt-5.2: 70
```
See Thresholds & Tolerance for details.
## Tips
- Start with one model, add more once your tasks are stable
- Use `cliwatch validate` to check your config before running
- Use `--dry-run` to test your prompt without an API key
- Different models may need different difficulty calibrations
- Compare models on the dashboard at app.cliwatch.com
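A typical pre-run check based on the tips above might look like this. Note that `cliwatch validate` and `--dry-run` come from this page, but attaching `--dry-run` to a `run` subcommand is an assumption; check your CLI's help output for the exact invocation.

```shell
# Check cli-bench.yaml for errors before spending API credits
cliwatch validate

# Exercise your tasks without calling the gateway (no API key needed)
# (assumed: --dry-run attaches to the main run command)
cliwatch run --dry-run
```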