# Providers & Models

CLIWatch routes all model calls through the Vercel AI Gateway. A single API key (`AI_GATEWAY_API_KEY`) gives you access to models from any provider; no per-provider keys are required.
## Setup

Set your AI Gateway API key:

```bash
export AI_GATEWAY_API_KEY="vck_..."
```

In CI, add `AI_GATEWAY_API_KEY` as a repository secret. That's it: one key for all models.
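For GitHub Actions specifically, exposing the secret to a run might look like the sketch below. The workflow name, trigger, and the `cliwatch run` invocation are illustrative assumptions, not part of cli-bench's documented interface:

```yaml
# Illustrative workflow; the job layout and `cliwatch run` command are assumptions.
name: cli-bench
on: [pull_request]

jobs:
  bench:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run cli-bench
        run: cliwatch run            # hypothetical invocation
        env:
          AI_GATEWAY_API_KEY: ${{ secrets.AI_GATEWAY_API_KEY }}
```

The key point is the `env` block on the step that runs the benchmark: the secret is injected as an environment variable rather than written into the config file.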
## Model Format

Models use the `provider/model-id` format in your `cli-bench.yaml`:

```yaml
providers:
  - anthropic/claude-sonnet-4-20250514
  - openai/gpt-4o
  - google/gemini-2.5-flash
```
## Popular Models

These are some commonly used models, but any model supported by the Vercel AI Gateway can be used; just pass the full `provider/model-id`.
### Anthropic

| Model ID | Description |
|---|---|
| `anthropic/claude-sonnet-4-20250514` | Claude Sonnet 4 — balanced performance and cost |
| `anthropic/claude-haiku-4-5-20251001` | Claude Haiku 4.5 — fast and cost-effective |

### OpenAI

| Model ID | Description |
|---|---|
| `openai/gpt-4o` | GPT-4o — flagship model |
| `openai/gpt-4o-mini` | GPT-4o Mini — smaller, faster, cheaper |

### Google

| Model ID | Description |
|---|---|
| `google/gemini-2.5-pro` | Gemini 2.5 Pro — highest capability |
| `google/gemini-2.5-flash` | Gemini 2.5 Flash — fast and capable |

### Meta

| Model ID | Description |
|---|---|
| `meta/llama-3.1-8b` | Llama 3.1 8B — open-source |

### Mistral

| Model ID | Description |
|---|---|
| `mistral/ministral-3b` | Ministral 3B — lightweight |
## Using Any Gateway Model

You're not limited to the models above. Any model available through the Vercel AI Gateway works; just use its `provider/model-id`:

```yaml
providers:
  - anthropic/claude-opus-4-6
  - openai/o3-mini
  - google/gemini-2.0-flash
  - meta/llama-3.3-70b
```

Unknown model IDs are passed through to the gateway as-is. If the gateway supports it, cli-bench will use it.
## Comparing Multiple Models

Test with multiple providers to compare LLM performance on your CLI:

```yaml
providers:
  - anthropic/claude-sonnet-4-20250514
  - openai/gpt-4o
  - google/gemini-2.5-flash
```

Each model runs all tasks independently. Results are grouped by model in the dashboard.
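The per-model grouping can be sketched in a few lines of Python. The record format here is hypothetical; cli-bench's actual result schema may differ:

```python
from collections import defaultdict

# Hypothetical (model, passed) records - one per task attempt.
runs = [
    ("anthropic/claude-sonnet-4-20250514", True),
    ("anthropic/claude-sonnet-4-20250514", False),
    ("openai/gpt-4o", True),
    ("google/gemini-2.5-flash", True),
]

# Group outcomes by model, then compute each model's pass rate (%).
by_model = defaultdict(list)
for model, passed in runs:
    by_model[model].append(passed)

pass_rates = {model: 100 * sum(results) / len(results)
              for model, results in by_model.items()}
```

With the sample records above, Claude Sonnet 4 scores 50% (one pass, one fail) and the other two models score 100%.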
## Per-Model Thresholds

Set different pass rate requirements per model:

```yaml
providers:
  - anthropic/claude-sonnet-4-20250514
  - openai/gpt-4o-mini

thresholds:
  default: 80
  models:
    anthropic/claude-sonnet-4-20250514: 90
    openai/gpt-4o-mini: 70
```

See Thresholds & Tolerance for details.
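Presumably a model's entry under `thresholds.models` overrides `thresholds.default`. A minimal sketch of that resolution, assuming so (the helper function is ours, not part of cli-bench):

```python
# Mirrors the YAML above: per-model overrides fall back to the default.
thresholds = {
    "default": 80,
    "models": {
        "anthropic/claude-sonnet-4-20250514": 90,
        "openai/gpt-4o-mini": 70,
    },
}

def required_pass_rate(model: str) -> int:
    """Hypothetical helper: resolve the pass-rate threshold for a model."""
    return thresholds["models"].get(model, thresholds["default"])
```

Under this reading, a model not listed in `models` (say, `google/gemini-2.5-flash`) would be held to the default of 80.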
## Tips

- Start with one model, add more once your tasks are stable
- Use `cliwatch validate` to check your config before running
- Use `--dry-run` to test your prompt without an API key
- If no `providers` are specified, defaults to `anthropic/claude-sonnet-4-20250514`
- Different models may need different difficulty calibrations
- Compare models on the dashboard at app.cliwatch.com