
Providers & Models

CLIWatch uses the Vercel AI Gateway to route all model calls. This means you need a single API key (AI_GATEWAY_API_KEY) to access models from any provider — no per-provider keys required.

Setup

Set your AI Gateway API key:

export AI_GATEWAY_API_KEY="vck_..."

In CI, add AI_GATEWAY_API_KEY as a repository secret. That's it — one key for all models.
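
For example, in GitHub Actions the secret can be exposed to the benchmark step as an environment variable. This is a minimal sketch: the workflow trigger, the step name, and the cliwatch run command are assumptions, so adjust them to however you actually invoke cli-bench.

# .github/workflows/cli-bench.yml (illustrative)
name: cli-bench
on: [pull_request]

jobs:
  bench:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run cli-bench
        # The command below is an assumption; use your actual invocation.
        run: cliwatch run
        env:
          # Forward the repository secret to the benchmark run
          AI_GATEWAY_API_KEY: ${{ secrets.AI_GATEWAY_API_KEY }}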

Model Format

Models use the provider/model-id format in your cli-bench.yaml:

providers:
- anthropic/claude-sonnet-4-20250514
- openai/gpt-4o
- google/gemini-2.5-flash

These are some commonly used models, but any model supported by the Vercel AI Gateway can be used — just pass the full provider/model-id.

Anthropic

Model ID                               Description
anthropic/claude-sonnet-4-20250514     Claude Sonnet 4 — balanced performance and cost
anthropic/claude-haiku-4-5-20251001    Claude Haiku 4.5 — fast and cost-effective

OpenAI

Model ID             Description
openai/gpt-4o        GPT-4o — flagship model
openai/gpt-4o-mini   GPT-4o Mini — smaller, faster, cheaper

Google

Model ID                  Description
google/gemini-2.5-pro     Gemini 2.5 Pro — highest capability
google/gemini-2.5-flash   Gemini 2.5 Flash — fast and capable

Meta

Model ID            Description
meta/llama-3.1-8b   Llama 3.1 8B — open-source

Mistral

Model ID               Description
mistral/ministral-3b   Ministral 3B — lightweight

Using Any Gateway Model

You're not limited to the models above. Any model available through the Vercel AI Gateway works — just use its provider/model-id:

providers:
- anthropic/claude-opus-4-6
- openai/o3-mini
- google/gemini-2.0-flash
- meta/llama-3.3-70b

Unknown model IDs are passed through to the gateway as-is. If the gateway supports it, cli-bench will use it.

Comparing Multiple Models

Test with multiple providers to compare LLM performance on your CLI:

providers:
- anthropic/claude-sonnet-4-20250514
- openai/gpt-4o
- google/gemini-2.5-flash

Each model runs all tasks independently. Results are grouped by model in the dashboard.

Per-Model Thresholds

Set different pass rate requirements per model:

providers:
- anthropic/claude-sonnet-4-20250514
- openai/gpt-4o-mini

thresholds:
  default: 80
  models:
    anthropic/claude-sonnet-4-20250514: 90
    openai/gpt-4o-mini: 70

See Thresholds & Tolerance for details.

Tips

  • Start with one model, then add more once your tasks are stable
  • Use cliwatch validate to check your config before running (see the command sketch after this list)
  • Use --dry-run to test your prompt without an API key
  • If no providers are specified, cli-bench defaults to anthropic/claude-sonnet-4-20250514
  • Different models may need different difficulty calibrations
  • Compare models on the dashboard at app.cliwatch.com
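
A quick local check before pushing changes. cliwatch validate and the --dry-run flag are mentioned above; attaching --dry-run to a cliwatch run subcommand is an assumption about the CLI's layout.

# Check cli-bench.yaml before running anything
cliwatch validate

# Test your prompt without an API key
# (assumed invocation: the docs name the --dry-run flag, not the subcommand it attaches to)
cliwatch run --dry-run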