# Providers & Models
CLIWatch uses the Vercel AI Gateway to route all model calls. A single API key (`AI_GATEWAY_API_KEY`) gives you access to models from every provider; no per-provider keys are required.
## Setup
Set your AI Gateway API key:
```bash
export AI_GATEWAY_API_KEY="vck_..."
```
In CI, add `AI_GATEWAY_API_KEY` as a repository secret. One key for all models.
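For example, in GitHub Actions the secret can be exposed to the benchmark step as an environment variable. This is a sketch: the workflow layout and the final `run` command are illustrative, not a prescribed CLIWatch setup.

```yaml
# .github/workflows/cli-bench.yml (illustrative)
name: cli-bench
on: [push]

jobs:
  bench:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run benchmarks
        env:
          # Maps the repository secret to the env var CLIWatch reads
          AI_GATEWAY_API_KEY: ${{ secrets.AI_GATEWAY_API_KEY }}
        run: cliwatch run  # assumed invocation; adjust to your setup
```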
## Model Format
Models use the `provider/model-id` format in your `cli-bench.yaml`:
```yaml
providers:
  - anthropic/claude-sonnet-4.6
  - openai/gpt-5.2
  - google/gemini-3-flash
```
## Popular Models
These are some commonly used models, but any model supported by the Vercel AI Gateway can be used. Just pass the full `provider/model-id`.
### Anthropic
| Model ID | Description |
|---|---|
| `anthropic/claude-opus-4.6` | Claude Opus 4.6, frontier performance |
| `anthropic/claude-sonnet-4.6` | Claude Sonnet 4.6, balanced performance and cost |
| `anthropic/claude-haiku-4.5` | Claude Haiku 4.5, fast and cost-effective |
### OpenAI
| Model ID | Description |
|---|---|
| `openai/gpt-5.2` | GPT-5.2, frontier model |
### Google
| Model ID | Description |
|---|---|
| `google/gemini-3-pro` | Gemini 3 Pro |
| `google/gemini-3-flash` | Gemini 3 Flash, fast and capable |
| `google/gemini-2.5-flash` | Gemini 2.5 Flash |
## Using Any Gateway Model
You are not limited to the models above. Any model available through the Vercel AI Gateway works:
```yaml
providers:
  - anthropic/claude-opus-4.6
  - openai/gpt-5.2
  - google/gemini-3-flash
  - meta/llama-3.1-8b
```
Unknown model IDs are passed through to the gateway as-is. If the gateway supports it, `cli-bench` will use it.
## Comparing Multiple Models
Test with multiple providers to compare LLM performance on your CLI:
```yaml
providers:
  - anthropic/claude-sonnet-4.6
  - openai/gpt-5.2
  - google/gemini-3-flash
```
Each model runs all tasks independently. Results are grouped by model in the dashboard.
## Per-Model Thresholds
Set different pass rate requirements per model:
```yaml
providers:
  - anthropic/claude-sonnet-4.6
  - openai/gpt-5.2

thresholds:
  default: 80
  models:
    anthropic/claude-sonnet-4.6: 90
    openai/gpt-5.2: 70
```
See Thresholds & Tolerance for details.
## Tips
- Start with one model, add more once your tasks are stable
- Use `cliwatch validate` to check your config before running
- Use `--dry-run` to test your prompt without an API key
- Different models may need different difficulty calibrations
- Compare models on the dashboard at app.cliwatch.com
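A typical pre-run check based on the tips above might look like this. Note that `cliwatch validate` and `--dry-run` come from this page, but attaching `--dry-run` to a `run` subcommand is an assumption; check your CLI's help output for the exact invocation.

```shell
# Check cli-bench.yaml for errors before spending API credits
cliwatch validate

# Exercise your tasks without calling the gateway (no API key needed)
# (assumed: --dry-run attaches to the main run command)
cliwatch run --dry-run
```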