Debugging Benchmarks

When benchmarks fail (in CI or locally), this guide helps you find and fix the problem.

Reproducing CI Failures Locally

  1. Open the failing run in the dashboard and note the failing model and task ID
  2. Set your environment variables:
    export AI_GATEWAY_API_KEY="vck_..."
    export CLIWATCH_API_KEY="cw_..."
  3. Run the specific failing combination:
    cli-bench --models anthropic/claude-sonnet-4.6 --filter failing-task-id

Using --filter and --models narrows the run to exactly what failed, saving time and API credits.

Inspecting Prompts with --dry-run

The --dry-run flag prints the exact prompt that would be sent to the LLM, without making any API calls or requiring an AI_GATEWAY_API_KEY.

cli-bench --dry-run

The output shows the system message and user message for the first task. Check that:

  • The CLI name is correct
  • Your system_prompt content appears (if configured)
  • The task intent is clear and specific

Reading Assertion Failures

Each assertion type produces a different failure message. Here's what to look for:

| Assertion | Failure shows | Common fix |
|---|---|---|
| exit_code | Expected vs. actual exit code (e.g., "expected 0, got 1") | Check if the CLI actually supports the command. Add setup to install dependencies. |
| output_contains | Expected substring vs. actual stdout | Broaden the match string. Check if the output format changed. |
| output_equals | Expected vs. actual full output | Use output_contains instead; exact matches are brittle. |
| error_contains | Expected substring vs. actual stderr | Verify the error message format hasn't changed. |
| file_exists | "not found" vs. expected path | Check the working directory. The path is relative to the task workdir. |
| file_contains | Expected text vs. actual file contents | Broaden the match. Check if the file format changed. |
| ran | Regex pattern vs. list of commands actually run | Broaden the regex to allow flag reordering (e.g., git commit.*-m rather than git commit -m). |
| not_ran | Regex pattern matched a command that shouldn't have been run | Tighten the regex to avoid false positives. |
| run_count | Actual count vs. expected range (e.g., "2 not in 3..*") | Adjust min/max or investigate why the command ran fewer or more times. |
| verify | Verification command output vs. expected | Run the verify command manually to check what it produces. |
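A single task can combine several of these assertion types. A hedged sketch using the fields described above (the CLI name and paths are illustrative, not from this guide):

```yaml
# Sketch of a task combining several assertion types; adjust field
# values to your suite's actual schema.
tasks:
  - id: init-project
    intent: "Initialize a new project named demo"
    assert:
      - exit_code: 0
      - output_contains: "created"       # broad substring, not output_equals
      - file_exists: "demo/config.json"  # relative to the task workdir
      - ran: "mycli init.*demo"          # regex tolerant of flag ordering
```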

Using the Task Detail Dialog

The Task Detail Dialog is the primary debugging tool. Open it by clicking any cell in the task × model grid.

  1. Pick a model: buttons across the top show each model with its pass/fail status. Click to switch.
  2. Pick a repeat: if the task used repeat: N, numbered buttons (#0, #1, ...) appear below the model selector. Each has a colored ring (green = passed, red = failed). Click different repeats to compare what went differently.
  3. Check stats: the stats row shows turns used, tokens consumed, and assertions passed out of total.
  4. Read assertions (left panel): see which assertions passed and which failed, with expected vs. actual values.
  5. Walk the trace (right panel): step through the LLM's reasoning and commands.
  6. View task YAML: expand the task definition section to see the exact intent and assertions.

Use the Copy link button to share a direct URL to a specific task + model combination.

Reading Conversation Traces

The right panel shows each step of the LLM's interaction:

  • Step N: the LLM's reasoning text (what it decided to do)
  • Tool call: the shell command it chose to run, with arguments
  • Tool result: stdout, stderr, and exit code

Walk through the trace to identify where things went wrong:

Wrong command

The LLM ran a different command than expected. This usually means:

  • The intent is ambiguous; rephrase to be more specific
  • The LLM doesn't know your CLI well enough; add a system_prompt with usage hints
  • The --help output doesn't cover this use case; improve your help text
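As an illustration (a hypothetical task, not one from this guide), compare an ambiguous intent with a more specific rewrite:

```yaml
# Before: ambiguous -- the LLM may build, test, or lint.
intent: "Check the project"

# After: specific -- names the action and the expected outcome.
intent: "Run the project's test suite and report whether all tests pass"
```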

Wrong flags

The LLM found the right command but used wrong flags. This means:

  • Your --help output isn't clear about flag names or syntax
  • The LLM is confusing your CLI's flags with a similar CLI; add a system_prompt clarifying flag syntax
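A minimal sketch of a system_prompt that clarifies flag syntax, assuming the system_prompt field accepts free-form text (the CLI name and flags are illustrative):

```yaml
system_prompt: |
  mycli uses long flags only: use --output <dir>, not -o.
  Template selection is --template <name>; there is no -t shorthand.
```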

Ran out of turns

The LLM hit max_turns before completing the task:

  • Increase max_turns for complex tasks (default is 5)
  • Check if the LLM is spinning (retrying the same failing command repeatedly)
  • Simplify the task by splitting into smaller steps
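For example, a multi-step task can be given more room than the default of 5 turns (a sketch; the task and path are illustrative):

```yaml
tasks:
  - id: build-and-verify
    intent: "Build the app and confirm the bundle exists"
    max_turns: 10
    assert:
      - exit_code: 0
      - file_exists: "dist/app.js"
```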

Infrastructure error

A dependency, server, or environment setup was missing:

  • Add required setup to the task's setup field
  • Check that your CLI is installed in the CI environment
  • Verify environment variables are set
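A task can prepare its own environment through the setup field. A hedged sketch (the setup commands are illustrative, not from this guide):

```yaml
tasks:
  - id: deploy-check
    setup:
      - npm install
      - cp fixtures/config.json ./config.json
    intent: "Show the current deployment status"
    assert:
      - exit_code: 0
```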

Comparing repeats

If a task is flaky (passes sometimes, fails others), open the Task Detail Dialog and switch between repeats. Look for:

  • Different commands: the LLM tried a different approach in different attempts
  • Non-deterministic output: the CLI produced different output each time
  • Timing issues: a dependency wasn't ready (e.g., server startup race condition)

Handling Flaky Tasks

Flaky tasks pass sometimes and fail others. The dashboard's "Quick Stats" section identifies them automatically.

Diagnosing flakiness

Use repeat: N to run a task multiple times and get a statistical pass rate:

tasks:
  - id: flaky-task
    intent: "Create a project with the default template"
    repeat: 10
    assert:
      - exit_code: 0
      - file_exists: "project/package.json"

The dashboard shows aggregate results (e.g., 7/10 passed) and lets you browse individual repeats to see what differs between passing and failing attempts.

Reducing flakiness

  • Broaden regex patterns: ran: "npm install" is more stable than ran: "npm install --save-exact"
  • Increase max_turns: give the LLM room to recover from initial mistakes
  • Add setup for deterministic state: use setup commands to create a known starting environment
  • Use output_contains over output_equals: partial matches tolerate format variations
  • Mark informational during development: use behavior: informational while iterating, so flaky tasks don't block the quality gate
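For instance, a flaky task can report results without blocking the quality gate while you iterate (a sketch using the fields named above):

```yaml
tasks:
  - id: flaky-task
    intent: "Create a project with the default template"
    behavior: informational
    repeat: 10
    assert:
      - exit_code: 0
```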

Iterating on Tasks

Start permissive, tighten gradually

  1. Begin with minimal assertions (just exit_code: 0 and one ran)
  2. Run across multiple models to see what the LLM actually does
  3. Add more specific assertions based on observed behavior
  4. Tighten regex patterns once you're confident in the expected behavior
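The progression from step 1 to step 4 might look like this (a hedged sketch; the CLI name and patterns are illustrative):

```yaml
# Step 1: permissive first pass -- just an exit code and one broad ran.
assert:
  - exit_code: 0
  - ran: "mycli"

# Step 4: tightened after observing what models actually run.
assert:
  - exit_code: 0
  - ran: "mycli create.*--template default"
  - file_exists: "project/package.json"
```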

Narrow scope during development

Use filters to focus on the task you're working on:

# Run only one task against one model
cli-bench --filter create-project --models anthropic/claude-haiku-4-5-20251001

# Run a subset of tasks
cli-bench --filter create-project,show-help,build-app

This is much faster and cheaper than running the full suite.

Use --dry-run to check prompts

Before spending API credits, verify the prompt looks right:

cli-bench --dry-run --filter create-project

Check the dashboard after each iteration

Upload results and check the Task Detail Dialog to see:

  • Which assertions passed and which failed
  • The exact commands the LLM ran
  • Where the conversation went wrong