# Debugging Benchmarks
When benchmarks fail (in CI or locally), this guide helps you find and fix the problem.
## Reproducing CI Failures Locally

- Open the failing run in the dashboard and note the model and the failing task ID
- Set your environment variables:

  ```bash
  export AI_GATEWAY_API_KEY="vck_..."
  export CLIWATCH_API_KEY="cw_..."
  ```

- Run the specific failing combination:

  ```bash
  cli-bench --models anthropic/claude-sonnet-4.6 --filter failing-task-id
  ```

Using `--filter` and `--models` narrows the run to exactly what failed, saving time and API credits.
## Inspecting Prompts with --dry-run

The `--dry-run` flag prints the exact prompt that would be sent to the LLM, without making any API calls or requiring an `AI_GATEWAY_API_KEY`.

```bash
cli-bench --dry-run
```
The output shows the system message and user message for the first task. Check that:

- The CLI name is correct
- Your `system_prompt` content appears (if configured)
- The task intent is clear and specific
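If your `system_prompt` content is missing from the dry-run output, check how it is declared. A minimal sketch — the top-level key placement and the CLI name `mycli` are illustrative assumptions, so adjust to your actual config layout:

```yaml
# Hypothetical config sketch: a system_prompt declared alongside the tasks.
system_prompt: |
  You are using mycli, a project scaffolding tool.
  Prefer `mycli create <name>` over interactive prompts.

tasks:
  - id: create-project
    intent: "Create a project with the default template"
```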
## Reading Assertion Failures

Each assertion type produces a different failure message. Here's what to look for:

| Assertion | Failure shows | Common fix |
|---|---|---|
| `exit_code` | Expected vs. actual exit code (e.g., "expected 0, got 1") | Check if the CLI actually supports the command. Add setup to install dependencies. |
| `output_contains` | Expected substring vs. actual stdout | Broaden the match string. Check if the output format changed. |
| `output_equals` | Expected vs. actual full output | Use `output_contains` instead; exact matches are brittle. |
| `error_contains` | Expected substring vs. actual stderr | Verify the error message format hasn't changed. |
| `file_exists` | "not found" vs. expected path | Check the working directory. The path is relative to the task workdir. |
| `file_contains` | Expected text vs. actual file contents | Broaden the match. Check if the file format changed. |
| `ran` | Regex pattern vs. list of commands actually run | Broaden the regex to allow flag reordering (e.g., `git commit.*-m`, not `git commit -m`). |
| `not_ran` | Regex pattern that matched a command that shouldn't have been run | Tighten the regex to avoid false positives. |
| `run_count` | Actual count vs. expected range (e.g., "2 not in 3..*") | Adjust min/max, or investigate why the command ran fewer or more times. |
| `verify` | Verification command output vs. expected | Run the verify command manually to check what it produces. |
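For reference, here is a sketch of a task that exercises several of these assertion types together; the task ID and commands are illustrative, not a prescribed schema:

```yaml
tasks:
  - id: init-repo
    intent: "Initialize a git repository and make an initial commit"
    assert:
      - exit_code: 0
      - ran: "git init"
      - ran: "git commit.*-m"   # broad regex tolerates flag reordering
      - not_ran: "git push"     # the task should stay local
      - file_exists: ".git/HEAD"
```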
## Using the Task Detail Dialog
The Task Detail Dialog is the primary debugging tool. Open it by clicking any cell in the task × model grid.
### Navigating the Dialog

- Pick a model: buttons across the top show each model with its pass/fail status. Click to switch.
- Pick a repeat: if the task used `repeat: N`, numbered buttons (#0, #1, ...) appear below the model selector. Each has a colored ring (green = passed, red = failed). Click different repeats to compare what went differently.
- Check stats: the stats row shows turns used, tokens consumed, and assertions passed out of the total.
- Read assertions (left panel): see which assertions passed and which failed, with expected vs. actual values.
- Walk the trace (right panel): step through the LLM's reasoning and commands.
- View task YAML: expand the task definition section to see the exact intent and assertions.
Use the Copy link button to share a direct URL to a specific task + model combination.
### Reading Conversation Traces
The right panel shows each step of the LLM's interaction:
- Step N: the LLM's reasoning text (what it decided to do)
- Tool call: the shell command it chose to run, with arguments
- Tool result: stdout, stderr, and exit code
Walk through the trace to identify where things went wrong:
#### Wrong command

The LLM ran a different command than expected. This usually means:

- The intent is ambiguous; rephrase it to be more specific
- The LLM doesn't know your CLI well enough; add a `system_prompt` with usage hints
- The `--help` output doesn't cover this use case; improve your help text
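Tightening an ambiguous intent is often enough on its own. A before/after sketch — the task ID, CLI name, and wording are illustrative:

```yaml
# Before: ambiguous -- the model may pick any tool or template.
- id: create-project
  intent: "Make a new project"

# After: specific -- names the CLI, the goal, and the expected template.
- id: create-project
  intent: "Use mycli to create a new project named demo with the default template"
```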
#### Wrong flags

The LLM found the right command but used the wrong flags. This means:

- Your `--help` output isn't clear about flag names or syntax
- The LLM is confusing your CLI's flags with a similar CLI's; add a `system_prompt` clarifying flag syntax
#### Ran out of turns

The LLM hit `max_turns` before completing the task:

- Increase `max_turns` for complex tasks (the default is 5)
- Check if the LLM is spinning (retrying the same failing command repeatedly)
- Simplify the task by splitting it into smaller steps
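Assuming `max_turns` is set per task, raising the budget for a multi-step task looks something like this (the task itself is illustrative):

```yaml
tasks:
  - id: complex-migration
    intent: "Migrate the config file to the new format and validate it"
    max_turns: 10   # raised from the default of 5 for a multi-step task
    assert:
      - exit_code: 0
```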
#### Infrastructure error

A dependency, server, or environment setup was missing:

- Add required setup to the task's `setup` field
- Check that your CLI is installed in the CI environment
- Verify that environment variables are set
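A sketch of a `setup` field, assuming it accepts a list of shell commands run before the LLM starts (the commands shown are illustrative):

```yaml
tasks:
  - id: build-app
    intent: "Build the app in production mode"
    setup:
      - npm install           # install dependencies ahead of time
      - cp .env.example .env  # provide a known starting environment
    assert:
      - exit_code: 0
```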
### Comparing repeats
If a task is flaky (passes sometimes, fails others), open the Task Detail Dialog and switch between repeats. Look for:
- Different commands: the LLM tried a different approach in different attempts
- Non-deterministic output: the CLI produced different output each time
- Timing issues: a dependency wasn't ready (e.g., server startup race condition)
## Handling Flaky Tasks
Flaky tasks pass sometimes and fail others. The dashboard's "Quick Stats" section identifies them automatically.
### Diagnosing flakiness

Use `repeat: N` to run a task multiple times and get a statistical pass rate:
```yaml
tasks:
  - id: flaky-task
    intent: "Create a project with the default template"
    repeat: 10
    assert:
      - exit_code: 0
      - file_exists: "project/package.json"
```
The dashboard shows aggregate results (e.g., 7/10 passed) and lets you browse individual repeats to see what differs between passing and failing attempts.
### Reducing flakiness

- Broaden regex patterns: `ran: "npm install"` is more stable than `ran: "npm install --save-exact"`
- Increase `max_turns`: give the LLM room to recover from initial mistakes
- Add setup for deterministic state: use `setup` commands to create a known starting environment
- Use `output_contains` over `output_equals`: partial matches tolerate format variations
- Mark informational during development: use `behavior: informational` while iterating, so flaky tasks don't block the quality gate
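To sanity-check how broad or narrow a `ran` regex is before committing it, you can test it against a captured command with `grep -E` (the command string here is illustrative):

```bash
# A command the LLM actually ran, with flags in an unexpected order.
cmd="npm install left-pad --save-exact"

# Narrow pattern: assumes a fixed flag position, so it misses this run.
echo "$cmd" | grep -Eq "npm install --save-exact" \
  && echo "narrow: match" || echo "narrow: no match"

# Broad pattern: matches the command however the flags are ordered.
echo "$cmd" | grep -Eq "npm install" \
  && echo "broad: match" || echo "broad: no match"
```

The narrow pattern reports no match while the broad one matches, which is exactly the failure mode behind many flaky `ran` assertions.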
## Iterating on Tasks

### Start permissive, tighten gradually

- Begin with minimal assertions (just `exit_code: 0` and one `ran`)
- Run across multiple models to see what the LLM actually does
- Add more specific assertions based on observed behavior
- Tighten regex patterns once you're confident in the expected behavior
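As a sketch of the permissive starting point (the task ID and regex are illustrative):

```yaml
# First pass: assert only that something reasonable happened.
tasks:
  - id: create-project
    intent: "Create a project with the default template"
    assert:
      - exit_code: 0
      - ran: "create"   # deliberately broad; tighten after observing real runs
```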
### Narrow scope during development

Use filters to focus on the task you're working on:

```bash
# Run only one task against one model
cli-bench --filter create-project --models anthropic/claude-haiku-4-5-20251001

# Run a subset of tasks
cli-bench --filter create-project,show-help,build-app
```
This is much faster and cheaper than running the full suite.
### Use --dry-run to check prompts

Before spending API credits, verify the prompt looks right:

```bash
cli-bench --dry-run --filter create-project
```
### Check the dashboard after each iteration
Upload results and check the Task Detail Dialog to see:
- Which assertions passed and which failed
- The exact commands the LLM ran
- Where the conversation went wrong