Debugging Benchmarks

When benchmarks fail (in CI or locally), this guide helps you find and fix the problem.

Reproducing CI Failures Locally

  1. Open the failing run in the dashboard and note the failing model and task ID
  2. Set your environment variables:
    export AI_GATEWAY_API_KEY="vck_..."
    export CLIWATCH_API_KEY="cw_..."
  3. Run the specific failing combination:
    cli-bench --models anthropic/claude-sonnet-4.6 --filter failing-task-id

Using --filter and --models narrows the run to exactly what failed, saving time and API credits.

Inspecting Prompts with --dry-run

The --dry-run flag prints the exact prompt that would be sent to the LLM, without making any API calls or requiring an AI_GATEWAY_API_KEY.

cli-bench --dry-run

The output shows the system message and user message for the first task. Check that:

  • The CLI name is correct
  • Your system_prompt content appears (if configured)
  • The task intent is clear and specific

Reading Assertion Failures

Each assertion type produces a different failure message. Here's what to look for:

| Assertion | Failure shows | Common fix |
|---|---|---|
| exit_code | Expected vs. actual exit code (e.g., "expected 0, got 1") | Check if the CLI actually supports the command. Add setup to install dependencies. |
| output_contains | Expected substring vs. actual stdout | Broaden the match string. Check if the output format changed. |
| output_equals | Expected vs. actual full output | Use output_contains instead; exact matches are brittle. |
| error_contains | Expected substring vs. actual stderr | Verify the error message format hasn't changed. |
| file_exists | "not found" vs. expected path | Check the working directory. The path is relative to the task workdir. |
| file_contains | Expected text vs. actual file contents | Broaden the match. Check if the file format changed. |
| ran | Regex pattern vs. list of commands actually run | Broaden the regex to allow flag reordering (e.g., git commit.*-m rather than git commit -m). |
| not_ran | Regex pattern matched a command that shouldn't have been run | Tighten the regex to avoid false positives. |
| run_count | Actual count vs. expected range (e.g., "2 not in 3..*") | Adjust min/max or investigate why the command ran fewer or more times. |
| verify | Verification command output vs. expected | Run the verify command manually to check what it produces. |
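A single task can combine several of these assertion types. A hedged sketch using the fields described above (the CLI name and paths are illustrative, not from this guide):

```yaml
# Sketch of a task combining several assertion types; adjust field
# values to your suite's actual schema.
tasks:
  - id: init-project
    intent: "Initialize a new project named demo"
    assert:
      - exit_code: 0
      - output_contains: "created"       # broad substring, not output_equals
      - file_exists: "demo/config.json"  # relative to the task workdir
      - ran: "mycli init.*demo"          # regex tolerant of flag ordering
```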

Using the Task Detail Dialog

The Task Detail Dialog is the primary debugging tool. Open it by clicking any cell in the task × model grid.

  1. Pick a model: buttons across the top show each model with its pass/fail status. Click to switch.
  2. Pick a repeat: if the task used repeat: N, numbered buttons (#0, #1, ...) appear below the model selector. Each has a colored ring (green = passed, red = failed). Click different repeats to compare what went differently.
  3. Check stats: the stats row shows turns used, tokens consumed, and assertions passed out of total.
  4. Read assertions (left panel): see which assertions passed and which failed, with expected vs. actual values.
  5. Walk the trace (right panel): step through the LLM's reasoning and commands.
  6. View task YAML: expand the task definition section to see the exact intent and assertions.

Use the Copy link button to share a direct URL to a specific task + model combination.

Reading Conversation Traces

The right panel shows each step of the LLM's interaction:

  • Step N: the LLM's reasoning text (what it decided to do)
  • Tool call: the shell command it chose to run, with arguments
  • Tool result: stdout, stderr, and exit code

Walk through the trace to identify where things went wrong:

Wrong command

The LLM ran a different command than expected. This usually means:

  • The intent is ambiguous; rephrase to be more specific
  • The LLM doesn't know your CLI well enough; add a system_prompt with usage hints
  • The --help output doesn't cover this use case; improve your help text
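As an illustration (a hypothetical task, not one from this guide), compare an ambiguous intent with a more specific rewrite:

```yaml
# Before: ambiguous -- the LLM may build, test, or lint.
intent: "Check the project"

# After: specific -- names the action and the expected outcome.
intent: "Run the project's test suite and report whether all tests pass"
```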

Wrong flags

The LLM found the right command but used wrong flags. This means:

  • Your --help output isn't clear about flag names or syntax
  • The LLM is confusing your CLI's flags with a similar CLI; add a system_prompt clarifying flag syntax
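A minimal sketch of a system_prompt that clarifies flag syntax, assuming the system_prompt field accepts free-form text (the CLI name and flags are illustrative):

```yaml
system_prompt: |
  mycli uses long flags only: use --output <dir>, not -o.
  Template selection is --template <name>; there is no -t shorthand.
```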

Ran out of turns

The LLM hit max_turns before completing the task:

  • Increase max_turns for complex tasks (default is 5)
  • Check if the LLM is spinning (retrying the same failing command repeatedly)
  • Simplify the task by splitting into smaller steps
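For example, a multi-step task can be given more room than the default of 5 turns (a sketch; the task and path are illustrative):

```yaml
tasks:
  - id: build-and-verify
    intent: "Build the app and confirm the bundle exists"
    max_turns: 10
    assert:
      - exit_code: 0
      - file_exists: "dist/app.js"
```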

Infrastructure error

A dependency, server, or environment setup was missing:

  • Add required setup to the task's setup field
  • Check that your CLI is installed in the CI environment
  • Verify environment variables are set
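A task can prepare its own environment through the setup field. A hedged sketch (the setup commands are illustrative, not from this guide):

```yaml
tasks:
  - id: deploy-check
    setup:
      - npm install
      - cp fixtures/config.json ./config.json
    intent: "Show the current deployment status"
    assert:
      - exit_code: 0
```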

Comparing repeats

If a task is flaky (passes sometimes, fails others), open the Task Detail Dialog and switch between repeats. Look for:

  • Different commands: the LLM tried a different approach in different attempts
  • Non-deterministic output: the CLI produced different output each time
  • Timing issues: a dependency wasn't ready (e.g., server startup race condition)

Handling Flaky Tasks

Flaky tasks pass sometimes and fail others. The dashboard's "Quick Stats" section identifies them automatically.

Diagnosing flakiness

Use repeat: N to run a task multiple times and get a statistical pass rate:

tasks:
  - id: flaky-task
    intent: "Create a project with the default template"
    repeat: 10
    assert:
      - exit_code: 0
      - file_exists: "project/package.json"

The dashboard shows aggregate results (e.g., 7/10 passed) and lets you browse individual repeats to see what differs between passing and failing attempts.

Reducing flakiness

  • Broaden regex patterns: ran: "npm install" is more stable than ran: "npm install --save-exact"
  • Increase max_turns: give the LLM room to recover from initial mistakes
  • Add setup for deterministic state: use setup commands to create a known starting environment
  • Use output_contains over output_equals: partial matches tolerate format variations
  • Mark informational during development: use behavior: informational while iterating, so flaky tasks don't block the quality gate
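For instance, a flaky task can report results without blocking the quality gate while you iterate (a sketch using the fields named above):

```yaml
tasks:
  - id: flaky-task
    intent: "Create a project with the default template"
    behavior: informational
    repeat: 10
    assert:
      - exit_code: 0
```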

Iterating on Tasks

Start permissive, tighten gradually

  1. Begin with minimal assertions (just exit_code: 0 and one ran)
  2. Run across multiple models to see what the LLM actually does
  3. Add more specific assertions based on observed behavior
  4. Tighten regex patterns once you're confident in the expected behavior
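The progression from step 1 to step 4 might look like this (a hedged sketch; the CLI name and patterns are illustrative):

```yaml
# Step 1: permissive first pass -- just an exit code and one broad ran.
assert:
  - exit_code: 0
  - ran: "mycli"

# Step 4: tightened after observing what models actually run.
assert:
  - exit_code: 0
  - ran: "mycli create.*--template default"
  - file_exists: "project/package.json"
```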

Narrow scope during development

Use filters to focus on the task you're working on:

# Run only one task against one model
cli-bench --filter create-project --models anthropic/claude-haiku-4-5-20251001

# Run a subset of tasks
cli-bench --filter create-project,show-help,build-app

This is much faster and cheaper than running the full suite.

Use --dry-run to check prompts

Before spending API credits, verify the prompt looks right:

cli-bench --dry-run --filter create-project

Check the dashboard after each iteration

Upload results and check the Task Detail Dialog to see:

  • Which assertions passed and which failed
  • The exact commands the LLM ran
  • Where the conversation went wrong