Reading the Dashboard

After uploading benchmark results, the CLIWatch dashboard helps you understand how well AI agents can use your CLI. This guide walks through each view.

Projects Overview

The main page at app.cliwatch.com shows all CLI profiles in your workspace as a searchable card grid.

Each project card displays:

  • CLI name and logo
  • Latest pass rate with a color-coded progress bar (green at 90%+, amber at 70-89%, red below 70%)
  • Repository link (auto-populated from CI uploads, editable in settings)
  • Run count and last run timestamp

Use the search bar to filter projects by name. Stats above the grid show totals across all CLIs.

Grades

Grade   Pass Rate   Label
A       >= 90%      Agent Ready
B       75-89%      Almost There
C       50-74%      Room to Grow
D       < 50%       Early Stage

Click any card to open its dashboard.

CLI Dashboard

The CLI dashboard is the primary view for a single CLI. It combines trends, model comparison, and recent activity into one page.

Scope Toggle

At the top, switch between scopes:

  • Releases (default): show data from release runs only
  • All: include PR, CI, and local runs
  • PR #N: filter to a specific pull request (searchable dropdown)

Stat Tabs + Trend Chart

Three clickable metric tabs sit above a trend chart:

  • Pass Rate: percentage of tasks passed, with delta vs. the previous release (green for improvement, red for regression)
  • Avg Turns: average LLM interaction rounds per task
  • Avg Tokens: average token usage per task, with input/output breakdown

Click a tab to switch the trend chart below it. The chart shows the selected metric over time, with interactive hover for individual runs.

Models

A horizontal bar chart comparing pass rates across models. If two or more releases exist, a Compare link opens the comparison view.

Releases

The 8 most recent releases with version labels and pass rate bars.

Stability + Changes

Two cards summarizing task behavior:

  • Stability: counts of stable, always-failing, and flaky tasks
  • Since [previous version]: regressions (red) and improvements (green) with affected task IDs

Activity Log

A filterable table of all runs for this CLI. Filter by type: All, Releases, PRs, CI, or Local.

Each row shows:

  • Run type badge: Release (blue), PR (purple), CI (gray), Local (orange)
  • Run number: per-CLI sequential number (Run #1, #2, #3, etc.)
  • Version and git ref
  • Timestamp
  • Link to view the full run detail

Run Switcher

A dropdown in the breadcrumb navigation lets you jump directly to any run. It is searchable by run number or CLI version.

Run Detail

Click a run from the activity log (or use the run switcher) to see its full results.

Metadata

The header shows:

  • Per-CLI run number and CLI version
  • Git commit SHA (linked to repository if available)
  • Branch name and PR number
  • CI build log link (if run in CI)
  • Timestamp

Results Matrix

A task-by-model grid. Each cell shows:

  • Pass/fail indicator (green check or red X)
  • Turns used
  • Tokens used

For tasks with repeat: N, cells show a fraction (e.g., 3/4 passed).
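The repeat count lives in the task definition itself. A minimal sketch, assuming field names along the lines this guide mentions (intent, repeat, max_turns); check your actual task schema:

```yaml
# Hypothetical task definition -- field names are illustrative
id: init-project
intent: "Initialize a new project in an empty directory"
repeat: 4        # run this task 4 times per model; the matrix cell shows e.g. 3/4 passed
max_turns: 10
```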

Reading patterns:

  • A task that fails on all models: the intent or assertions may need adjustment
  • A task that fails on one model: that model struggles with this specific pattern
  • Consistently high turns: the CLI's commands may be hard to discover, or the task intent is ambiguous

Controls:

  • Sort toggle: "Failures first" or "A-Z"
  • Category filter: filter by task category
  • Search: find tasks by ID

Click any cell to open the Task Detail Dialog.

Task Detail Dialog

Opened by clicking any cell in the results matrix, this dialog is the primary tool for debugging failures.

The dialog has a two-column layout:

Left Panel: Assertion Results

Each assertion shows whether it passed:

  • Passed: green check with assertion summary
  • Failed: red X with the assertion type, expected value, actual value, and error message

For example, a failed ran assertion shows the regex pattern it expected and the list of commands that were actually executed.
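As a sketch, such an assertion might be declared like this in the task's YAML (the exact schema is assumed, not documented here):

```yaml
# Hypothetical assertion block -- schema assumed
assertions:
  - ran: "git commit( -m .+)?"   # regex matched against the commands the agent executed
```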

Right Panel: Conversation Trace

The step-by-step interaction between the LLM and the tool:

  1. Model thinking: what the LLM decided to do (labeled "Step 1", "Step 2", etc.)
  2. Tool call: the command the LLM chose to run (shown with arguments)
  3. Tool result: stdout, stderr, and exit code from the command

Walk through the trace to identify where the agent went wrong:

  • Wrong command? The LLM misunderstood the task; improve your intent
  • Wrong flags? The LLM knows the command but not the options; improve --help output
  • Ran out of turns? Increase max_turns for this task
  • Infrastructure error? A dependency was missing; add it to setup
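The last two fixes above are changes to the task definition. A hedged sketch, assuming the max_turns and setup fields this guide refers to:

```yaml
# Hypothetical task tweaks -- field names assumed
max_turns: 15              # raise the turn budget for a task that ran out of turns
setup:
  - apt-get install -y jq  # install a dependency the task environment was missing
```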

See the Debugging Benchmarks guide for detailed strategies.

Top Controls

Above the two panels:

  • Model selector: switch between models to see how each performed on this task
  • Stats row: turns used, tokens used, and assertions passed/total
  • Task definition: expandable YAML showing the task's intent, assertions, and setup

Repeat Navigation

If a task was run multiple times (via repeat: N), a repeat selector appears below the model buttons. Each repeat is numbered (#0, #1, #2, ...) with a green or red ring indicating pass/fail. Click a repeat to see its individual assertions and conversation trace.

You can share a direct link to a specific task + model combination using the Copy link button in the dialog header.

Comparing Runs

The Compare view shows how task results changed across multiple runs for a single model. Use it to track regressions and improvements over time.

Opening Compare

Two ways to get there:

  • From Dashboard: click the Compare link in the Models section
  • From URL: navigate directly with query params (?runs=...&model=...)

Selecting Runs and Model

At the top of the Compare page:

  • Model dropdown: choose which model to compare (one model at a time). Pass rates update in the run selector to reflect the chosen model.
  • Runs dropdown: check/uncheck runs to include. Up to 5 runs can be compared. Each run shows its version, date, git SHA, and pass rate for the selected model.

Selections are reflected in the URL query string, so you can bookmark or share compare links.

Reading the Comparison Table

The table has one row per task and one column per selected run:

  • Pass/fail: green check or red X. For repeated tasks, shows a fraction (e.g., 3/5).
  • Metrics: turns and tokens per cell
  • Missing: a dash if the task did not exist in that run

Regressions and Improvements

Summary badges at the top show counts:

  • Regressions (red): tasks that passed in any earlier run but fail in the latest
  • Improvements (green): tasks that failed in all earlier runs but now pass
  • Unchanged: tasks with consistent behavior

Regression rows are highlighted with a red background; improvement rows with green.

Public Dashboard

Share your benchmark results publicly to show your CLI's agent-readiness.

Public Directory Toggle

On the Projects page, each CLI profile has a Public button (globe icon). This controls whether the CLI appears on the CLIWatch public directory:

  • Green: visible on the public directory. Anyone can view this CLI's benchmark results without logging in.
  • Gray: hidden from the public directory (default).

When enabled, your CLI's results are accessible at cliwatch.com/public/<workspace-slug>/bench-runs/<cli-name>. The public view is read-only and shows the same data as your private dashboard.

caution

All benchmark data becomes publicly visible when enabled, including task intents, assertion details, and conversation traces. Make sure your tasks and setup commands do not contain sensitive information.

Badges

Go to Settings > Badges to create live SVG badge shields for your README:

  1. Click Create Badge and select a CLI profile from the dropdown
  2. A unique badge token is generated automatically
  3. Copy the generated Markdown embed code
  4. Paste it into your README

Badges update automatically with each benchmark run, showing the current pass rate and grade.

You can enable/disable individual badges without deleting them. Disabled badges show a neutral state. If you delete a badge, the URL stops resolving, so update any READMEs that embed it.