Reading the Dashboard
After uploading benchmark results, the CLIWatch dashboard helps you understand how well AI agents can use your CLI. This guide walks through each view.
Projects Overview
The main page at app.cliwatch.com shows all CLI profiles in your workspace as a searchable card grid.
Each project card displays:
- CLI name and logo
- Latest pass rate with a color-coded progress bar (green for 90% and above, amber for 70-89%, red below 70%)
- Repository link (auto-populated from CI uploads, editable in settings)
- Run count and last run timestamp
Use the search bar to filter projects by name. Stats above the grid show totals across all CLIs.
Grades
| Grade | Pass Rate | Label |
|---|---|---|
| A | >= 90% | Agent Ready |
| B | 75-89% | Almost There |
| C | 50-74% | Room to Grow |
| D | < 50% | Early Stage |
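As a sketch, the grade thresholds above can be expressed like this (illustrative code, not the actual CLIWatch implementation):

```python
def grade(pass_rate):
    """Map a pass rate (0-100) to a grade per the table above."""
    if pass_rate >= 90:
        return ("A", "Agent Ready")
    if pass_rate >= 75:
        return ("B", "Almost There")
    if pass_rate >= 50:
        return ("C", "Room to Grow")
    return ("D", "Early Stage")
```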
Click any card to open its dashboard.
CLI Dashboard
The CLI dashboard is the primary view for a single CLI. It combines trends, model comparison, and recent activity into one page.
Scope Toggle
At the top, switch between scopes:
- Releases (default): show data from release runs only
- All: include PR, CI, and local runs
- PR #N: filter to a specific pull request (searchable dropdown)
Stat Tabs + Trend Chart
Three clickable metric tabs sit above a trend chart:
- Pass Rate: percentage of tasks passed, with delta vs. the previous release (green for improvement, red for regression)
- Avg Turns: average LLM interaction rounds per task
- Avg Tokens: average token usage per task, with input/output breakdown
Click a tab to switch the trend chart below it. The chart shows the selected metric over time, with interactive hover for individual runs.
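The three tab metrics are straightforward aggregates over a run's tasks. A minimal sketch, assuming per-task records with hypothetical `passed`, `turns`, and `tokens` fields (not the actual data schema):

```python
def summarize_run(tasks):
    """Compute the three stat-tab metrics for one run.

    `tasks` is an illustrative list of dicts like
    {"passed": bool, "turns": int, "tokens": int}.
    """
    n = len(tasks)
    return {
        "pass_rate": 100.0 * sum(t["passed"] for t in tasks) / n,
        "avg_turns": sum(t["turns"] for t in tasks) / n,
        "avg_tokens": sum(t["tokens"] for t in tasks) / n,
    }
```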
Models
A horizontal bar chart comparing pass rates across models. If two or more releases exist, a Compare link opens the comparison view.
Releases
The 8 most recent releases with version labels and pass rate bars.
Stability + Changes
Two cards summarizing task behavior:
- Stability: counts of stable, always-failing, and flaky tasks
- Since [previous version]: regressions (red) and improvements (green) with affected task IDs
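The stability buckets can be sketched as follows, assuming "stable" means a task passed in every recent run, "always-failing" means it never passed, and "flaky" means mixed results (an assumed reading of the card, not the documented algorithm):

```python
def stability(outcomes):
    """Classify a task from its pass/fail history across recent runs.

    `outcomes` is a non-empty list of booleans, one per run.
    Assumed definitions: all pass -> stable, all fail -> always-failing,
    mixed -> flaky.
    """
    if all(outcomes):
        return "stable"
    if not any(outcomes):
        return "always-failing"
    return "flaky"
```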
Activity Log
A filterable table of all runs for this CLI. Filter by type: All, Releases, PRs, CI, or Local.
Each row shows:
- Run type badge: Release (blue), PR (purple), CI (gray), Local (orange)
- Run number: per-CLI sequential number (Run #1, #2, #3, etc.)
- Version and git ref
- Timestamp
- Link to view the full run detail
Run Switcher
A dropdown in the breadcrumb navigation lets you jump directly to any run. It is searchable by run number or CLI version.
Run Detail
Click a run from the activity log (or use the run switcher) to see its full results.
Metadata
The header shows:
- Per-CLI run number and CLI version
- Git commit SHA (linked to repository if available)
- Branch name and PR number
- CI build log link (if run in CI)
- Timestamp
Results Matrix
A task-by-model grid. Each cell shows:
- Pass/fail indicator (green check or red X)
- Turns used
- Tokens used
For tasks with `repeat: N`, cells show a fraction (e.g., 3/4 passed).
Reading patterns:
- A task that fails on all models: the intent or assertions may need adjustment
- A task that fails on one model: that model struggles with this specific pattern
- Consistently high turns: the CLI may be hard to discover or the intent is ambiguous
Controls:
- Sort toggle: "Failures first" or "A-Z"
- Category filter: filter by task category
- Search: find tasks by ID
Click any cell to open the Task Detail Dialog.
Task Detail Dialog
Click any cell in the results matrix to open the detail dialog. This is the primary tool for debugging failures.
The dialog has a two-column layout:
Left Panel: Assertion Results
Each assertion shows whether it passed:
- Passed: green check with assertion summary
- Failed: red X with the assertion type, expected value, actual value, and error message
For example, a failed `ran` assertion shows the regex pattern it expected and the list of commands that were actually executed.
Right Panel: Conversation Trace
The step-by-step interaction between the LLM and the tool:
- Model thinking: what the LLM decided to do (labeled "Step 1", "Step 2", etc.)
- Tool call: the command the LLM chose to run (shown with arguments)
- Tool result: stdout, stderr, and exit code from the command
Walk through the trace to identify where the agent went wrong:
- Wrong command? The LLM misunderstood the task; improve your intent
- Wrong flags? The LLM knows the command but not the options; improve your `--help` output
- Ran out of turns? Increase `max_turns` for this task
- Infrastructure error? A dependency was missing; add it to `setup`
See the Debugging Benchmarks guide for detailed strategies.
Top Controls
Above the two panels:
- Model selector: switch between models to see how each performed on this task
- Stats row: turns used, tokens used, and assertions passed/total
- Task definition: expandable YAML showing the task's intent, assertions, and setup
Repeat Navigation
If a task was run multiple times (via `repeat: N`), a repeat selector appears below the model buttons. Each repeat is numbered (#0, #1, #2, ...) with a green or red ring indicating pass/fail. Click a repeat to see its individual assertions and conversation trace.
You can share a direct link to a specific task + model combination using the Copy link button in the dialog header.
Comparing Runs
The Compare view shows how task results changed across multiple runs for a single model. Use it to track regressions and improvements over time.
Opening Compare
Two ways to get there:
- From Dashboard: click the Compare link in the Models section
- From URL: navigate directly with query params (`?runs=...&model=...`)
Selecting Runs and Model
At the top of the Compare page:
- Model dropdown: choose which model to compare (one model at a time). Pass rates update in the run selector to reflect the chosen model.
- Runs dropdown: check/uncheck runs to include. Up to 5 runs can be compared. Each run shows its version, date, git SHA, and pass rate for the selected model.
Selections are reflected in the URL query string, so you can bookmark or share compare links.
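As a sketch, a shareable compare link can be assembled with standard URL encoding. The `runs` and `model` query keys come from the Compare view's URL params above; the base path, run IDs, and comma-separated encoding of the `runs` value are assumptions for illustration:

```python
from urllib.parse import urlencode

def compare_link(base, run_ids, model):
    """Build a bookmarkable compare URL (illustrative sketch).

    `base` and the comma-joined `runs` format are hypothetical;
    the query keys `runs` and `model` match the documented params.
    """
    query = urlencode({"runs": ",".join(run_ids), "model": model})
    return f"{base}?{query}"
```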
Reading the Comparison Table
The table has one row per task and one column per selected run:
- Pass/fail: green check or red X. For repeated tasks, shows a fraction (e.g., 3/5).
- Metrics: turns and tokens per cell
- Missing: a dash if the task did not exist in that run
Regressions and Improvements
Summary badges at the top show counts:
- Regressions (red): tasks that passed in any earlier run but fail in the latest
- Improvements (green): tasks that failed in all earlier runs but now pass
- Unchanged: tasks with consistent behavior
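The three buckets above can be sketched directly from a task's chronological pass/fail history (the list shape is illustrative, not the actual data model):

```python
def classify(history):
    """Classify one task across compared runs.

    `history` is a chronological list of booleans (pass/fail),
    with the last element being the latest run.
    """
    earlier, latest = history[:-1], history[-1]
    if not latest and any(earlier):
        return "regression"   # passed in any earlier run, fails in the latest
    if latest and earlier and not any(earlier):
        return "improvement"  # failed in all earlier runs, passes now
    return "unchanged"
```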
Regression rows are highlighted with a red background; improvement rows with green.
Public Dashboard
Share your benchmark results publicly to show your CLI's agent-readiness.
Public Directory Toggle
On the Projects page, each CLI profile has a Public button (globe icon). This controls whether the CLI appears on the CLIWatch public directory:
- Green: visible on the public directory. Anyone can view this CLI's benchmark results without logging in.
- Gray: hidden from the public directory (default).
When enabled, your CLI's results are accessible at cliwatch.com/public/<workspace-slug>/bench-runs/<cli-name>. The public view is read-only and shows the same data as your private dashboard.
All benchmark data becomes publicly visible when enabled, including task intents, assertion details, and conversation traces. Make sure your tasks and setup commands do not contain sensitive information.
Badges
Go to Settings > Badges to create live SVG badge shields for your README:
- Click Create Badge and select a CLI profile from the dropdown
- A unique badge token is generated automatically
- Copy the generated Markdown embed code
- Paste it into your README
Badges update automatically with each benchmark run, showing the current pass rate and grade.
You can enable/disable individual badges without deleting them. Disabled badges show a neutral state. If you delete a badge, the URL stops resolving, so update any READMEs that embed it.