
Writing Effective Tasks

Each benchmark consists of tasks — scenarios where an LLM is given an intent and your CLI's help text, then evaluated on whether it used the CLI correctly.

Writing Good Intents

Intents should be specific, goal-oriented, and realistic:

Good intents:

  • "List all running Docker containers and show their ports"
  • "Create a new Git branch called 'feature/auth' from main"
  • "Compress the file 'data.csv' using gzip with maximum compression"

Bad intents:

  • "Use the CLI" (too vague)
  • "Run docker ps -a --format '{{.Names}}'" (gives away the answer)
  • "Do something with files" (not goal-oriented)

The intent should describe what the user wants, not how to do it.
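For reference, here is how one of those good intents could sit inside a task definition. This is only a sketch: it reuses the id, intent, difficulty, and assert fields from the examples further down this page, and the assertions are illustrative placeholders.

# Sketch only: a task whose intent states the goal, not the command.
- id: gzip-max-compression
  intent: "Compress the file 'data.csv' using gzip with maximum compression"
  difficulty: easy
  assert:
    - exit_code: 0                  # the run should finish without an error
    - file_exists: "data.csv.gz"    # illustrative: gzip's default output name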

Difficulty Guidelines

  • easy: Single command, common flags, straightforward output
    • Example: "Show the version of the CLI"
    • Expected: 1-2 turns, basic assertions
  • medium: Multiple flags, specific output formats, common workflows
    • Example: "List files modified in the last 24 hours, sorted by size"
    • Expected: 2-4 turns, multiple assertions
  • hard: Multi-step workflows, error recovery, complex flag combinations
    • Example: "Set up a multi-stage Docker build with a slim production image"
    • Expected: 3-5+ turns, file and output assertions
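These labels go straight into the task definition. As a rough sketch, the easy and hard examples above might be written like this (the assertions are illustrative, not prescriptive):

# Easy: a single command with a basic assertion.
- id: show-version
  intent: "Show the version of the CLI"
  difficulty: easy
  assert:
    - exit_code: 0

# Hard: a multi-step workflow with a file assertion.
- id: slim-docker-build
  intent: "Set up a multi-stage Docker build with a slim production image"
  difficulty: hard
  assert:
    - exit_code: 0
    - file_exists: "Dockerfile"     # illustrative check that a Dockerfile was written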

Testing Strategy

Build your benchmark suite in layers:

1. Discovery (easy)

Can the LLM find and understand your CLI's help?

- id: show-help
  intent: "Show the help information for mycli"
  difficulty: easy
  assert:
    - exit_code: 0
    - output_contains: "usage"

2. Core Workflows (medium)

Can the LLM perform your CLI's primary use cases?

- id: create-project
  intent: "Create a new project called 'my-app' in the current directory"
  difficulty: medium
  assert:
    - exit_code: 0
    - file_exists: "my-app/package.json"

3. Flag Combinations (medium)

Can the LLM use multiple flags correctly?

- id: list-with-filters
  intent: "List all items with status 'active', sorted by date, in JSON format"
  difficulty: medium
  assert:
    - exit_code: 0
    - output_contains: '"status": "active"'

4. Error Handling (medium-hard)

Does the LLM recover from errors gracefully?

- id: handle-missing-file
  intent: "Try to read config from 'nonexistent.yaml' and handle the error"
  difficulty: medium
  assert:
    - error_contains: "not found"

5. Advanced Workflows (hard)

Can the LLM chain commands and handle complex scenarios?

- id: full-workflow
  intent: "Initialize a project, add two dependencies, and verify they're installed"
  difficulty: hard
  setup:
    - "mkdir test-project && cd test-project"
  assert:
    - exit_code: 0
    - file_contains:
        path: "test-project/package.json"
        text: "dependencies"

Tips

  • Start with 3-5 easy tasks to validate your setup
  • Use cliwatch validate to check your config before running
  • Group related tasks with category for better reporting
  • Use repeat on flaky tasks to get more reliable pass rates (both options are sketched after this list)
  • See the full Assertions Reference for all 10 assertion types
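Putting a couple of those tips together, a grouped, repeated task might look roughly like this. Treat it as a sketch: category and repeat are the option names mentioned above, but their exact placement and values are assumptions to verify with cliwatch validate.

# Sketch: grouping a task and repeating it to smooth out flaky results.
- id: list-active-items
  intent: "List all items with status 'active', sorted by date, in JSON format"
  difficulty: medium
  category: listing    # assumed placement: groups related tasks in reports
  repeat: 3            # assumed placement: reruns the task for a steadier pass rate
  assert:
    - exit_code: 0
    - output_contains: '"status": "active"'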