
Writing Effective Tasks

Each benchmark consists of tasks — scenarios where an LLM is given an intent and your CLI's help text, then evaluated on whether it used the CLI correctly.

Writing Good Intents

Intents should be specific, goal-oriented, and realistic:

Good intents:

  • "List all running Docker containers and show their ports"
  • "Create a new Git branch called 'feature/auth' from main"
  • "Compress the file 'data.csv' using gzip with maximum compression"

Bad intents:

  • "Use the CLI" (too vague)
  • "Run docker ps -a --format '{{.Names}}'" (gives away the answer)
  • "Do something with files" (not goal-oriented)

The intent should describe what the user wants, not how to do it.
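For reference, here is how one of those good intents could sit inside a task definition. This is only a sketch: it reuses the id, intent, difficulty, and assert fields from the examples further down this page, and the assertions are illustrative placeholders.

# Sketch only: a task whose intent states the goal, not the command.
- id: gzip-max-compression
  intent: "Compress the file 'data.csv' using gzip with maximum compression"
  difficulty: easy
  assert:
    - exit_code: 0                  # the run should finish without an error
    - file_exists: "data.csv.gz"    # illustrative: gzip's default output name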

Difficulty Guidelines

  • easy: Single command, common flags, straightforward output
    • Example: "Show the version of the CLI"
    • Expected: 1-2 turns, basic assertions
  • medium: Multiple flags, specific output formats, common workflows
    • Example: "List files modified in the last 24 hours, sorted by size"
    • Expected: 2-4 turns, multiple assertions
  • hard: Multi-step workflows, error recovery, complex flag combinations
    • Example: "Set up a multi-stage Docker build with a slim production image"
    • Expected: 3-5+ turns, file and output assertions
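These labels go straight into the task definition. As a rough sketch, the easy and hard examples above might be written like this (the assertions are illustrative, not prescriptive):

# Easy: a single command with a basic assertion.
- id: show-version
  intent: "Show the version of the CLI"
  difficulty: easy
  assert:
    - exit_code: 0

# Hard: a multi-step workflow with a file assertion.
- id: slim-docker-build
  intent: "Set up a multi-stage Docker build with a slim production image"
  difficulty: hard
  assert:
    - exit_code: 0
    - file_exists: "Dockerfile"     # illustrative check that a Dockerfile was written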

Testing Strategy

Build your benchmark suite in layers:

1. Discovery (easy)

Can the LLM find and understand your CLI's help?

- id: show-help
  intent: "Show the help information for mycli"
  difficulty: easy
  assert:
    - exit_code: 0
    - output_contains: "usage"

2. Core Workflows (medium)

Can the LLM perform your CLI's primary use cases?

- id: create-project
  intent: "Create a new project called 'my-app' in the current directory"
  difficulty: medium
  assert:
    - exit_code: 0
    - file_exists: "my-app/package.json"

3. Flag Combinations (medium)

Can the LLM use multiple flags correctly?

- id: list-with-filters
  intent: "List all items with status 'active', sorted by date, in JSON format"
  difficulty: medium
  assert:
    - exit_code: 0
    - output_contains: '"status": "active"'

4. Error Handling (medium-hard)

Does the LLM recover from errors gracefully?

- id: handle-missing-file
  intent: "Try to read config from 'nonexistent.yaml' and handle the error"
  difficulty: medium
  assert:
    - error_contains: "not found"

5. Advanced Workflows (hard)

Can the LLM chain commands and handle complex scenarios?

- id: full-workflow
  intent: "Initialize a project, add two dependencies, and verify they're installed"
  difficulty: hard
  setup:
    - "mkdir test-project && cd test-project"
  assert:
    - exit_code: 0
    - file_contains:
        path: "test-project/package.json"
        text: "dependencies"

Tips

  • Start with 3-5 easy tasks to validate your setup
  • Use cliwatch validate to check your config before running
  • Group related tasks with category for better reporting
  • Use repeat on flaky tasks to get more reliable pass rates (both options are sketched after this list)
  • See the full Assertions Reference for all 10 assertion types
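Putting a couple of those tips together, a grouped, repeated task might look roughly like this. Treat it as a sketch: category and repeat are the option names mentioned above, but their exact placement and values are assumptions to verify with cliwatch validate.

# Sketch: grouping a task and repeating it to smooth out flaky results.
- id: list-active-items
  intent: "List all items with status 'active', sorted by date, in JSON format"
  difficulty: medium
  category: listing    # assumed placement: groups related tasks in reports
  repeat: 3            # assumed placement: reruns the task for a steadier pass rate
  assert:
    - exit_code: 0
    - output_contains: '"status": "active"'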