Writing Effective Tasks
Each benchmark consists of tasks: scenarios in which an LLM is given an intent and your CLI's help text, then evaluated on whether it used the CLI correctly.
Writing Good Intents
Intents should be specific, goal-oriented, and realistic:
Good intents:
- "List all running Docker containers and show their ports"
- "Create a new Git branch called 'feature/auth' from main"
- "Compress the file 'data.csv' using gzip with maximum compression"
Bad intents:
- "Use the CLI" (too vague)
- "Run docker ps -a --format '{{.Names}}'" (gives away the answer)
- "Do something with files" (not goal-oriented)
The intent should describe what the user wants, not how to do it.
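As a rough illustration of a goal-oriented intent, the sketch below turns the gzip example above into a task. It uses only field names that appear in the examples on this page. The task id, the fixture file contents, and the assumption that setup entries are shell commands run before the LLM's first turn are hypothetical.

```yaml
# Hypothetical task sketch, not taken from a real benchmark suite.
- id: compress-data-csv
  # States the goal and the constraint, but names no command-line flags.
  intent: "Compress the file 'data.csv' using gzip with maximum compression"
  difficulty: easy
  setup:
    # Assumed: creates the input file before the task starts.
    - "echo 'id,value' > data.csv"
  assert:
    - exit_code: 0
    # By default gzip writes data.csv.gz alongside the input file.
    - file_exists: "data.csv.gz"
```

These assertions only confirm that compression happened. They do not verify the compression level, so a stricter task would need an additional check.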
Difficulty Guidelines
- easy: Single command, common flags, straightforward output
  - Example: "Show the version of the CLI"
  - Expected: 1-2 turns, basic assertions
- medium: Multiple flags, specific output formats, common workflows
  - Example: "List files modified in the last 24 hours, sorted by size"
  - Expected: 2-4 turns, multiple assertions
- hard: Multi-step workflows, error recovery, complex flag combinations
  - Example: "Set up a multi-stage Docker build with a slim production image"
  - Expected: 3-5+ turns, file and output assertions
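The three tiers above could be sketched as one task each, reusing the example intents. Only fields shown elsewhere on this page are used. The task ids, the fixture file, and the choice of assertions are assumptions made for illustration.

```yaml
# Sketch only: one hypothetical task per difficulty tier.
- id: show-version
  intent: "Show the version of the CLI"
  difficulty: easy
  assert:
    # Basic assertion: the command simply has to succeed.
    - exit_code: 0

- id: list-recent-by-size
  intent: "List files modified in the last 24 hours, sorted by size"
  difficulty: medium
  setup:
    # Hypothetical fixture so at least one recently modified file exists.
    - "echo 'hello' > recent.txt"
  assert:
    - exit_code: 0
    - output_contains: "recent.txt"

- id: multi-stage-build
  intent: "Set up a multi-stage Docker build with a slim production image"
  difficulty: hard
  assert:
    - exit_code: 0
    - file_exists: "Dockerfile"
    - file_contains:
        path: "Dockerfile"
        # Weak check: any conventional Dockerfile contains FROM, so this
        # confirms a build file was written, not that it is multi-stage.
        text: "FROM"
```

The number and strictness of assertions grows with the tier, which mirrors the "basic", "multiple", and "file and output" expectations listed above.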
Testing Strategy
Build your benchmark suite in layers:
1. Discovery (easy)
Can the LLM find and understand your CLI's help?
- id: show-help
  intent: "Show the help information for mycli"
  difficulty: easy
  assert:
    - exit_code: 0
    - output_contains: "usage"
2. Core Workflows (medium)
Can the LLM perform your CLI's primary use cases?
- id: create-project
  intent: "Create a new project called 'my-app' in the current directory"
  difficulty: medium
  assert:
    - exit_code: 0
    - file_exists: "my-app/package.json"
3. Flag Combinations (medium)
Can the LLM use multiple flags correctly?
- id: list-with-filters
  intent: "List all items with status 'active', sorted by date, in JSON format"
  difficulty: medium
  assert:
    - exit_code: 0
    - output_contains: '"status": "active"'
4. Error Handling (medium-hard)
Does the LLM recover from errors gracefully?
- id: handle-missing-file
  intent: "Try to read config from 'nonexistent.yaml' and handle the error"
  difficulty: medium
  assert:
    - error_contains: "not found"
5. Advanced Workflows (hard)
Can the LLM chain commands and handle complex scenarios?
- id: full-workflow
  intent: "Initialize a project, add two dependencies, and verify they're installed"
  difficulty: hard
  setup:
    - "mkdir test-project && cd test-project"
  assert:
    - exit_code: 0
    - file_contains:
        path: "test-project/package.json"
        text: "dependencies"
Tips
- Start with 3-5 easy tasks to validate your setup
- Use cliwatch validate to check your config before running
- Group related tasks with category for better reporting
- Use repeat on flaky tasks to get more reliable pass rates
- See the full Assertions Reference for all 10 assertion types
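A minimal sketch of how category and repeat might look on tasks. The page names both options but does not show their syntax, so the placement alongside id and intent, the label values, and the integer repeat count are all assumptions. Running cliwatch validate before a benchmark run should show whether the config is accepted.

```yaml
# Sketch only: the shape of category and repeat is assumed, not documented here.
- id: show-help
  intent: "Show the help information for mycli"
  difficulty: easy
  category: discovery      # assumed: a free-form label used to group results
  assert:
    - exit_code: 0
    - output_contains: "usage"

- id: list-with-filters
  intent: "List all items with status 'active', sorted by date, in JSON format"
  difficulty: medium
  category: core-workflows # assumed label
  repeat: 3                # assumed: number of runs for a flaky task
  assert:
    - exit_code: 0
    - output_contains: '"status": "active"'
```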