5.9 KiB
Benchmark Mode Reference
Requires subagents. Benchmark mode relies on parallel execution of many independent runs. Without subagents, use Eval mode for individual eval testing instead.
Benchmark mode runs a standardized, opinionated evaluation of a skill. It answers: "How well does this skill perform?"
Unlike Eval mode (which runs individual evals), Benchmark mode:
- Runs all evals (or a user-specified subset)
- Runs each eval 3 times per configuration (with_skill and without_skill) for variance
- Captures metrics with variance: pass_rate, time_seconds, tokens
- Uses the most capable model as analyzer to surface patterns and anomalies
- Produces persistent, structured output for cross-run analysis
When to Use
- Understanding performance: "How does my skill perform?"
- Cross-model comparison: "How does this skill perform across different models?"
- Regression detection: Compare benchmark results over time
- Skill validation: Does the skill actually add value over no-skill baseline?
Defaults
Always include no-skill baseline. Every benchmark runs both with_skill and without_skill configurations. This measures the value the skill adds - without a baseline, you can't know if the skill helps.
Suggest models for comparison. If the user wants to compare across models, suggest a couple of commonly available models in your environment. Don't hardcode model names - just recommend what's commonly used and available.
Run on current model by default. If the user doesn't specify, run on whatever model is currently active. For cross-model comparison, ask which models they'd like to test.
Terminology
| Term | Definition |
|---|---|
| Run | A single execution of a skill on an eval prompt |
| Configuration | The experimental condition: with_skill or without_skill |
| RunResult | Graded output of a run: expectations, metrics, notes |
| Run Summary | Statistical aggregates across runs: mean, stddev, min, max |
| Notes | Freeform observations from the analyzer |
Workflow
1. Setup
→ Choose workspace location (ask user, suggest <skill>-workspace/)
→ Verify evals exist
→ Determine which evals to run (all by default, or user subset)
2. Execute runs (parallel where possible)
→ For each eval:
→ 3 runs with_skill configuration
→ 3 runs without_skill configuration
→ Each run captures: transcript, outputs, metrics
→ Coordinator extracts from Task result: total_tokens, tool_uses, duration_ms
3. Grade runs (parallel)
→ Spawn grader for each run
→ Produces: expectations with pass/fail, notes
4. Aggregate results
→ Calculate run_summary per configuration:
→ pass_rate: mean, stddev, min, max
→ time_seconds: mean, stddev, min, max
→ tokens: mean, stddev, min, max
→ Calculate delta between configurations
5. Analyze (most capable model)
→ Review all results
→ Surface patterns, anomalies, observations as freeform notes
→ Examples:
- "Assertion X passes 100% in both configurations - may not differentiate skill value"
- "Eval 3 shows high variance (50% ± 40%) - may be flaky"
- "Skill adds 13s average time but improves pass rate by 50%"
6. Generate benchmark
→ benchmark.json - Structured data for analysis
→ benchmark.md - Human-readable summary
Spawning Executors
Run executor subagents in the background for parallelism. When each agent completes, capture the execution metrics (tokens consumed, tool calls, duration) from the completion notification.
For example, in Claude Code, background subagents deliver a <task-notification> with a <usage> block:
<!-- Example: Claude Code task notification format -->
<task-notification>
<task-id>...</task-id>
<status>completed</status>
<result>agent's actual text output</result>
<usage>total_tokens: 3700
tool_uses: 2
duration_ms: 32400</usage>
</task-notification>
Extract from each completed executor's metrics:
- total_tokens = total tokens consumed (input + output combined)
- tool_uses = number of tool calls made
- duration_ms / 1000 = execution time in seconds
The exact format of completion notifications varies by environment — look for token counts, tool call counts, and duration in whatever format your environment provides.
Record these per-run metrics alongside the grading results. The aggregate script can then compute mean/stddev/min/max across runs for each configuration.
Scripts
Use these scripts at specific points in the workflow:
After Grading (Step 4: Aggregate)
# Aggregate all grading.json files into benchmark summary
scripts/aggregate_benchmark.py <benchmark-dir> --skill-name <name> --skill-path <path>
This reads grading.json from each run directory and produces:
benchmark.json- Structured results with run_summary statisticsbenchmark.md- Human-readable summary table
Validation
# Validate benchmark output
scripts/validate_json.py <benchmark-dir>/benchmark.json
# Validate individual grading files
scripts/validate_json.py <run-dir>/grading.json --type grading
Initialize Templates
# Create empty benchmark.json with correct structure (if not using aggregate script)
scripts/init_json.py benchmark <benchmark-dir>/benchmark.json
Analyzer Instructions
The analyzer (always most capable) reviews all results and generates freeform notes. See agents/analyzer.md for the full prompt, but key responsibilities:
- Compare configurations: Which performed better? By how much?
- Identify patterns: Assertions that always pass/fail, high variance, etc.
- Surface anomalies: Unexpected results, broken evals, regressions
- Provide context: Why might these patterns exist?
The analyzer should NOT:
- Suggest improvements to the skill (that's Improve mode)
- Make subjective quality judgments beyond the data
- Speculate without evidence