mirror of https://github.com/anthropics/claude-plugins-official.git (synced 2026-03-19 23:23:07 +00:00)
chore(skill-creator): update to latest skill-creator
# Benchmark Mode Reference

**Requires subagents.** Benchmark mode relies on parallel execution of many independent runs. Without subagents, use Eval mode to test individual evals instead.

Benchmark mode runs a standardized, opinionated evaluation of a skill. It answers: **"How well does this skill perform?"**

Unlike Eval mode (which runs individual evals), Benchmark mode:
- Runs **all evals** (or a user-specified subset)
- Runs each eval **3 times per configuration** (`with_skill` and `without_skill`) to measure variance
- Captures **metrics with variance**: pass_rate, time_seconds, tokens
- Uses the **most capable model** as analyzer to surface patterns and anomalies
- Produces **persistent, structured output** for cross-run analysis

## When to Use

- **Understanding performance**: "How does my skill perform?"
- **Cross-model comparison**: "How does this skill perform across different models?"
- **Regression detection**: Compare benchmark results over time
- **Skill validation**: Does the skill actually add value over a no-skill baseline?

## Defaults

**Always include a no-skill baseline.** Every benchmark runs both `with_skill` and `without_skill` configurations. This measures the value the skill adds; without a baseline, you can't know whether the skill helps.

**Suggest models for comparison.** If the user wants to compare across models, suggest a couple of models that are commonly available in your environment. Don't hardcode model names; recommend whatever is commonly used and available.

**Run on the current model by default.** If the user doesn't specify, run on whatever model is currently active. For cross-model comparison, ask which models they'd like to test.

## Terminology

| Term | Definition |
|------|------------|
| **Run** | A single execution of a skill on an eval prompt |
| **Configuration** | The experimental condition: `with_skill` or `without_skill` |
| **RunResult** | Graded output of a run: expectations, metrics, notes |
| **Run Summary** | Statistical aggregates across runs: mean, stddev, min, max |
| **Notes** | Freeform observations from the analyzer |

## Workflow

```
1. Setup
   → Choose workspace location (ask the user; suggest <skill>-workspace/)
   → Verify evals exist
   → Determine which evals to run (all by default, or a user subset)

2. Execute runs (parallel where possible)
   → For each eval:
     → 3 runs in the with_skill configuration
     → 3 runs in the without_skill configuration
   → Each run captures: transcript, outputs, metrics
   → Coordinator extracts from the Task result: total_tokens, tool_uses, duration_ms

3. Grade runs (parallel)
   → Spawn a grader for each run
   → Produces: expectations with pass/fail, notes

4. Aggregate results
   → Calculate run_summary per configuration:
     → pass_rate: mean, stddev, min, max
     → time_seconds: mean, stddev, min, max
     → tokens: mean, stddev, min, max
   → Calculate the delta between configurations

5. Analyze (most capable model)
   → Review all results
   → Surface patterns, anomalies, and observations as freeform notes
   → Examples:
     - "Assertion X passes 100% in both configurations - may not differentiate skill value"
     - "Eval 3 shows high variance (50% ± 40%) - may be flaky"
     - "Skill adds 13s average time but improves pass rate by 50%"

6. Generate benchmark
   → benchmark.json - Structured data for analysis
   → benchmark.md - Human-readable summary
```
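
The aggregation in step 4 can be sketched in a few lines. This is a minimal illustration, assuming each graded run has been reduced to a dict with `configuration`, `pass_rate`, `time_seconds`, and `tokens`; the shipped `scripts/aggregate_benchmark.py` is the authoritative implementation.

```python
import statistics

def summarize(values):
    """Aggregate a list of per-run values into mean/stddev/min/max."""
    return {
        "mean": round(statistics.mean(values), 2),
        "stddev": round(statistics.stdev(values), 2) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }

def run_summary(runs):
    """Compute per-configuration aggregates plus the with/without delta."""
    summary = {}
    for config in ("with_skill", "without_skill"):
        config_runs = [r for r in runs if r["configuration"] == config]
        summary[config] = {
            metric: summarize([r[metric] for r in config_runs])
            for metric in ("pass_rate", "time_seconds", "tokens")
        }
    # Delta: how much the skill shifts each metric's mean
    summary["delta"] = {
        metric: round(summary["with_skill"][metric]["mean"]
                      - summary["without_skill"][metric]["mean"], 2)
        for metric in ("pass_rate", "time_seconds", "tokens")
    }
    return summary
```

With 3 runs per configuration, `stddev` is computed over just three samples, so treat it as a rough variance signal rather than a precise estimate.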

## Spawning Executors

Run executor subagents in the background for parallelism. When each agent completes, capture the execution metrics (tokens consumed, tool calls, duration) from the completion notification.

For example, in Claude Code, background subagents deliver a `<task-notification>` with a `<usage>` block:

```xml
<!-- Example: Claude Code task notification format -->
<task-notification>
  <task-id>...</task-id>
  <status>completed</status>
  <result>agent's actual text output</result>
  <usage>total_tokens: 3700
tool_uses: 2
duration_ms: 32400</usage>
</task-notification>
```

Extract from each completed executor's metrics:
- **total_tokens** = total tokens consumed (input + output combined)
- **tool_uses** = number of tool calls made
- **duration_ms** / 1000 = execution time in seconds

The exact format of completion notifications varies by environment — look for token counts, tool call counts, and duration in whatever format your environment provides.

Record these per-run metrics alongside the grading results. The aggregate script can then compute mean/stddev/min/max across runs for each configuration.
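
The extraction can be sketched as below — a minimal parser for the `key: value` lines in the example `<usage>` block above (field names as in the Claude Code example; adapt the parsing to whatever format your environment emits):

```python
def parse_usage(usage_text):
    """Parse 'key: value' lines from a usage block into run metrics."""
    metrics = {}
    for line in usage_text.strip().splitlines():
        key, _, value = line.partition(":")
        if value:
            metrics[key.strip()] = int(value.strip())
    # Derive seconds for the aggregate step (duration_ms / 1000)
    if "duration_ms" in metrics:
        metrics["time_seconds"] = metrics["duration_ms"] / 1000
    return metrics

# Sample usage text matching the notification example above
usage = """total_tokens: 3700
tool_uses: 2
duration_ms: 32400"""
```

Calling `parse_usage(usage)` yields the per-run metrics dict to record alongside the grading results.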

## Scripts

Use these scripts at specific points in the workflow:

### After Grading (Step 4: Aggregate)

```bash
# Aggregate all grading.json files into a benchmark summary
scripts/aggregate_benchmark.py <benchmark-dir> --skill-name <name> --skill-path <path>
```

This reads `grading.json` from each run directory and produces:
- `benchmark.json` - Structured results with run_summary statistics
- `benchmark.md` - Human-readable summary table

### Validation

```bash
# Validate benchmark output
scripts/validate_json.py <benchmark-dir>/benchmark.json

# Validate individual grading files
scripts/validate_json.py <run-dir>/grading.json --type grading
```

### Initialize Templates

```bash
# Create an empty benchmark.json with the correct structure (if not using the aggregate script)
scripts/init_json.py benchmark <benchmark-dir>/benchmark.json
```

## Analyzer Instructions

The analyzer (always the most capable model) reviews all results and generates freeform notes. See `agents/analyzer.md` for the full prompt; its key responsibilities are:

1. **Compare configurations**: Which performed better? By how much?
2. **Identify patterns**: Assertions that always pass/fail, high variance, etc.
3. **Surface anomalies**: Unexpected results, broken evals, regressions
4. **Provide context**: Why might these patterns exist?

The analyzer should NOT:
- Suggest improvements to the skill (that's Improve mode)
- Make subjective quality judgments beyond the data
- Speculate without evidence

---

# Eval Mode Reference

Eval mode runs skill evals and grades expectations. It enables measuring skill performance, comparing with/without-skill runs, and validating that skills add value.

## Purpose

Evals serve to:
1. **Set a floor** - Prove the skill helps Claude do something it couldn't do by default
2. **Raise the ceiling** - Enable iterating on skills to improve performance
3. **Measure holistically** - Capture metrics beyond pass/fail (time, tokens)
4. **Understand cross-model behavior** - Test skills across different models

## Eval Workflow

```
0. Choose Workspace Location
   → Ask the user where to put the workspace; suggest a sensible default

1. Check Dependencies
   → Scan the skill for dependencies, confirm availability with the user

2. Prepare (scripts/prepare_eval.py)
   → Create task, copy skill, stage files

3. Execute (agents/executor.md)
   → Update task to implementing, spawn executor sub-agent
   → Executor reads skill, runs prompt, saves transcript

4. Grade (agents/grader.md)
   → Update task to reviewing, spawn grader sub-agent
   → Grader reads transcript + outputs, evaluates expectations

5. Complete task, display results
   → Pass/fail per expectation, overall pass rate, metrics
```

## Step 0: Setup

**Before running any evals, read the output schemas:**

```bash
# Read to understand the JSON structures you'll produce
Read references/schemas.md
```

This ensures you know the expected structure for:
- `grading.json` - What the grader produces
- `metrics.json` - What the executor produces
- `timing.json` - Wall clock timing format

**Choose workspace location:**

1. **Suggest a default**: `<skill-name>-workspace/` as a sibling of the skill directory
2. **Ask the user** using AskUserQuestion — if the workspace is inside a git repo, suggest adding it to `.gitignore`
3. **Create the workspace directory** once confirmed

## Step 1: Check Dependencies

Before running evals, scan the skill for dependencies:

1. Read SKILL.md (including the `compatibility` frontmatter field)
2. Check referenced scripts for required tools
3. Present the list to the user and confirm availability

## Step 2: Prepare and Create Task

Run the prepare script and create a task:

```bash
scripts/prepare_eval.py <skill-path> <eval-id> --output-dir <workspace>/eval-<id>/
```

```python
# Pseudocode using the environment's task tools:
# create the task, then mark it as planning
task = TaskCreate(
    subject=f"Eval {eval_id}"
)
TaskUpdate(task, status="planning")
```

## Step 3: Execute

Update the task to `implementing` and run the executor. Record the start time first:

```bash
echo "{\"executor_start\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"}" > <run-dir>/timing.json
```

**With subagents**: Spawn an executor subagent with these instructions:

```
Read agents/executor.md at: <skill-creator-path>/agents/executor.md

Execute this eval:
- Skill path: <workspace>/skill/
- Prompt: <eval prompt from eval_metadata.json>
- Input files: <workspace>/eval-<id>/inputs/
- Save transcript to: <workspace>/eval-<id>/transcript.md
- Save outputs to: <workspace>/eval-<id>/outputs/
```

**Without subagents**: Read `agents/executor.md` and follow the procedure directly — execute the eval, save the transcript, and produce outputs inline.

After execution completes, update timing.json with executor_end and the duration.
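
One way to sketch that update, assuming the ISO-8601 UTC timestamp format written by the echo command above (the helper name `finish_executor_timing` is illustrative, not a shipped script):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def finish_executor_timing(timing_path):
    """Add executor_end and executor_duration_seconds to timing.json."""
    path = Path(timing_path)
    timing = json.loads(path.read_text())
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    start = datetime.strptime(timing["executor_start"], fmt)
    # Compare as naive UTC, matching the naive strptime result
    end = datetime.now(timezone.utc).replace(tzinfo=None)
    timing["executor_end"] = end.strftime(fmt)
    timing["executor_duration_seconds"] = round((end - start).total_seconds(), 1)
    path.write_text(json.dumps(timing, indent=2))
    return timing
```

The grader timestamps in step 4 can be appended the same way, finalizing the file after grading.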

## Step 4: Grade

Update the task to `reviewing` and run the grader:

**With subagents**: Spawn a grader subagent with these instructions:

```
Read agents/grader.md at: <skill-creator-path>/agents/grader.md

Grade these expectations:
- Assertions: <list from eval_metadata.json>
- Transcript: <workspace>/eval-<id>/transcript.md
- Outputs: <workspace>/eval-<id>/outputs/
- Save grading to: <workspace>/eval-<id>/grading.json
```

**Without subagents**: Read `agents/grader.md` and follow the procedure directly — evaluate expectations against the transcript and outputs, then save grading.json.

After grading completes, finalize timing.json.

## Step 5: Display Results

Update the task to `completed`. Display:
- Pass/fail status for each expectation, with evidence
- Overall pass rate
- Execution metrics from grading.json
- Wall clock time from timing.json
- **User notes summary**: Uncertainties, workarounds, and suggestions from the executor (these may reveal issues even when expectations pass)

## Comparison Workflow

To compare skill-enabled vs no-skill performance:

```
1. Prepare both runs (with the --no-skill flag for the baseline)
2. Execute both (parallel executors)
3. Grade both (parallel graders)
4. Blind-compare outputs
5. Report the winner with analysis
```

---

# Mode Workflow Diagrams

Visual representations of how each mode orchestrates the building blocks.

## Quality Assessment (Eval Mode)

Measures how well a skill performs on its evals.

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Executor   │────▶│   Grader    │────▶│  Aggregate  │
│  (N runs)   │     │  (N runs)   │     │   Results   │
└─────────────┘     └─────────────┘     └─────────────┘
       │                   │                   │
       ▼                   ▼                   ▼
transcript.md        grading.json        summary.json
user_notes.md        claims[]
metrics.json         user_notes_summary
```

Use for: Testing skill performance, validating skill value.

## Skill Improvement (Improve Mode)

Iteratively improves a skill through blind comparison.

```
┌─────────────────────────────────────────────────────────────────┐
│                         ITERATION LOOP                          │
│                                                                 │
│  ┌─────────┐   ┌─────────┐   ┌────────────┐   ┌──────────┐      │
│  │Executor │──▶│ Grader  │──▶│ Comparator │──▶│ Analyzer │      │
│  │(3 runs) │   │(3 runs) │   │  (blind)   │   │(post-hoc)│      │
│  └─────────┘   └─────────┘   └────────────┘   └──────────┘      │
│       │             │              │                │           │
│       │             │              │                ▼           │
│       │             │              │           suggestions      │
│       │             │              ▼                │           │
│       │             │         winner A/B            │           │
│       │             ▼                               │           │
│       │         pass_rate                           │           │
│       ▼                                             ▼           │
│  transcript                             ┌──────────────────┐    │
│  user_notes                             │ Apply to v(N+1)  │    │
│                                         └────────┬─────────┘    │
│                                                  │              │
└──────────────────────────────────────────────────┼──────────────┘
                                                   │
                                             (repeat until
                                            goal or timeout)
```

Use for: Optimizing skill performance, iterating on skill instructions.

## A/B Testing (with vs without skill)

Compares skill-enabled vs no-skill performance.

```
┌─────────────────┐
│    Executor     │──┐
│  (with skill)   │  │     ┌────────────┐     ┌─────────┐
└─────────────────┘  ├────▶│ Comparator │────▶│ Report  │
┌─────────────────┐  │     │  (blind)   │     │ winner  │
│    Executor     │──┘     └────────────┘     └─────────┘
│ (without skill) │               │
└─────────────────┘               ▼
                            rubric scores
                         expectation results
```

Use for: Proving the skill adds value, measuring skill impact.

## Skill Creation (Create Mode)

Interactive skill development with user feedback.

```
┌───────────────────────────────────────────────────────────────┐
│                       USER FEEDBACK LOOP                      │
│                                                               │
│  ┌───────────┐   ┌───────────┐   ┌───────────┐  ┌──────────┐  │
│  │ Interview │──▶│ Research  │──▶│   Draft   │─▶│   Run    │  │
│  │   User    │   │ via MCPs  │   │ SKILL.md  │  │ Example  │  │
│  └───────────┘   └───────────┘   └───────────┘  └──────────┘  │
│        ▲                                            │         │
│        │                                            ▼         │
│        │                                       user sees      │
│        │                                       transcript     │
│        │                                            │         │
│        └────────────────────────────────────────────┘         │
│                           refine                              │
└───────────────────────────────────────────────────────────────┘
                               │
                               ▼
                    (once criteria defined,
                     transition to Improve)
```

Use for: Creating new skills with tight user feedback.

## Skill Benchmark (Benchmark Mode)

Standardized performance measurement with variance analysis.

```
┌──────────────────────────────────────────────────────────────────────┐
│                             FOR EACH EVAL                            │
│                                                                      │
│ ┌─────────────────────────────────┐  ┌─────────────────────────────┐ │
│ │         WITH SKILL (3x)         │  │     WITHOUT SKILL (3x)      │ │
│ │  ┌─────────┐    ┌─────────┐     │  │ ┌─────────┐   ┌─────────┐   │ │
│ │  │Executor │───▶│ Grader  │     │  │ │Executor │──▶│ Grader  │   │ │
│ │  │  run 1  │    │  run 1  │     │  │ │  run 1  │   │  run 1  │   │ │
│ │  └─────────┘    └─────────┘     │  │ └─────────┘   └─────────┘   │ │
│ │  ┌─────────┐    ┌─────────┐     │  │ ┌─────────┐   ┌─────────┐   │ │
│ │  │Executor │───▶│ Grader  │     │  │ │Executor │──▶│ Grader  │   │ │
│ │  │  run 2  │    │  run 2  │     │  │ │  run 2  │   │  run 2  │   │ │
│ │  └─────────┘    └─────────┘     │  │ └─────────┘   └─────────┘   │ │
│ │  ┌─────────┐    ┌─────────┐     │  │ ┌─────────┐   ┌─────────┐   │ │
│ │  │Executor │───▶│ Grader  │     │  │ │Executor │──▶│ Grader  │   │ │
│ │  │  run 3  │    │  run 3  │     │  │ │  run 3  │   │  run 3  │   │ │
│ │  └─────────┘    └─────────┘     │  │ └─────────┘   └─────────┘   │ │
│ └────────────────┬────────────────┘  └──────────────┬──────────────┘ │
│                  │                                  │                │
│                  └────────────────┬─────────────────┘                │
│                                   ▼                                  │
│                          ┌─────────────────┐                         │
│                          │    Analyzer     │                         │
│                          │ (most capable)  │                         │
│                          └────────┬────────┘                         │
│                                   │                                  │
└───────────────────────────────────┼──────────────────────────────────┘
                                    ▼
                           ┌─────────────────┐
                           │ benchmark.json  │
                           │  benchmark.md   │
                           └─────────────────┘
```

Captures per-run: pass_rate, time_seconds, tokens, tool_calls, notes
Aggregates: mean, stddev, min, max for each metric across configurations
Analyzer surfaces: patterns, anomalies, and freeform observations

Use for: Understanding skill performance, comparing across models, tracking regressions.

---

## Inline Workflows (Without Subagents)

When subagents aren't available, the same building blocks execute sequentially in the main loop.

### Eval Mode (Inline)

```
┌───────────────────────────────────────────────────────┐
│                       MAIN LOOP                       │
│                                                       │
│  Read executor.md → Execute eval → Save outputs       │
│                          │                            │
│                          ▼                            │
│  Read grader.md → Grade expectations → Save grading   │
│                          │                            │
│                          ▼                            │
│                   Display results                     │
└───────────────────────────────────────────────────────┘
```

### Improve Mode (Inline)

```
┌───────────────────────────────────────────────────────┐
│                    ITERATION LOOP                     │
│                                                       │
│  Read executor.md → Execute (1 run) → Save outputs    │
│                          │                            │
│                          ▼                            │
│  Read grader.md → Grade expectations → Save grading   │
│                          │                            │
│                          ▼                            │
│  Compare with previous best (inline, not blind)       │
│                          │                            │
│                          ▼                            │
│  Analyze differences → Apply improvements to v(N+1)   │
│                          │                            │
│          (repeat until goal or timeout)               │
└───────────────────────────────────────────────────────┘
```

Benchmark mode requires subagents and has no inline equivalent.

---

# JSON Schemas

This document defines the JSON schemas used by skill-creator.

## Working with JSON Files

### Initialize a new file with correct structure

```bash
scripts/init_json.py <type> <output-path>

# Examples:
scripts/init_json.py evals evals/evals.json
scripts/init_json.py grading run-1/grading.json
scripts/init_json.py benchmark benchmarks/2026-01-15/benchmark.json
scripts/init_json.py metrics run-1/outputs/metrics.json
```

### Validate an existing file

```bash
scripts/validate_json.py <file-path> [--type <type>]

# Examples:
scripts/validate_json.py evals/evals.json
scripts/validate_json.py run-1/grading.json --type grading
```

The validator infers the type from the filename when possible.

---

## timing.json

Wall clock timing for a run. Located at `<run-dir>/timing.json`.

**How to capture:** When a subagent task completes, the task notification includes `total_tokens` and `duration_ms`. Save these immediately — they are not persisted anywhere else and cannot be recovered after the fact.

```json
{
  "total_tokens": 84852,
  "duration_ms": 23332,
  "executor_start": "2026-01-15T10:30:00Z",
  "executor_end": "2026-01-15T10:32:45Z",
  "executor_duration_seconds": 165.0,
  "grader_start": "2026-01-15T10:32:46Z",
  "grader_end": "2026-01-15T10:33:12Z",
  "grader_duration_seconds": 26.0,
  "total_duration_seconds": 191.0
}
```

---

## benchmark.json

Output from Benchmark mode. Located at `benchmarks/<timestamp>/benchmark.json`. An excerpt of the `runs` array (remaining fields elided):

```json
"runs": [
  {
    "eval_id": 1,
    "eval_name": "Ocean",
    "configuration": "with_skill",
    "run_number": 1,
    "result": {
      ...
    }
  }
]
```
@@ -308,10 +287,23 @@ Output from Benchmark mode. Located at `benchmarks/<timestamp>/benchmark.json`.
|
||||
|
||||
**Fields:**
|
||||
- `metadata`: Information about the benchmark run
|
||||
- `runs[]`: Individual run results with expectations and notes
|
||||
- `skill_name`: Name of the skill
|
||||
- `timestamp`: When the benchmark was run
|
||||
- `evals_run`: List of eval names or IDs
|
||||
- `runs_per_configuration`: Number of runs per config (e.g. 3)
|
||||
- `runs[]`: Individual run results
|
||||
- `eval_id`: Numeric eval identifier
|
||||
- `eval_name`: Human-readable eval name (used as section header in the viewer)
|
||||
- `configuration`: Must be `"with_skill"` or `"without_skill"` (the viewer uses this exact string for grouping and color coding)
|
||||
- `run_number`: Integer run number (1, 2, 3...)
|
||||
- `result`: Nested object with `pass_rate`, `passed`, `total`, `time_seconds`, `tokens`, `errors`
|
||||
- `run_summary`: Statistical aggregates per configuration
|
||||
- `with_skill` / `without_skill`: Each contains `pass_rate`, `time_seconds`, `tokens` objects with `mean` and `stddev` fields
|
||||
- `delta`: Difference strings like `"+0.50"`, `"+13.0"`, `"+1700"`
|
||||
- `notes`: Freeform observations from the analyzer
|
||||
|
||||
**Important:** The viewer reads these field names exactly. Using `config` instead of `configuration`, or putting `pass_rate` at the top level of a run instead of nested under `result`, will cause the viewer to show empty/zero values. Always reference this schema when generating benchmark.json manually.
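
A quick sanity check along these lines can catch such field-name mistakes before the viewer does. This is a minimal hypothetical sketch, not one of the shipped scripts (`validate_json.py` is the authoritative check):

```python
REQUIRED_RUN_FIELDS = {"eval_id", "eval_name", "configuration", "run_number", "result"}
REQUIRED_RESULT_FIELDS = {"pass_rate", "passed", "total", "time_seconds", "tokens", "errors"}

def check_benchmark_runs(benchmark):
    """Return a list of problems found in a benchmark dict's run entries."""
    problems = []
    for i, run in enumerate(benchmark.get("runs", [])):
        missing = REQUIRED_RUN_FIELDS - run.keys()
        if missing:
            problems.append(f"run {i}: missing {sorted(missing)}")
        if run.get("configuration") not in ("with_skill", "without_skill"):
            problems.append(f"run {i}: bad configuration {run.get('configuration')!r}")
        # Metrics must be nested under result, not at the top level of the run
        missing_result = REQUIRED_RESULT_FIELDS - run.get("result", {}).keys()
        if missing_result:
            problems.append(f"run {i}: result missing {sorted(missing_result)}")
    return problems
```

An empty list means the run entries match the field names above; anything else pinpoints which run violates the schema.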

---

## comparison.json