Add skill-creator plugin

Kenshiro Nakagawa
2026-02-17 17:02:51 -08:00
parent 261ce4fba4
commit 30975e61e3
20 changed files with 4879 additions and 0 deletions


@@ -0,0 +1,149 @@
# Benchmark Mode Reference
**Requires subagents.** Benchmark mode relies on parallel execution of many independent runs. Without subagents, use Eval mode for individual eval testing instead.
Benchmark mode runs a standardized, opinionated evaluation of a skill. It answers: **"How well does this skill perform?"**
Unlike Eval mode (which runs individual evals), Benchmark mode:
- Runs **all evals** (or a user-specified subset)
- Runs each eval **3 times per configuration** (with_skill and without_skill) for variance
- Captures **metrics with variance**: pass_rate, time_seconds, tokens
- Uses the **most capable model** as analyzer to surface patterns and anomalies
- Produces **persistent, structured output** for cross-run analysis
## When to Use
- **Understanding performance**: "How does my skill perform?"
- **Cross-model comparison**: "How does this skill perform across different models?"
- **Regression detection**: Compare benchmark results over time
- **Skill validation**: Does the skill actually add value over no-skill baseline?
## Defaults
**Always include no-skill baseline.** Every benchmark runs both `with_skill` and `without_skill` configurations. This measures the value the skill adds - without a baseline, you can't know if the skill helps.
**Suggest models for comparison.** If the user wants to compare across models, suggest a couple of commonly available models in your environment. Don't hardcode model names - just recommend what's commonly used and available.
**Run on current model by default.** If the user doesn't specify, run on whatever model is currently active. For cross-model comparison, ask which models they'd like to test.
## Terminology
| Term | Definition |
|------|------------|
| **Run** | A single execution of a skill on an eval prompt |
| **Configuration** | The experimental condition: `with_skill` or `without_skill` |
| **RunResult** | Graded output of a run: expectations, metrics, notes |
| **Run Summary** | Statistical aggregates across runs: mean, stddev, min, max |
| **Notes** | Freeform observations from the analyzer |
## Workflow
```
1. Setup
→ Choose workspace location (ask user, suggest <skill>-workspace/)
→ Verify evals exist
→ Determine which evals to run (all by default, or user subset)
2. Execute runs (parallel where possible)
→ For each eval:
→ 3 runs with_skill configuration
→ 3 runs without_skill configuration
→ Each run captures: transcript, outputs, metrics
→ Coordinator extracts from Task result: total_tokens, tool_uses, duration_ms
3. Grade runs (parallel)
→ Spawn grader for each run
→ Produces: expectations with pass/fail, notes
4. Aggregate results
→ Calculate run_summary per configuration:
→ pass_rate: mean, stddev, min, max
→ time_seconds: mean, stddev, min, max
→ tokens: mean, stddev, min, max
→ Calculate delta between configurations
5. Analyze (most capable model)
→ Review all results
→ Surface patterns, anomalies, observations as freeform notes
→ Examples:
- "Assertion X passes 100% in both configurations - may not differentiate skill value"
- "Eval 3 shows high variance (50% ± 40%) - may be flaky"
- "Skill adds 13s average time but improves pass rate by 50%"
6. Generate benchmark
→ benchmark.json - Structured data for analysis
→ benchmark.md - Human-readable summary
```
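The aggregation in step 4 can be sketched with the standard library (a minimal illustration, not the actual `aggregate_benchmark.py`; the metric names mirror the `run_summary` schema and the sample values are made up):

```python
import statistics

def summarize(values):
    """Compute the mean/stddev/min/max aggregate used in run_summary."""
    return {
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }

def run_summary(runs):
    """Aggregate per-run metrics for one configuration."""
    return {
        metric: summarize([r[metric] for r in runs])
        for metric in ("pass_rate", "time_seconds", "tokens")
    }

# Three hypothetical with_skill runs of one eval
with_skill = [
    {"pass_rate": 0.80, "time_seconds": 40.0, "tokens": 3600},
    {"pass_rate": 0.90, "time_seconds": 50.0, "tokens": 4000},
    {"pass_rate": 0.85, "time_seconds": 45.0, "tokens": 3800},
]
summary = run_summary(with_skill)
```

The same function runs over the `without_skill` runs, and the two summaries feed the delta calculation in step 4.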
## Spawning Executors
Run executor subagents in the background for parallelism. When each agent completes, capture the execution metrics (tokens consumed, tool calls, duration) from the completion notification.
For example, in Claude Code, background subagents deliver a `<task-notification>` with a `<usage>` block:
```xml
<!-- Example: Claude Code task notification format -->
<task-notification>
<task-id>...</task-id>
<status>completed</status>
<result>agent's actual text output</result>
<usage>total_tokens: 3700
tool_uses: 2
duration_ms: 32400</usage>
</task-notification>
```
Extract from each completed executor's metrics:
- **total_tokens** = total tokens consumed (input + output combined)
- **tool_uses** = number of tool calls made
- **duration_ms** / 1000 = execution time in seconds
The exact format of completion notifications varies by environment — look for token counts, tool call counts, and duration in whatever format your environment provides.
Record these per-run metrics alongside the grading results. The aggregate script can then compute mean/stddev/min/max across runs for each configuration.
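Extraction from a `key: value` usage block can be sketched as follows (hedged: the notification format varies by environment, so the regex here assumes the layout shown in the example above):

```python
import re

def parse_usage(usage_text):
    """Pull known metric fields out of a 'key: value' usage block."""
    metrics = {}
    for key in ("total_tokens", "tool_uses", "duration_ms"):
        m = re.search(rf"{key}:\s*(\d+)", usage_text)
        if m:
            metrics[key] = int(m.group(1))
    # Convert duration to seconds for the per-run record
    if "duration_ms" in metrics:
        metrics["time_seconds"] = metrics["duration_ms"] / 1000
    return metrics

usage = "total_tokens: 3700\ntool_uses: 2\nduration_ms: 32400"
parsed = parse_usage(usage)
```

Missing keys are simply omitted rather than defaulted, so downstream aggregation can distinguish "not reported" from zero.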
## Scripts
Use these scripts at specific points in the workflow:
### After Grading (Step 4: Aggregate)
```bash
# Aggregate all grading.json files into benchmark summary
scripts/aggregate_benchmark.py <benchmark-dir> --skill-name <name> --skill-path <path>
```
This reads `grading.json` from each run directory and produces:
- `benchmark.json` - Structured results with run_summary statistics
- `benchmark.md` - Human-readable summary table
### Validation
```bash
# Validate benchmark output
scripts/validate_json.py <benchmark-dir>/benchmark.json
# Validate individual grading files
scripts/validate_json.py <run-dir>/grading.json --type grading
```
### Initialize Templates
```bash
# Create empty benchmark.json with correct structure (if not using aggregate script)
scripts/init_json.py benchmark <benchmark-dir>/benchmark.json
```
## Analyzer Instructions
The analyzer (always most capable) reviews all results and generates freeform notes. See `agents/analyzer.md` for the full prompt, but key responsibilities:
1. **Compare configurations**: Which performed better? By how much?
2. **Identify patterns**: Assertions that always pass/fail, high variance, etc.
3. **Surface anomalies**: Unexpected results, broken evals, regressions
4. **Provide context**: Why might these patterns exist?
The analyzer should NOT:
- Suggest improvements to the skill (that's Improve mode)
- Make subjective quality judgments beyond the data
- Speculate without evidence


@@ -0,0 +1,144 @@
# Eval Mode Reference
Eval mode runs skill evals and grades expectations. It enables measuring skill performance, comparing with-skill and without-skill runs, and validating that skills add value.
## Purpose
Evals serve to:
1. **Set a floor** - Prove the skill helps Claude do something it couldn't by default
2. **Raise the ceiling** - Enable iterating on skills to improve performance
3. **Measure holistically** - Capture metrics beyond pass/fail (time, tokens)
4. **Understand cross-model behavior** - Test skills across different models
## Eval Workflow
```
0. Choose Workspace Location
→ Ask user where to put workspace, suggest sensible default
1. Check Dependencies
→ Scan skill for dependencies, confirm availability with user
2. Prepare (scripts/prepare_eval.py)
→ Creates task, copies skill, stages files
3. Execute (agents/executor.md)
→ Update task to implementing, spawn executor sub-agent
→ Executor reads skill, runs prompt, saves transcript
4. Grade (agents/grader.md)
→ Update task to reviewing, spawn grader sub-agent
→ Grader reads transcript + outputs, evaluates expectations
5. Complete task, display results
→ Pass/fail per expectation, overall pass rate, metrics
```
## Step 0: Setup
**Before running any evals, read the output schemas:**
```bash
# Read to understand the JSON structures you'll produce
Read references/schemas.md
```
This ensures you know the expected structure for:
- `grading.json` - What the grader produces
- `metrics.json` - What the executor produces
- `timing.json` - Wall clock timing format
**Choose workspace location:**
1. **Suggest default**: `<skill-name>-workspace/` as a sibling to the skill directory
2. **Ask the user** using AskUserQuestion — if the workspace is inside a git repo, suggest adding it to `.gitignore`
3. **Create the workspace directory** once confirmed
## Step 1: Check Dependencies
Before running evals, scan the skill for dependencies:
1. Read SKILL.md (including `compatibility` frontmatter field)
2. Check referenced scripts for required tools
3. Present to user and confirm availability
## Step 2: Prepare and Create Task
Run prepare script and create task:
```bash
scripts/prepare_eval.py <skill-path> <eval-id> --output-dir <workspace>/eval-<id>/
```
```python
task = TaskCreate(
subject=f"Eval {eval_id}"
)
TaskUpdate(task, status="planning")
```
## Step 3: Execute
Update task to `implementing` and run the executor:
```bash
echo "{\"executor_start\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"}" > <run-dir>/timing.json
```
**With subagents**: Spawn an executor subagent with these instructions:
```
Read agents/executor.md at: <skill-creator-path>/agents/executor.md
Execute this eval:
- Skill path: <workspace>/skill/
- Prompt: <eval prompt from eval_metadata.json>
- Input files: <workspace>/eval-<id>/inputs/
- Save transcript to: <workspace>/eval-<id>/transcript.md
- Save outputs to: <workspace>/eval-<id>/outputs/
```
**Without subagents**: Read `agents/executor.md` and follow the procedure directly — execute the eval, save the transcript, and produce outputs inline.
After execution completes, update timing.json with executor_end and duration.
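The timing update can be sketched as below (an illustrative helper, not a provided script; it assumes the ISO-8601 timestamp format shown in the `echo` command above, and the throwaway temp file stands in for `<run-dir>/timing.json`):

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def finish_executor_timing(timing_path):
    """Record executor_end and compute executor_duration_seconds."""
    timing = json.loads(Path(timing_path).read_text())
    start = datetime.fromisoformat(timing["executor_start"].replace("Z", "+00:00"))
    end = datetime.now(timezone.utc)
    timing["executor_end"] = end.strftime("%Y-%m-%dT%H:%M:%SZ")
    timing["executor_duration_seconds"] = round((end - start).total_seconds(), 1)
    Path(timing_path).write_text(json.dumps(timing, indent=2))
    return timing

# Demo with a throwaway timing file
start = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"executor_start": start}, f)
timing = finish_executor_timing(f.name)
```

The same pattern applies after grading: add `grader_start`/`grader_end` and sum the two durations into `total_duration_seconds`.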
## Step 4: Grade
Update task to `reviewing` and run the grader:
**With subagents**: Spawn a grader subagent with these instructions:
```
Read agents/grader.md at: <skill-creator-path>/agents/grader.md
Grade these expectations:
- Assertions: <list from eval_metadata.json>
- Transcript: <workspace>/eval-<id>/transcript.md
- Outputs: <workspace>/eval-<id>/outputs/
- Save grading to: <workspace>/eval-<id>/grading.json
```
**Without subagents**: Read `agents/grader.md` and follow the procedure directly — evaluate expectations against the transcript and outputs, then save grading.json.
After grading completes, finalize timing.json.
## Step 5: Display Results
Update task to `completed`. Display:
- Pass/fail status for each expectation with evidence
- Overall pass rate
- Execution metrics from grading.json
- Wall clock time from timing.json
- **User notes summary**: Uncertainties, workarounds, and suggestions from the executor (may reveal issues even when expectations pass)
## Comparison Workflow
To compare skill-enabled vs no-skill performance:
```
1. Prepare both runs (with --no-skill flag for baseline)
2. Execute both (parallel executors)
3. Grade both (parallel graders)
4. Blind-compare outputs
5. Report winner with analysis
```


@@ -0,0 +1,190 @@
# Mode Workflow Diagrams
Visual representations of how each mode orchestrates building blocks.
## Quality Assessment (Eval Mode)
Measures how well a skill performs on its evals.
```
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Executor │────▶│ Grader │────▶│ Aggregate │
│ (N runs) │ │ (N runs) │ │ Results │
└─────────────┘ └─────────────┘ └─────────────┘
│ │ │
▼ ▼ ▼
transcript.md grading.json summary.json
user_notes.md claims[]
metrics.json user_notes_summary
```
Use for: Testing skill performance, validating skill value.
## Skill Improvement (Improve Mode)
Iteratively improves a skill through blind comparison.
```
┌─────────────────────────────────────────────────────────────────┐
│ ITERATION LOOP │
│ │
│ ┌─────────┐ ┌─────────┐ ┌────────────┐ ┌──────────┐ │
│ │Executor │──▶│ Grader │──▶│ Comparator │──▶│ Analyzer │ │
│ │(3 runs) │ │(3 runs) │ │ (blind) │ │(post-hoc)│ │
│ └─────────┘ └─────────┘ └────────────┘ └──────────┘ │
│ │ │ │ │ │
│ │ │ │ ▼ │
│ │ │ │ suggestions │
│ │ │ ▼ │ │
│ │ │ winner A/B │ │
│ │ ▼ │ │
│ │ pass_rate │ │
│ ▼ ▼ │
│ transcript ┌──────────────────┐ │
│ user_notes │ Apply to v(N+1) │ │
│ └────────┬─────────┘ │
│ │ │
└───────────────────────────────────────────────────┼─────────────┘
(repeat until
goal or timeout)
```
Use for: Optimizing skill performance, iterating on skill instructions.
## A/B Testing (with vs without skill)
Compares skill-enabled vs no-skill performance.
```
┌─────────────────┐
│ Executor │──┐
│ (with skill) │ │ ┌────────────┐ ┌─────────┐
└─────────────────┘ ├────▶│ Comparator │────▶│ Report │
┌─────────────────┐ │ │ (blind) │ │ winner │
│ Executor │──┘ └────────────┘ └─────────┘
│ (without skill) │ │
└─────────────────┘ ▼
rubric scores
expectation results
```
Use for: Proving skill adds value, measuring skill impact.
## Skill Creation (Create Mode)
Interactive skill development with user feedback.
```
┌───────────────────────────────────────────────────────────────┐
│ USER FEEDBACK LOOP │
│ │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌──────────┐ │
│ │ Interview │──▶│ Research │──▶│ Draft │──▶│ Run │ │
│ │ User │ │ via MCPs │ │ SKILL.md │ │ Example │ │
│ └───────────┘ └───────────┘ └───────────┘ └──────────┘ │
│ ▲ │ │
│ │ ▼ │
│ │ user sees │
│ │ transcript │
│ │ │ │
│ └───────────────────────────────────────────────┘ │
│ refine │
└───────────────────────────────────────────────────────────────┘
(once criteria defined,
transition to Improve)
```
Use for: Creating new skills with tight user feedback.
## Skill Benchmark (Benchmark Mode)
Standardized performance measurement with variance analysis.
```
┌──────────────────────────────────────────────────────────────────────┐
│ FOR EACH EVAL │
│ │
│ ┌─────────────────────────────────┐ ┌─────────────────────────────┐│
│ │ WITH SKILL (3x) │ │ WITHOUT SKILL (3x) ││
│ │ ┌─────────┐ ┌─────────┐ │ │ ┌─────────┐ ┌─────────┐ ││
│ │ │Executor │ │ Grader │ ───┐ │ │ │Executor │ │ Grader │───┐││
│ │ │ run 1 │→│ run 1 │ │ │ │ │ run 1 │→│ run 1 │ │││
│ │ └─────────┘ └─────────┘ │ │ │ └─────────┘ └─────────┘ │││
│ │ ┌─────────┐ ┌─────────┐ │ │ │ ┌─────────┐ ┌─────────┐ │││
│ │ │Executor │ │ Grader │ ───┼───│──│──│Executor │ │ Grader │───┼││
│ │ │ run 2 │→│ run 2 │ │ │ │ │ run 2 │→│ run 2 │ │││
│ │ └─────────┘ └─────────┘ │ │ │ └─────────┘ └─────────┘ │││
│ │ ┌─────────┐ ┌─────────┐ │ │ │ ┌─────────┐ ┌─────────┐ │││
│ │ │Executor │ │ Grader │ ───┘ │ │ │Executor │ │ Grader │───┘││
│ │ │ run 3 │→│ run 3 │ │ │ │ run 3 │→│ run 3 │ ││
│ │ └─────────┘ └─────────┘ │ │ └─────────┘ └─────────┘ ││
│ └─────────────────────────────────┘ └─────────────────────────────┘│
│ │ │ │
│ └────────┬───────────┘ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Analyzer │ │
│ │ (most capable) │ │
│ └────────┬────────┘ │
│ │ │
└───────────────────────────────────────┼──────────────────────────────┘
┌─────────────────┐
│ benchmark.json │
│ benchmark.md │
└─────────────────┘
```
Captures per-run: pass_rate, time_seconds, tokens, tool_calls, notes
Aggregates: mean, stddev, min, max for each metric across configurations
Analyzer surfaces: patterns, anomalies, and freeform observations
Use for: Understanding skill performance, comparing across models, tracking regressions.
---
## Inline Workflows (Without Subagents)
When subagents aren't available, the same building blocks execute sequentially in the main loop.
### Eval Mode (Inline)
```
┌───────────────────────────────────────────────────────┐
│ MAIN LOOP │
│ │
│ Read executor.md → Execute eval → Save outputs │
│ │ │
│ ▼ │
│ Read grader.md → Grade expectations → Save grading │
│ │ │
│ ▼ │
│ Display results │
└───────────────────────────────────────────────────────┘
```
### Improve Mode (Inline)
```
┌───────────────────────────────────────────────────────┐
│ ITERATION LOOP │
│ │
│ Read executor.md → Execute (1 run) → Save outputs │
│ │ │
│ ▼ │
│ Read grader.md → Grade expectations → Save grading │
│ │ │
│ ▼ │
│ Compare with previous best (inline, not blind) │
│ │ │
│ ▼ │
│ Analyze differences → Apply improvements to v(N+1) │
│ │ │
│ (repeat until goal or timeout) │
└───────────────────────────────────────────────────────┘
```
Benchmark mode requires subagents and has no inline equivalent.


@@ -0,0 +1,438 @@
# JSON Schemas
This document defines the JSON schemas used by skill-creator.
## Working with JSON Files
### Initialize a new file with correct structure
```bash
scripts/init_json.py <type> <output-path>
# Examples:
scripts/init_json.py evals evals/evals.json
scripts/init_json.py grading run-1/grading.json
scripts/init_json.py benchmark benchmarks/2026-01-15/benchmark.json
scripts/init_json.py metrics run-1/outputs/metrics.json
```
### Validate an existing file
```bash
scripts/validate_json.py <file-path> [--type <type>]
# Examples:
scripts/validate_json.py evals/evals.json
scripts/validate_json.py run-1/grading.json --type grading
```
The validator infers the type from the filename when possible.
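The inference might work like this (a sketch of the behavior described above, not the actual `validate_json.py` logic; the set of recognized types is an assumption based on the schemas in this document):

```python
from pathlib import Path

# Assumed set of recognized schema types, matched against the filename stem
KNOWN_TYPES = ("evals", "grading", "benchmark", "metrics", "timing")

def infer_type(path):
    """Return the schema type implied by the filename, or None."""
    stem = Path(path).stem
    return stem if stem in KNOWN_TYPES else None

inferred = infer_type("run-1/grading.json")
```

When the filename doesn't match a known type, pass `--type` explicitly.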
---
## evals.json
Defines the evals for a skill. Located at `evals/evals.json` within the skill directory.
```json
{
"skill_name": "example-skill",
"evals": [
{
"id": 1,
"prompt": "User's example prompt",
"expected_output": "Description of expected result",
"files": ["evals/files/sample1.pdf"],
"expectations": [
"The output includes X",
"The skill used script Y"
]
}
]
}
```
**Fields:**
- `skill_name`: Name matching the skill's frontmatter
- `evals[].id`: Unique integer identifier
- `evals[].prompt`: The task to execute
- `evals[].expected_output`: Human-readable description of success
- `evals[].files`: Optional list of input file paths (relative to skill root)
- `evals[].expectations`: List of verifiable statements
---
## history.json
Tracks version progression in Improve mode. Located at workspace root.
```json
{
"started_at": "2026-01-15T10:30:00Z",
"skill_name": "pdf",
"current_best": "v2",
"iterations": [
{
"version": "v0",
"parent": null,
"expectation_pass_rate": 0.65,
"grading_result": "baseline",
"is_current_best": false
},
{
"version": "v1",
"parent": "v0",
"expectation_pass_rate": 0.75,
"grading_result": "won",
"is_current_best": false
},
{
"version": "v2",
"parent": "v1",
"expectation_pass_rate": 0.85,
"grading_result": "won",
"is_current_best": true
}
]
}
```
**Fields:**
- `started_at`: ISO timestamp of when improvement started
- `skill_name`: Name of the skill being improved
- `current_best`: Version identifier of the best performer
- `iterations[].version`: Version identifier (v0, v1, ...)
- `iterations[].parent`: Parent version this was derived from
- `iterations[].expectation_pass_rate`: Pass rate from grading
- `iterations[].grading_result`: "baseline", "won", "lost", or "tie"
- `iterations[].is_current_best`: Whether this is the current best version
---
## grading.json
Output from the grader agent. Located at `<run-dir>/grading.json`.
```json
{
"expectations": [
{
"text": "The output includes the name 'John Smith'",
"passed": true,
"evidence": "Found in transcript Step 3: 'Extracted names: John Smith, Sarah Johnson'"
},
{
"text": "The spreadsheet has a SUM formula in cell B10",
"passed": false,
"evidence": "No spreadsheet was created. The output was a text file."
}
],
"summary": {
"passed": 2,
"failed": 1,
"total": 3,
"pass_rate": 0.67
},
"execution_metrics": {
"tool_calls": {
"Read": 5,
"Write": 2,
"Bash": 8
},
"total_tool_calls": 15,
"total_steps": 6,
"errors_encountered": 0,
"output_chars": 12450,
"transcript_chars": 3200
},
"timing": {
"executor_duration_seconds": 165.0,
"grader_duration_seconds": 26.0,
"total_duration_seconds": 191.0
},
"claims": [
{
"claim": "The form has 12 fillable fields",
"type": "factual",
"verified": true,
"evidence": "Counted 12 fields in field_info.json"
}
],
"user_notes_summary": {
"uncertainties": ["Used 2023 data, may be stale"],
"needs_review": [],
"workarounds": ["Fell back to text overlay for non-fillable fields"]
},
"eval_feedback": {
"suggestions": [
{
"assertion": "The output includes the name 'John Smith'",
"reason": "A hallucinated document that mentions the name would also pass"
}
],
"overall": "Assertions check presence but not correctness."
}
}
```
**Fields:**
- `expectations[]`: Graded expectations with evidence
- `summary`: Aggregate pass/fail counts
- `execution_metrics`: Tool usage and output size (from executor's metrics.json)
- `timing`: Wall clock timing (from timing.json)
- `claims`: Extracted and verified claims from the output
- `user_notes_summary`: Issues flagged by the executor
- `eval_feedback`: (optional) Improvement suggestions for the evals, only present when the grader identifies issues worth raising
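The `summary` block is derivable from the graded expectations; a minimal sketch (the grader's own implementation may differ, but the arithmetic matches the example above):

```python
def summarize_expectations(expectations):
    """Derive the summary block from a list of graded expectations."""
    passed = sum(1 for e in expectations if e["passed"])
    total = len(expectations)
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        "pass_rate": round(passed / total, 2) if total else 0.0,
    }

# Hypothetical graded expectations for one run
graded = [
    {"text": "Output includes the name 'John Smith'", "passed": True},
    {"text": "The spreadsheet has a SUM formula in cell B10", "passed": False},
    {"text": "The form has 12 fillable fields", "passed": True},
]
summary = summarize_expectations(graded)
```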
---
## metrics.json
Output from the executor agent. Located at `<run-dir>/outputs/metrics.json`.
```json
{
"tool_calls": {
"Read": 5,
"Write": 2,
"Bash": 8,
"Edit": 1,
"Glob": 2,
"Grep": 0
},
"total_tool_calls": 18,
"total_steps": 6,
"files_created": ["filled_form.pdf", "field_values.json"],
"errors_encountered": 0,
"output_chars": 12450,
"transcript_chars": 3200
}
```
**Fields:**
- `tool_calls`: Count per tool type
- `total_tool_calls`: Sum of all tool calls
- `total_steps`: Number of major execution steps
- `files_created`: List of output files created
- `errors_encountered`: Number of errors during execution
- `output_chars`: Total character count of output files
- `transcript_chars`: Character count of transcript
---
## timing.json
Wall clock timing for a run. Located at `<run-dir>/timing.json`.
```json
{
"executor_start": "2026-01-15T10:30:00Z",
"executor_end": "2026-01-15T10:32:45Z",
"executor_duration_seconds": 165.0,
"grader_start": "2026-01-15T10:32:46Z",
"grader_end": "2026-01-15T10:33:12Z",
"grader_duration_seconds": 26.0,
"total_duration_seconds": 191.0
}
```
---
## benchmark.json
Output from Benchmark mode. Located at `benchmarks/<timestamp>/benchmark.json`.
```json
{
"metadata": {
"skill_name": "pdf",
"skill_path": "/path/to/pdf",
"executor_model": "claude-sonnet-4-20250514",
"analyzer_model": "most-capable-model",
"timestamp": "2026-01-15T10:30:00Z",
"evals_run": [1, 2, 3],
"runs_per_configuration": 3
},
"runs": [
{
"eval_id": 1,
"configuration": "with_skill",
"run_number": 1,
"result": {
"pass_rate": 0.85,
"passed": 6,
"failed": 1,
"total": 7,
"time_seconds": 42.5,
"tokens": 3800,
"tool_calls": 18,
"errors": 0
},
"expectations": [
{"text": "...", "passed": true, "evidence": "..."}
],
"notes": [
"Used 2023 data, may be stale",
"Fell back to text overlay for non-fillable fields"
]
}
],
"run_summary": {
"with_skill": {
"pass_rate": {"mean": 0.85, "stddev": 0.05, "min": 0.80, "max": 0.90},
"time_seconds": {"mean": 45.0, "stddev": 12.0, "min": 32.0, "max": 58.0},
"tokens": {"mean": 3800, "stddev": 400, "min": 3200, "max": 4100}
},
"without_skill": {
"pass_rate": {"mean": 0.35, "stddev": 0.08, "min": 0.28, "max": 0.45},
"time_seconds": {"mean": 32.0, "stddev": 8.0, "min": 24.0, "max": 42.0},
"tokens": {"mean": 2100, "stddev": 300, "min": 1800, "max": 2500}
},
"delta": {
"pass_rate": "+0.50",
"time_seconds": "+13.0",
"tokens": "+1700"
}
},
"notes": [
"Assertion 'Output is a PDF file' passes 100% in both configurations - may not differentiate skill value",
"Eval 3 shows high variance (50% ± 40%) - may be flaky or model-dependent",
"Without-skill runs consistently fail on table extraction expectations",
"Skill adds 13s average execution time but improves pass rate by 50%"
]
}
```
**Fields:**
- `metadata`: Information about the benchmark run
- `runs[]`: Individual run results with expectations and notes
- `run_summary`: Statistical aggregates per configuration
- `notes`: Freeform observations from the analyzer
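The `delta` block is the difference in configuration means, formatted with an explicit sign. A sketch consistent with the example above (the decimal precision per metric is an assumption inferred from the sample values):

```python
def signed(value, decimals):
    """Format a delta with an explicit leading sign."""
    return f"{value:+.{decimals}f}" if decimals else f"{value:+d}"

def compute_delta(with_skill, without_skill):
    """Difference of per-metric means: with_skill minus without_skill."""
    return {
        "pass_rate": signed(
            with_skill["pass_rate"]["mean"] - without_skill["pass_rate"]["mean"], 2),
        "time_seconds": signed(
            with_skill["time_seconds"]["mean"] - without_skill["time_seconds"]["mean"], 1),
        "tokens": signed(
            with_skill["tokens"]["mean"] - without_skill["tokens"]["mean"], 0),
    }

# Means taken from the run_summary example above
ws = {"pass_rate": {"mean": 0.85}, "time_seconds": {"mean": 45.0}, "tokens": {"mean": 3800}}
wo = {"pass_rate": {"mean": 0.35}, "time_seconds": {"mean": 32.0}, "tokens": {"mean": 2100}}
delta = compute_delta(ws, wo)
```

A positive `pass_rate` delta with a positive `time_seconds` delta is the common trade-off the analyzer should call out, as in the notes above.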
---
## comparison.json
Output from blind comparator. Located at `<grading-dir>/comparison-N.json`.
```json
{
"winner": "A",
"reasoning": "Output A provides a complete solution with proper formatting and all required fields. Output B is missing the date field and has formatting inconsistencies.",
"rubric": {
"A": {
"content": {
"correctness": 5,
"completeness": 5,
"accuracy": 4
},
"structure": {
"organization": 4,
"formatting": 5,
"usability": 4
},
"content_score": 4.7,
"structure_score": 4.3,
"overall_score": 9.0
},
"B": {
"content": {
"correctness": 3,
"completeness": 2,
"accuracy": 3
},
"structure": {
"organization": 3,
"formatting": 2,
"usability": 3
},
"content_score": 2.7,
"structure_score": 2.7,
"overall_score": 5.4
}
},
"output_quality": {
"A": {
"score": 9,
"strengths": ["Complete solution", "Well-formatted", "All fields present"],
"weaknesses": ["Minor style inconsistency in header"]
},
"B": {
"score": 5,
"strengths": ["Readable output", "Correct basic structure"],
"weaknesses": ["Missing date field", "Formatting inconsistencies", "Partial data extraction"]
}
},
"expectation_results": {
"A": {
"passed": 4,
"total": 5,
"pass_rate": 0.80,
"details": [
{"text": "Output includes name", "passed": true}
]
},
"B": {
"passed": 3,
"total": 5,
"pass_rate": 0.60,
"details": [
{"text": "Output includes name", "passed": true}
]
}
}
}
```
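The aggregate scores in the rubric follow from the subscores: each category score is the rounded mean of its three subscores, and `overall_score` is their sum. A sketch matching the example's arithmetic (the comparator's exact formula is not specified, so treat this as an inferred convention):

```python
def rubric_scores(content, structure):
    """Derive aggregate scores from 1-5 rubric subscores."""
    content_score = round(sum(content.values()) / len(content), 1)
    structure_score = round(sum(structure.values()) / len(structure), 1)
    return {
        "content_score": content_score,
        "structure_score": structure_score,
        "overall_score": round(content_score + structure_score, 1),
    }

# Output A's subscores from the example above
scores_a = rubric_scores(
    {"correctness": 5, "completeness": 5, "accuracy": 4},
    {"organization": 4, "formatting": 5, "usability": 4},
)
```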
---
## analysis.json
Output from post-hoc analyzer. Located at `<grading-dir>/analysis.json`.
```json
{
"comparison_summary": {
"winner": "A",
"winner_skill": "path/to/winner/skill",
"loser_skill": "path/to/loser/skill",
"comparator_reasoning": "Brief summary of why comparator chose winner"
},
"winner_strengths": [
"Clear step-by-step instructions for handling multi-page documents",
"Included validation script that caught formatting errors"
],
"loser_weaknesses": [
"Vague instruction 'process the document appropriately' led to inconsistent behavior",
"No script for validation, agent had to improvise"
],
"instruction_following": {
"winner": {
"score": 9,
"issues": ["Minor: skipped optional logging step"]
},
"loser": {
"score": 6,
"issues": [
"Did not use the skill's formatting template",
"Invented own approach instead of following step 3"
]
}
},
"improvement_suggestions": [
{
"priority": "high",
"category": "instructions",
"suggestion": "Replace 'process the document appropriately' with explicit steps",
"expected_impact": "Would eliminate ambiguity that caused inconsistent behavior"
}
],
"transcript_insights": {
"winner_execution_pattern": "Read skill -> Followed 5-step process -> Used validation script",
"loser_execution_pattern": "Read skill -> Unclear on approach -> Tried 3 different methods"
}
}
```