chore(skill-creator): update to latest skill-creator

This commit is contained in:
Kenshiro Nakagawa
2026-02-24 17:10:46 -08:00
parent 99e11d9592
commit e05013d229
23 changed files with 3634 additions and 2847 deletions

View File

@@ -1,149 +0,0 @@
# Benchmark Mode Reference
**Requires subagents.** Benchmark mode relies on parallel execution of many independent runs. Without subagents, use Eval mode for individual eval testing instead.
Benchmark mode runs a standardized, opinionated evaluation of a skill. It answers: **"How well does this skill perform?"**
Unlike Eval mode (which runs individual evals), Benchmark mode:
- Runs **all evals** (or a user-specified subset)
- Runs each eval **3 times per configuration** (with_skill and without_skill) for variance
- Captures **metrics with variance**: pass_rate, time_seconds, tokens
- Uses the **most capable model** as analyzer to surface patterns and anomalies
- Produces **persistent, structured output** for cross-run analysis
## When to Use
- **Understanding performance**: "How does my skill perform?"
- **Cross-model comparison**: "How does this skill perform across different models?"
- **Regression detection**: Compare benchmark results over time
- **Skill validation**: Does the skill actually add value over no-skill baseline?
## Defaults
**Always include no-skill baseline.** Every benchmark runs both `with_skill` and `without_skill` configurations. This measures the value the skill adds - without a baseline, you can't know if the skill helps.
**Suggest models for comparison.** If the user wants to compare across models, suggest a couple of commonly available models in your environment. Don't hardcode model names - just recommend what's commonly used and available.
**Run on current model by default.** If the user doesn't specify, run on whatever model is currently active. For cross-model comparison, ask which models they'd like to test.
## Terminology
| Term | Definition |
|------|------------|
| **Run** | A single execution of a skill on an eval prompt |
| **Configuration** | The experimental condition: `with_skill` or `without_skill` |
| **RunResult** | Graded output of a run: expectations, metrics, notes |
| **Run Summary** | Statistical aggregates across runs: mean, stddev, min, max |
| **Notes** | Freeform observations from the analyzer |
## Workflow
```
1. Setup
→ Choose workspace location (ask user, suggest <skill>-workspace/)
→ Verify evals exist
→ Determine which evals to run (all by default, or user subset)
2. Execute runs (parallel where possible)
→ For each eval:
→ 3 runs with_skill configuration
→ 3 runs without_skill configuration
→ Each run captures: transcript, outputs, metrics
→ Coordinator extracts from Task result: total_tokens, tool_uses, duration_ms
3. Grade runs (parallel)
→ Spawn grader for each run
→ Produces: expectations with pass/fail, notes
4. Aggregate results
→ Calculate run_summary per configuration:
→ pass_rate: mean, stddev, min, max
→ time_seconds: mean, stddev, min, max
→ tokens: mean, stddev, min, max
→ Calculate delta between configurations
5. Analyze (most capable model)
→ Review all results
→ Surface patterns, anomalies, observations as freeform notes
→ Examples:
- "Assertion X passes 100% in both configurations - may not differentiate skill value"
- "Eval 3 shows high variance (50% ± 40%) - may be flaky"
- "Skill adds 13s average time but improves pass rate by 50%"
6. Generate benchmark
→ benchmark.json - Structured data for analysis
→ benchmark.md - Human-readable summary
```
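The aggregation in step 4 can be sketched as follows. This is a minimal illustration, not the actual `aggregate_benchmark.py`, and the sample pass rates are made up:

```python
import statistics

def summarize(values):
    # run_summary statistics for one metric across runs
    return {
        "mean": round(statistics.mean(values), 4),
        "stddev": round(statistics.stdev(values), 4) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }

# Illustrative pass rates from three runs of one eval per configuration
with_skill = summarize([1.0, 0.5, 1.0])
without_skill = summarize([0.5, 0.5, 0.0])
delta = round(with_skill["mean"] - without_skill["mean"], 4)
```

The same helper applies unchanged to `time_seconds` and `tokens`; only the input lists differ.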
## Spawning Executors
Run executor subagents in the background for parallelism. When each agent completes, capture the execution metrics (tokens consumed, tool calls, duration) from the completion notification.
For example, in Claude Code, background subagents deliver a `<task-notification>` with a `<usage>` block:
```xml
<!-- Example: Claude Code task notification format -->
<task-notification>
<task-id>...</task-id>
<status>completed</status>
<result>agent's actual text output</result>
<usage>total_tokens: 3700
tool_uses: 2
duration_ms: 32400</usage>
</task-notification>
```
Extract from each completed executor's metrics:
- **total_tokens** = total tokens consumed (input + output combined)
- **tool_uses** = number of tool calls made
- **duration_ms** / 1000 = execution time in seconds
The exact format of completion notifications varies by environment — look for token counts, tool call counts, and duration in whatever format your environment provides.
Record these per-run metrics alongside the grading results. The aggregate script can then compute mean/stddev/min/max across runs for each configuration.
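Extracting these fields from a completion notification can be sketched like this. The regex-based parsing is an assumption; adapt it to whatever format your environment actually emits:

```python
import re

def parse_usage(notification_text):
    """Pull total_tokens, tool_uses, and duration_ms out of a
    completion notification's usage block."""
    metrics = {}
    for key in ("total_tokens", "tool_uses", "duration_ms"):
        match = re.search(rf"{key}:\s*(\d+)", notification_text)
        if match:
            metrics[key] = int(match.group(1))
    # Convert to the seconds value used in run_summary
    if "duration_ms" in metrics:
        metrics["time_seconds"] = metrics["duration_ms"] / 1000
    return metrics

usage = parse_usage("total_tokens: 3700\ntool_uses: 2\nduration_ms: 32400")
```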
## Scripts
Use these scripts at specific points in the workflow:
### After Grading (Step 4: Aggregate)
```bash
# Aggregate all grading.json files into benchmark summary
scripts/aggregate_benchmark.py <benchmark-dir> --skill-name <name> --skill-path <path>
```
This reads `grading.json` from each run directory and produces:
- `benchmark.json` - Structured results with run_summary statistics
- `benchmark.md` - Human-readable summary table
### Validation
```bash
# Validate benchmark output
scripts/validate_json.py <benchmark-dir>/benchmark.json
# Validate individual grading files
scripts/validate_json.py <run-dir>/grading.json --type grading
```
### Initialize Templates
```bash
# Create empty benchmark.json with correct structure (if not using aggregate script)
scripts/init_json.py benchmark <benchmark-dir>/benchmark.json
```
## Analyzer Instructions
The analyzer (always most capable) reviews all results and generates freeform notes. See `agents/analyzer.md` for the full prompt, but key responsibilities:
1. **Compare configurations**: Which performed better? By how much?
2. **Identify patterns**: Assertions that always pass/fail, high variance, etc.
3. **Surface anomalies**: Unexpected results, broken evals, regressions
4. **Provide context**: Why might these patterns exist?
The analyzer should NOT:
- Suggest improvements to the skill (that's Improve mode)
- Make subjective quality judgments beyond the data
- Speculate without evidence

View File

@@ -1,144 +0,0 @@
# Eval Mode Reference
Eval mode runs skill evals and grades expectations, enabling you to measure skill performance, compare runs with and without the skill, and validate that the skill adds value.
## Purpose
Evals serve to:
1. **Set a floor** - Prove the skill helps Claude do something it couldn't by default
2. **Raise the ceiling** - Enable iterating on skills to improve performance
3. **Measure holistically** - Capture metrics beyond pass/fail (time, tokens)
4. **Understand cross-model behavior** - Test skills across different models
## Eval Workflow
```
0. Choose Workspace Location
→ Ask user where to put workspace, suggest sensible default
1. Check Dependencies
→ Scan skill for dependencies, confirm availability with user
2. Prepare (scripts/prepare_eval.py)
   → Create task, copy skill, stage files
3. Execute (agents/executor.md)
→ Update task to implementing, spawn executor sub-agent
→ Executor reads skill, runs prompt, saves transcript
4. Grade (agents/grader.md)
→ Update task to reviewing, spawn grader sub-agent
→ Grader reads transcript + outputs, evaluates expectations
5. Complete task, display results
→ Pass/fail per expectation, overall pass rate, metrics
```
## Step 0: Setup
**Before running any evals, read the output schemas:**
```bash
# Read to understand the JSON structures you'll produce
Read references/schemas.md
```
This ensures you know the expected structure for:
- `grading.json` - What the grader produces
- `metrics.json` - What the executor produces
- `timing.json` - Wall clock timing format
**Choose workspace location:**
1. **Suggest default**: `<skill-name>-workspace/` as a sibling to the skill directory
2. **Ask the user** using AskUserQuestion — if the workspace is inside a git repo, suggest adding it to `.gitignore`
3. **Create the workspace directory** once confirmed
## Step 1: Check Dependencies
Before running evals, scan the skill for dependencies:
1. Read SKILL.md (including `compatibility` frontmatter field)
2. Check referenced scripts for required tools
3. Present to user and confirm availability
## Step 2: Prepare and Create Task
Run prepare script and create task:
```bash
scripts/prepare_eval.py <skill-path> <eval-id> --output-dir <workspace>/eval-<id>/
```
```python
task = TaskCreate(
    subject=f"Eval {eval_id}"
)
TaskUpdate(task, status="planning")
```
## Step 3: Execute
Update task to `implementing` and run the executor:
```bash
echo "{\"executor_start\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"}" > <run-dir>/timing.json
```
**With subagents**: Spawn an executor subagent with these instructions:
```
Read agents/executor.md at: <skill-creator-path>/agents/executor.md
Execute this eval:
- Skill path: <workspace>/skill/
- Prompt: <eval prompt from eval_metadata.json>
- Input files: <workspace>/eval-<id>/inputs/
- Save transcript to: <workspace>/eval-<id>/transcript.md
- Save outputs to: <workspace>/eval-<id>/outputs/
```
**Without subagents**: Read `agents/executor.md` and follow the procedure directly — execute the eval, save the transcript, and produce outputs inline.
After execution completes, update timing.json with executor_end and duration.
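One way to record a phase's end time and duration is sketched below. The `finish_phase` helper and its timestamp format are assumptions, chosen to match the `date -u` command shown in Step 3:

```python
import json
from datetime import datetime, timezone

ISO = "%Y-%m-%dT%H:%M:%SZ"

def finish_phase(timing_path, phase, now=None):
    # Record <phase>_end and <phase>_duration_seconds, assuming
    # <phase>_start was written when the phase began.
    now = now or datetime.now(timezone.utc)
    with open(timing_path) as f:
        timing = json.load(f)
    start = datetime.strptime(timing[f"{phase}_start"], ISO).replace(
        tzinfo=timezone.utc)
    timing[f"{phase}_end"] = now.strftime(ISO)
    timing[f"{phase}_duration_seconds"] = round((now - start).total_seconds(), 1)
    with open(timing_path, "w") as f:
        json.dump(timing, f, indent=2)
```

Call it once after the executor finishes (`finish_phase(path, "executor")`) and again after the grader.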
## Step 4: Grade
Update task to `reviewing` and run the grader:
**With subagents**: Spawn a grader subagent with these instructions:
```
Read agents/grader.md at: <skill-creator-path>/agents/grader.md
Grade these expectations:
- Assertions: <list from eval_metadata.json>
- Transcript: <workspace>/eval-<id>/transcript.md
- Outputs: <workspace>/eval-<id>/outputs/
- Save grading to: <workspace>/eval-<id>/grading.json
```
**Without subagents**: Read `agents/grader.md` and follow the procedure directly — evaluate expectations against the transcript and outputs, then save grading.json.
After grading completes, finalize timing.json.
## Step 5: Display Results
Update task to `completed`. Display:
- Pass/fail status for each expectation with evidence
- Overall pass rate
- Execution metrics from grading.json
- Wall clock time from timing.json
- **User notes summary**: Uncertainties, workarounds, and suggestions from the executor (may reveal issues even when expectations pass)
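Rendering the per-expectation display can be sketched as below. The grading.json field names used here (`expectations`, `passed`, `text`) are assumptions; check them against `references/schemas.md` before relying on them:

```python
import json

def display_results(grading_path):
    # Print pass/fail per expectation plus the overall pass rate.
    with open(grading_path) as f:
        grading = json.load(f)
    expectations = grading.get("expectations", [])
    for exp in expectations:
        status = "PASS" if exp.get("passed") else "FAIL"
        print(f"[{status}] {exp.get('text', '')}")
    if not expectations:
        return None
    rate = sum(1 for e in expectations if e.get("passed")) / len(expectations)
    print(f"Pass rate: {rate:.0%}")
    return rate
```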
## Comparison Workflow
To compare skill-enabled vs no-skill performance:
```
1. Prepare both runs (with --no-skill flag for baseline)
2. Execute both (parallel executors)
3. Grade both (parallel graders)
4. Blind-compare outputs
5. Report winner with analysis
```

View File

@@ -1,190 +0,0 @@
# Mode Workflow Diagrams
Visual representations of how each mode orchestrates building blocks.
## Quality Assessment (Eval Mode)
Measures how well a skill performs on its evals.
```
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Executor │────▶│ Grader │────▶│ Aggregate │
│ (N runs) │ │ (N runs) │ │ Results │
└─────────────┘ └─────────────┘ └─────────────┘
│ │ │
▼ ▼ ▼
transcript.md grading.json summary.json
user_notes.md claims[]
metrics.json user_notes_summary
```
Use for: Testing skill performance, validating skill value.
## Skill Improvement (Improve Mode)
Iteratively improves a skill through blind comparison.
```
┌─────────────────────────────────────────────────────────────────┐
│ ITERATION LOOP │
│ │
│ ┌─────────┐ ┌─────────┐ ┌────────────┐ ┌──────────┐ │
│ │Executor │──▶│ Grader │──▶│ Comparator │──▶│ Analyzer │ │
│ │(3 runs) │ │(3 runs) │ │ (blind) │ │(post-hoc)│ │
│ └─────────┘ └─────────┘ └────────────┘ └──────────┘ │
│ │ │ │ │ │
│ │ │ │ ▼ │
│ │ │ │ suggestions │
│ │ │ ▼ │ │
│ │ │ winner A/B │ │
│ │ ▼ │ │
│ │ pass_rate │ │
│ ▼ ▼ │
│ transcript ┌──────────────────┐ │
│ user_notes │ Apply to v(N+1) │ │
│ └────────┬─────────┘ │
│ │ │
└───────────────────────────────────────────────────┼─────────────┘
(repeat until
goal or timeout)
```
Use for: Optimizing skill performance, iterating on skill instructions.
## A/B Testing (with vs without skill)
Compares skill-enabled vs no-skill performance.
```
┌─────────────────┐
│ Executor │──┐
│ (with skill) │ │ ┌────────────┐ ┌─────────┐
└─────────────────┘ ├────▶│ Comparator │────▶│ Report │
┌─────────────────┐ │ │ (blind) │ │ winner │
│ Executor │──┘ └────────────┘ └─────────┘
│ (without skill) │ │
└─────────────────┘ ▼
rubric scores
expectation results
```
Use for: Proving skill adds value, measuring skill impact.
## Skill Creation (Create Mode)
Interactive skill development with user feedback.
```
┌───────────────────────────────────────────────────────────────┐
│ USER FEEDBACK LOOP │
│ │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌──────────┐ │
│ │ Interview │──▶│ Research │──▶│ Draft │──▶│ Run │ │
│ │ User │ │ via MCPs │ │ SKILL.md │ │ Example │ │
│ └───────────┘ └───────────┘ └───────────┘ └──────────┘ │
│ ▲ │ │
│ │ ▼ │
│ │ user sees │
│ │ transcript │
│ │ │ │
│ └───────────────────────────────────────────────┘ │
│ refine │
└───────────────────────────────────────────────────────────────┘
(once criteria defined,
transition to Improve)
```
Use for: Creating new skills with tight user feedback.
## Skill Benchmark (Benchmark Mode)
Standardized performance measurement with variance analysis.
```
┌──────────────────────────────────────────────────────────────────────┐
│ FOR EACH EVAL │
│ │
│ ┌─────────────────────────────────┐ ┌─────────────────────────────┐│
│ │ WITH SKILL (3x) │ │ WITHOUT SKILL (3x) ││
│ │ ┌─────────┐ ┌─────────┐ │ │ ┌─────────┐ ┌─────────┐ ││
│ │ │Executor │ │ Grader │ ───┐ │ │ │Executor │ │ Grader │───┐││
│ │ │ run 1 │→│ run 1 │ │ │ │ │ run 1 │→│ run 1 │ │││
│ │ └─────────┘ └─────────┘ │ │ │ └─────────┘ └─────────┘ │││
│ │ ┌─────────┐ ┌─────────┐ │ │ │ ┌─────────┐ ┌─────────┐ │││
│ │ │Executor │ │ Grader │ ───┼───│──│──│Executor │ │ Grader │───┼││
│ │ │ run 2 │→│ run 2 │ │ │ │ │ run 2 │→│ run 2 │ │││
│ │ └─────────┘ └─────────┘ │ │ │ └─────────┘ └─────────┘ │││
│ │ ┌─────────┐ ┌─────────┐ │ │ │ ┌─────────┐ ┌─────────┐ │││
│ │ │Executor │ │ Grader │ ───┘ │ │ │Executor │ │ Grader │───┘││
│ │ │ run 3 │→│ run 3 │ │ │ │ run 3 │→│ run 3 │ ││
│ │ └─────────┘ └─────────┘ │ │ └─────────┘ └─────────┘ ││
│ └─────────────────────────────────┘ └─────────────────────────────┘│
│ │ │ │
│ └────────┬───────────┘ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Analyzer │ │
│ │ (most capable) │ │
│ └────────┬────────┘ │
│ │ │
└───────────────────────────────────────┼──────────────────────────────┘
┌─────────────────┐
│ benchmark.json │
│ benchmark.md │
└─────────────────┘
```
Captures per-run: pass_rate, time_seconds, tokens, tool_calls, notes
Aggregates: mean, stddev, min, max for each metric across configurations
Analyzer surfaces: patterns, anomalies, and freeform observations
Use for: Understanding skill performance, comparing across models, tracking regressions.
---
## Inline Workflows (Without Subagents)
When subagents aren't available, the same building blocks execute sequentially in the main loop.
### Eval Mode (Inline)
```
┌───────────────────────────────────────────────────────┐
│ MAIN LOOP │
│ │
│ Read executor.md → Execute eval → Save outputs │
│ │ │
│ ▼ │
│ Read grader.md → Grade expectations → Save grading │
│ │ │
│ ▼ │
│ Display results │
└───────────────────────────────────────────────────────┘
```
### Improve Mode (Inline)
```
┌───────────────────────────────────────────────────────┐
│ ITERATION LOOP │
│ │
│ Read executor.md → Execute (1 run) → Save outputs │
│ │ │
│ ▼ │
│ Read grader.md → Grade expectations → Save grading │
│ │ │
│ ▼ │
│ Compare with previous best (inline, not blind) │
│ │ │
│ ▼ │
│ Analyze differences → Apply improvements to v(N+1) │
│ │ │
│ (repeat until goal or timeout) │
└───────────────────────────────────────────────────────┘
```
Benchmark mode requires subagents and has no inline equivalent.

View File

@@ -1,32 +1,6 @@
# JSON Schemas
This document defines the JSON schemas used by skill-creator-edge.
## Working with JSON Files
### Initialize a new file with correct structure
```bash
scripts/init_json.py <type> <output-path>
# Examples:
scripts/init_json.py evals evals/evals.json
scripts/init_json.py grading run-1/grading.json
scripts/init_json.py benchmark benchmarks/2026-01-15/benchmark.json
scripts/init_json.py metrics run-1/outputs/metrics.json
```
### Validate an existing file
```bash
scripts/validate_json.py <file-path> [--type <type>]
# Examples:
scripts/validate_json.py evals/evals.json
scripts/validate_json.py run-1/grading.json --type grading
```
The validator infers the type from the filename when possible.
This document defines the JSON schemas used by skill-creator.
---
@@ -224,15 +198,19 @@ Output from the executor agent. Located at `<run-dir>/outputs/metrics.json`.
Wall clock timing for a run. Located at `<run-dir>/timing.json`.
**How to capture:** When a subagent task completes, the task notification includes `total_tokens` and `duration_ms`. Save these immediately — they are not persisted anywhere else and cannot be recovered after the fact.
```json
{
  "total_tokens": 84852,
  "duration_ms": 23332,
  "executor_start": "2026-01-15T10:30:00Z",
  "executor_end": "2026-01-15T10:32:45Z",
  "executor_duration_seconds": 165.0,
  "grader_start": "2026-01-15T10:32:46Z",
  "grader_end": "2026-01-15T10:33:12Z",
  "grader_duration_seconds": 26.0,
  "total_duration_seconds": 191.0
}
```
@@ -257,6 +235,7 @@ Output from Benchmark mode. Located at `benchmarks/<timestamp>/benchmark.json`.
"runs": [
{
"eval_id": 1,
"eval_name": "Ocean",
"configuration": "with_skill",
"run_number": 1,
"result": {
@@ -308,10 +287,23 @@ Output from Benchmark mode. Located at `benchmarks/<timestamp>/benchmark.json`.
**Fields:**
- `metadata`: Information about the benchmark run
  - `skill_name`: Name of the skill
  - `timestamp`: When the benchmark was run
  - `evals_run`: List of eval names or IDs
  - `runs_per_configuration`: Number of runs per config (e.g. 3)
- `runs[]`: Individual run results
  - `eval_id`: Numeric eval identifier
  - `eval_name`: Human-readable eval name (used as section header in the viewer)
  - `configuration`: Must be `"with_skill"` or `"without_skill"` (the viewer uses this exact string for grouping and color coding)
  - `run_number`: Integer run number (1, 2, 3...)
  - `result`: Nested object with `pass_rate`, `passed`, `total`, `time_seconds`, `tokens`, `errors`
- `run_summary`: Statistical aggregates per configuration
  - `with_skill` / `without_skill`: Each contains `pass_rate`, `time_seconds`, `tokens` objects with `mean` and `stddev` fields
  - `delta`: Difference strings like `"+0.50"`, `"+13.0"`, `"+1700"`
- `notes`: Freeform observations from the analyzer
**Important:** The viewer reads these field names exactly. Using `config` instead of `configuration`, or putting `pass_rate` at the top level of a run instead of nested under `result`, will cause the viewer to show empty/zero values. Always reference this schema when generating benchmark.json manually.
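A quick pre-flight check for a hand-written benchmark.json can be sketched as follows. This covers only the pitfalls named above; it is not a full schema validator (use `scripts/validate_json.py` for that):

```python
def check_viewer_fields(benchmark):
    # Flag the field-name mistakes that make the viewer show empty/zero values.
    problems = []
    for i, run in enumerate(benchmark.get("runs", [])):
        if run.get("configuration") not in ("with_skill", "without_skill"):
            problems.append(f"runs[{i}]: 'configuration' must be exactly "
                            "'with_skill' or 'without_skill'")
        if "pass_rate" in run:
            problems.append(f"runs[{i}]: 'pass_rate' belongs inside 'result', "
                            "not at the top level of the run")
        if "pass_rate" not in run.get("result", {}):
            problems.append(f"runs[{i}]: missing 'result.pass_rate'")
    return problems

# A run using 'config' instead of 'configuration' and a top-level pass_rate
bad = {"runs": [{"eval_id": 1, "config": "with_skill", "pass_rate": 1.0}]}
```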
---
## comparison.json