chore(skill-creator): update to latest skill-creator

This commit is contained in:
Kenshiro Nakagawa
2026-02-24 17:10:46 -08:00
parent 99e11d9592
commit e05013d229
23 changed files with 3634 additions and 2847 deletions

View File

@@ -1,149 +0,0 @@
# Benchmark Mode Reference
**Requires subagents.** Benchmark mode relies on parallel execution of many independent runs. Without subagents, use Eval mode for individual eval testing instead.
Benchmark mode runs a standardized, opinionated evaluation of a skill. It answers: **"How well does this skill perform?"**
Unlike Eval mode (which runs individual evals), Benchmark mode:
- Runs **all evals** (or a user-specified subset)
- Runs each eval **3 times per configuration** (with_skill and without_skill) for variance
- Captures **metrics with variance**: pass_rate, time_seconds, tokens
- Uses the **most capable model** as analyzer to surface patterns and anomalies
- Produces **persistent, structured output** for cross-run analysis
## When to Use
- **Understanding performance**: "How does my skill perform?"
- **Cross-model comparison**: "How does this skill perform across different models?"
- **Regression detection**: Compare benchmark results over time
- **Skill validation**: Does the skill actually add value over no-skill baseline?
## Defaults
**Always include no-skill baseline.** Every benchmark runs both `with_skill` and `without_skill` configurations. This measures the value the skill adds - without a baseline, you can't know if the skill helps.
**Suggest models for comparison.** If the user wants to compare across models, suggest a couple of commonly available models in your environment. Don't hardcode model names - just recommend what's commonly used and available.
**Run on current model by default.** If the user doesn't specify, run on whatever model is currently active. For cross-model comparison, ask which models they'd like to test.
## Terminology
| Term | Definition |
|------|------------|
| **Run** | A single execution of a skill on an eval prompt |
| **Configuration** | The experimental condition: `with_skill` or `without_skill` |
| **RunResult** | Graded output of a run: expectations, metrics, notes |
| **Run Summary** | Statistical aggregates across runs: mean, stddev, min, max |
| **Notes** | Freeform observations from the analyzer |
## Workflow
```
1. Setup
→ Choose workspace location (ask user, suggest <skill>-workspace/)
→ Verify evals exist
→ Determine which evals to run (all by default, or user subset)
2. Execute runs (parallel where possible)
→ For each eval:
→ 3 runs with_skill configuration
→ 3 runs without_skill configuration
→ Each run captures: transcript, outputs, metrics
→ Coordinator extracts from Task result: total_tokens, tool_uses, duration_ms
3. Grade runs (parallel)
→ Spawn grader for each run
→ Produces: expectations with pass/fail, notes
4. Aggregate results
→ Calculate run_summary per configuration:
→ pass_rate: mean, stddev, min, max
→ time_seconds: mean, stddev, min, max
→ tokens: mean, stddev, min, max
→ Calculate delta between configurations
5. Analyze (most capable model)
→ Review all results
→ Surface patterns, anomalies, observations as freeform notes
→ Examples:
- "Assertion X passes 100% in both configurations - may not differentiate skill value"
- "Eval 3 shows high variance (50% ± 40%) - may be flaky"
- "Skill adds 13s average time but improves pass rate by 50%"
6. Generate benchmark
→ benchmark.json - Structured data for analysis
→ benchmark.md - Human-readable summary
```
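The aggregation in step 4 can be sketched as follows. This is a minimal illustration, not the actual `aggregate_benchmark.py`, and the sample pass rates are made up:

```python
import statistics

def summarize(values):
    # run_summary statistics for one metric across runs
    return {
        "mean": round(statistics.mean(values), 4),
        "stddev": round(statistics.stdev(values), 4) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }

# Illustrative pass rates from three runs of one eval per configuration
with_skill = summarize([1.0, 0.5, 1.0])
without_skill = summarize([0.5, 0.5, 0.0])
delta = round(with_skill["mean"] - without_skill["mean"], 4)
```

The same helper applies unchanged to `time_seconds` and `tokens`; only the input lists differ.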
## Spawning Executors
Run executor subagents in the background for parallelism. When each agent completes, capture the execution metrics (tokens consumed, tool calls, duration) from the completion notification.
For example, in Claude Code, background subagents deliver a `<task-notification>` with a `<usage>` block:
```xml
<!-- Example: Claude Code task notification format -->
<task-notification>
<task-id>...</task-id>
<status>completed</status>
<result>agent's actual text output</result>
<usage>total_tokens: 3700
tool_uses: 2
duration_ms: 32400</usage>
</task-notification>
```
Extract from each completed executor's metrics:
- **total_tokens** = total tokens consumed (input + output combined)
- **tool_uses** = number of tool calls made
- **duration_ms** / 1000 = execution time in seconds
The exact format of completion notifications varies by environment — look for token counts, tool call counts, and duration in whatever format your environment provides.
Record these per-run metrics alongside the grading results. The aggregate script can then compute mean/stddev/min/max across runs for each configuration.
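Extracting these fields from a completion notification can be sketched like this. The regex-based parsing is an assumption; adapt it to whatever format your environment actually emits:

```python
import re

def parse_usage(notification_text):
    """Pull total_tokens, tool_uses, and duration_ms out of a
    completion notification's usage block."""
    metrics = {}
    for key in ("total_tokens", "tool_uses", "duration_ms"):
        match = re.search(rf"{key}:\s*(\d+)", notification_text)
        if match:
            metrics[key] = int(match.group(1))
    # Convert to the seconds value used in run_summary
    if "duration_ms" in metrics:
        metrics["time_seconds"] = metrics["duration_ms"] / 1000
    return metrics

usage = parse_usage("total_tokens: 3700\ntool_uses: 2\nduration_ms: 32400")
```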
## Scripts
Use these scripts at specific points in the workflow:
### After Grading (Step 4: Aggregate)
```bash
# Aggregate all grading.json files into benchmark summary
scripts/aggregate_benchmark.py <benchmark-dir> --skill-name <name> --skill-path <path>
```
This reads `grading.json` from each run directory and produces:
- `benchmark.json` - Structured results with run_summary statistics
- `benchmark.md` - Human-readable summary table
### Validation
```bash
# Validate benchmark output
scripts/validate_json.py <benchmark-dir>/benchmark.json
# Validate individual grading files
scripts/validate_json.py <run-dir>/grading.json --type grading
```
### Initialize Templates
```bash
# Create empty benchmark.json with correct structure (if not using aggregate script)
scripts/init_json.py benchmark <benchmark-dir>/benchmark.json
```
## Analyzer Instructions
The analyzer (always most capable) reviews all results and generates freeform notes. See `agents/analyzer.md` for the full prompt, but key responsibilities:
1. **Compare configurations**: Which performed better? By how much?
2. **Identify patterns**: Assertions that always pass/fail, high variance, etc.
3. **Surface anomalies**: Unexpected results, broken evals, regressions
4. **Provide context**: Why might these patterns exist?
The analyzer should NOT:
- Suggest improvements to the skill (that's Improve mode)
- Make subjective quality judgments beyond the data
- Speculate without evidence

View File

@@ -1,144 +0,0 @@
# Eval Mode Reference
Eval mode runs skill evals and grades expectations, enabling you to measure skill performance, compare runs with and without the skill, and validate that the skill adds value.
## Purpose
Evals serve to:
1. **Set a floor** - Prove the skill helps Claude do something it couldn't by default
2. **Raise the ceiling** - Enable iterating on skills to improve performance
3. **Measure holistically** - Capture metrics beyond pass/fail (time, tokens)
4. **Understand cross-model behavior** - Test skills across different models
## Eval Workflow
```
0. Choose Workspace Location
→ Ask user where to put workspace, suggest sensible default
1. Check Dependencies
→ Scan skill for dependencies, confirm availability with user
2. Prepare (scripts/prepare_eval.py)
   → Create task, copy skill, stage files
3. Execute (agents/executor.md)
→ Update task to implementing, spawn executor sub-agent
→ Executor reads skill, runs prompt, saves transcript
4. Grade (agents/grader.md)
→ Update task to reviewing, spawn grader sub-agent
→ Grader reads transcript + outputs, evaluates expectations
5. Complete task, display results
→ Pass/fail per expectation, overall pass rate, metrics
```
## Step 0: Setup
**Before running any evals, read the output schemas:**
```bash
# Read to understand the JSON structures you'll produce
Read references/schemas.md
```
This ensures you know the expected structure for:
- `grading.json` - What the grader produces
- `metrics.json` - What the executor produces
- `timing.json` - Wall clock timing format
**Choose workspace location:**
1. **Suggest default**: `<skill-name>-workspace/` as a sibling to the skill directory
2. **Ask the user** using AskUserQuestion — if the workspace is inside a git repo, suggest adding it to `.gitignore`
3. **Create the workspace directory** once confirmed
## Step 1: Check Dependencies
Before running evals, scan the skill for dependencies:
1. Read SKILL.md (including `compatibility` frontmatter field)
2. Check referenced scripts for required tools
3. Present to user and confirm availability
## Step 2: Prepare and Create Task
Run prepare script and create task:
```bash
scripts/prepare_eval.py <skill-path> <eval-id> --output-dir <workspace>/eval-<id>/
```
```python
task = TaskCreate(
    subject=f"Eval {eval_id}"
)
TaskUpdate(task, status="planning")
```
## Step 3: Execute
Update task to `implementing` and run the executor:
```bash
echo "{\"executor_start\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"}" > <run-dir>/timing.json
```
**With subagents**: Spawn an executor subagent with these instructions:
```
Read agents/executor.md at: <skill-creator-path>/agents/executor.md
Execute this eval:
- Skill path: <workspace>/skill/
- Prompt: <eval prompt from eval_metadata.json>
- Input files: <workspace>/eval-<id>/inputs/
- Save transcript to: <workspace>/eval-<id>/transcript.md
- Save outputs to: <workspace>/eval-<id>/outputs/
```
**Without subagents**: Read `agents/executor.md` and follow the procedure directly — execute the eval, save the transcript, and produce outputs inline.
After execution completes, update timing.json with executor_end and duration.
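One way to record a phase's end time and duration is sketched below. The `finish_phase` helper and its timestamp format are assumptions, chosen to match the `date -u` command shown in Step 3:

```python
import json
from datetime import datetime, timezone

ISO = "%Y-%m-%dT%H:%M:%SZ"

def finish_phase(timing_path, phase, now=None):
    # Record <phase>_end and <phase>_duration_seconds, assuming
    # <phase>_start was written when the phase began.
    now = now or datetime.now(timezone.utc)
    with open(timing_path) as f:
        timing = json.load(f)
    start = datetime.strptime(timing[f"{phase}_start"], ISO).replace(
        tzinfo=timezone.utc)
    timing[f"{phase}_end"] = now.strftime(ISO)
    timing[f"{phase}_duration_seconds"] = round((now - start).total_seconds(), 1)
    with open(timing_path, "w") as f:
        json.dump(timing, f, indent=2)
```

Call it once after the executor finishes (`finish_phase(path, "executor")`) and again after the grader.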
## Step 4: Grade
Update task to `reviewing` and run the grader:
**With subagents**: Spawn a grader subagent with these instructions:
```
Read agents/grader.md at: <skill-creator-path>/agents/grader.md
Grade these expectations:
- Assertions: <list from eval_metadata.json>
- Transcript: <workspace>/eval-<id>/transcript.md
- Outputs: <workspace>/eval-<id>/outputs/
- Save grading to: <workspace>/eval-<id>/grading.json
```
**Without subagents**: Read `agents/grader.md` and follow the procedure directly — evaluate expectations against the transcript and outputs, then save grading.json.
After grading completes, finalize timing.json.
## Step 5: Display Results
Update task to `completed`. Display:
- Pass/fail status for each expectation with evidence
- Overall pass rate
- Execution metrics from grading.json
- Wall clock time from timing.json
- **User notes summary**: Uncertainties, workarounds, and suggestions from the executor (may reveal issues even when expectations pass)
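Rendering the per-expectation display can be sketched as below. The grading.json field names used here (`expectations`, `passed`, `text`) are assumptions; check them against `references/schemas.md` before relying on them:

```python
import json

def display_results(grading_path):
    # Print pass/fail per expectation plus the overall pass rate.
    with open(grading_path) as f:
        grading = json.load(f)
    expectations = grading.get("expectations", [])
    for exp in expectations:
        status = "PASS" if exp.get("passed") else "FAIL"
        print(f"[{status}] {exp.get('text', '')}")
    if not expectations:
        return None
    rate = sum(1 for e in expectations if e.get("passed")) / len(expectations)
    print(f"Pass rate: {rate:.0%}")
    return rate
```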
## Comparison Workflow
To compare skill-enabled vs no-skill performance:
```
1. Prepare both runs (with --no-skill flag for baseline)
2. Execute both (parallel executors)
3. Grade both (parallel graders)
4. Blind-compare outputs
5. Report winner with analysis
```

View File

@@ -1,190 +0,0 @@
# Mode Workflow Diagrams
Visual representations of how each mode orchestrates building blocks.
## Quality Assessment (Eval Mode)
Measures how well a skill performs on its evals.
```
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Executor │────▶│ Grader │────▶│ Aggregate │
│ (N runs) │ │ (N runs) │ │ Results │
└─────────────┘ └─────────────┘ └─────────────┘
│ │ │
▼ ▼ ▼
transcript.md grading.json summary.json
user_notes.md claims[]
metrics.json user_notes_summary
```
Use for: Testing skill performance, validating skill value.
## Skill Improvement (Improve Mode)
Iteratively improves a skill through blind comparison.
```
┌─────────────────────────────────────────────────────────────────┐
│ ITERATION LOOP │
│ │
│ ┌─────────┐ ┌─────────┐ ┌────────────┐ ┌──────────┐ │
│ │Executor │──▶│ Grader │──▶│ Comparator │──▶│ Analyzer │ │
│ │(3 runs) │ │(3 runs) │ │ (blind) │ │(post-hoc)│ │
│ └─────────┘ └─────────┘ └────────────┘ └──────────┘ │
│ │ │ │ │ │
│ │ │ │ ▼ │
│ │ │ │ suggestions │
│ │ │ ▼ │ │
│ │ │ winner A/B │ │
│ │ ▼ │ │
│ │ pass_rate │ │
│ ▼ ▼ │
│ transcript ┌──────────────────┐ │
│ user_notes │ Apply to v(N+1) │ │
│ └────────┬─────────┘ │
│ │ │
└───────────────────────────────────────────────────┼─────────────┘
(repeat until
goal or timeout)
```
Use for: Optimizing skill performance, iterating on skill instructions.
## A/B Testing (with vs without skill)
Compares skill-enabled vs no-skill performance.
```
┌─────────────────┐
│ Executor │──┐
│ (with skill) │ │ ┌────────────┐ ┌─────────┐
└─────────────────┘ ├────▶│ Comparator │────▶│ Report │
┌─────────────────┐ │ │ (blind) │ │ winner │
│ Executor │──┘ └────────────┘ └─────────┘
│ (without skill) │ │
└─────────────────┘ ▼
rubric scores
expectation results
```
Use for: Proving skill adds value, measuring skill impact.
## Skill Creation (Create Mode)
Interactive skill development with user feedback.
```
┌───────────────────────────────────────────────────────────────┐
│ USER FEEDBACK LOOP │
│ │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌──────────┐ │
│ │ Interview │──▶│ Research │──▶│ Draft │──▶│ Run │ │
│ │ User │ │ via MCPs │ │ SKILL.md │ │ Example │ │
│ └───────────┘ └───────────┘ └───────────┘ └──────────┘ │
│ ▲ │ │
│ │ ▼ │
│ │ user sees │
│ │ transcript │
│ │ │ │
│ └───────────────────────────────────────────────┘ │
│ refine │
└───────────────────────────────────────────────────────────────┘
(once criteria defined,
transition to Improve)
```
Use for: Creating new skills with tight user feedback.
## Skill Benchmark (Benchmark Mode)
Standardized performance measurement with variance analysis.
```
┌──────────────────────────────────────────────────────────────────────┐
│ FOR EACH EVAL │
│ │
│ ┌─────────────────────────────────┐ ┌─────────────────────────────┐│
│ │ WITH SKILL (3x) │ │ WITHOUT SKILL (3x) ││
│ │ ┌─────────┐ ┌─────────┐ │ │ ┌─────────┐ ┌─────────┐ ││
│ │ │Executor │ │ Grader │ ───┐ │ │ │Executor │ │ Grader │───┐││
│ │ │ run 1 │→│ run 1 │ │ │ │ │ run 1 │→│ run 1 │ │││
│ │ └─────────┘ └─────────┘ │ │ │ └─────────┘ └─────────┘ │││
│ │ ┌─────────┐ ┌─────────┐ │ │ │ ┌─────────┐ ┌─────────┐ │││
│ │ │Executor │ │ Grader │ ───┼───│──│──│Executor │ │ Grader │───┼││
│ │ │ run 2 │→│ run 2 │ │ │ │ │ run 2 │→│ run 2 │ │││
│ │ └─────────┘ └─────────┘ │ │ │ └─────────┘ └─────────┘ │││
│ │ ┌─────────┐ ┌─────────┐ │ │ │ ┌─────────┐ ┌─────────┐ │││
│ │ │Executor │ │ Grader │ ───┘ │ │ │Executor │ │ Grader │───┘││
│ │ │ run 3 │→│ run 3 │ │ │ │ run 3 │→│ run 3 │ ││
│ │ └─────────┘ └─────────┘ │ │ └─────────┘ └─────────┘ ││
│ └─────────────────────────────────┘ └─────────────────────────────┘│
│ │ │ │
│ └────────┬───────────┘ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Analyzer │ │
│ │ (most capable) │ │
│ └────────┬────────┘ │
│ │ │
└───────────────────────────────────────┼──────────────────────────────┘
┌─────────────────┐
│ benchmark.json │
│ benchmark.md │
└─────────────────┘
```
Captures per-run: pass_rate, time_seconds, tokens, tool_calls, notes
Aggregates: mean, stddev, min, max for each metric across configurations
Analyzer surfaces: patterns, anomalies, and freeform observations
Use for: Understanding skill performance, comparing across models, tracking regressions.
---
## Inline Workflows (Without Subagents)
When subagents aren't available, the same building blocks execute sequentially in the main loop.
### Eval Mode (Inline)
```
┌───────────────────────────────────────────────────────┐
│ MAIN LOOP │
│ │
│ Read executor.md → Execute eval → Save outputs │
│ │ │
│ ▼ │
│ Read grader.md → Grade expectations → Save grading │
│ │ │
│ ▼ │
│ Display results │
└───────────────────────────────────────────────────────┘
```
### Improve Mode (Inline)
```
┌───────────────────────────────────────────────────────┐
│ ITERATION LOOP │
│ │
│ Read executor.md → Execute (1 run) → Save outputs │
│ │ │
│ ▼ │
│ Read grader.md → Grade expectations → Save grading │
│ │ │
│ ▼ │
│ Compare with previous best (inline, not blind) │
│ │ │
│ ▼ │
│ Analyze differences → Apply improvements to v(N+1) │
│ │ │
│ (repeat until goal or timeout) │
└───────────────────────────────────────────────────────┘
```
Benchmark mode requires subagents and has no inline equivalent.

View File

@@ -1,32 +1,6 @@
# JSON Schemas
This document defines the JSON schemas used by skill-creator-edge.
## Working with JSON Files
### Initialize a new file with correct structure
```bash
scripts/init_json.py <type> <output-path>
# Examples:
scripts/init_json.py evals evals/evals.json
scripts/init_json.py grading run-1/grading.json
scripts/init_json.py benchmark benchmarks/2026-01-15/benchmark.json
scripts/init_json.py metrics run-1/outputs/metrics.json
```
### Validate an existing file
```bash
scripts/validate_json.py <file-path> [--type <type>]
# Examples:
scripts/validate_json.py evals/evals.json
scripts/validate_json.py run-1/grading.json --type grading
```
The validator infers the type from the filename when possible.
This document defines the JSON schemas used by skill-creator.
---
@@ -224,15 +198,19 @@ Output from the executor agent. Located at `<run-dir>/outputs/metrics.json`.
Wall clock timing for a run. Located at `<run-dir>/timing.json`.
**How to capture:** When a subagent task completes, the task notification includes `total_tokens` and `duration_ms`. Save these immediately — they are not persisted anywhere else and cannot be recovered after the fact.
```json
{
  "total_tokens": 84852,
  "duration_ms": 23332,
  "executor_start": "2026-01-15T10:30:00Z",
  "executor_end": "2026-01-15T10:32:45Z",
  "executor_duration_seconds": 165.0,
  "grader_start": "2026-01-15T10:32:46Z",
  "grader_end": "2026-01-15T10:33:12Z",
  "grader_duration_seconds": 26.0,
  "total_duration_seconds": 191.0
}
```
@@ -257,6 +235,7 @@ Output from Benchmark mode. Located at `benchmarks/<timestamp>/benchmark.json`.
"runs": [
{
"eval_id": 1,
"eval_name": "Ocean",
"configuration": "with_skill",
"run_number": 1,
"result": {
@@ -308,10 +287,23 @@ Output from Benchmark mode. Located at `benchmarks/<timestamp>/benchmark.json`.
**Fields:**
- `metadata`: Information about the benchmark run
  - `skill_name`: Name of the skill
  - `timestamp`: When the benchmark was run
  - `evals_run`: List of eval names or IDs
  - `runs_per_configuration`: Number of runs per config (e.g. 3)
- `runs[]`: Individual run results
  - `eval_id`: Numeric eval identifier
  - `eval_name`: Human-readable eval name (used as section header in the viewer)
  - `configuration`: Must be `"with_skill"` or `"without_skill"` (the viewer uses this exact string for grouping and color coding)
  - `run_number`: Integer run number (1, 2, 3...)
  - `result`: Nested object with `pass_rate`, `passed`, `total`, `time_seconds`, `tokens`, `errors`
- `run_summary`: Statistical aggregates per configuration
  - `with_skill` / `without_skill`: Each contains `pass_rate`, `time_seconds`, `tokens` objects with `mean` and `stddev` fields
  - `delta`: Difference strings like `"+0.50"`, `"+13.0"`, `"+1700"`
- `notes`: Freeform observations from the analyzer
**Important:** The viewer reads these field names exactly. Using `config` instead of `configuration`, or putting `pass_rate` at the top level of a run instead of nested under `result`, will cause the viewer to show empty/zero values. Always reference this schema when generating benchmark.json manually.
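A quick pre-flight check for a hand-written benchmark.json can be sketched as follows. This covers only the pitfalls named above; it is not a full schema validator (use `scripts/validate_json.py` for that):

```python
def check_viewer_fields(benchmark):
    # Flag the field-name mistakes that make the viewer show empty/zero values.
    problems = []
    for i, run in enumerate(benchmark.get("runs", [])):
        if run.get("configuration") not in ("with_skill", "without_skill"):
            problems.append(f"runs[{i}]: 'configuration' must be exactly "
                            "'with_skill' or 'without_skill'")
        if "pass_rate" in run:
            problems.append(f"runs[{i}]: 'pass_rate' belongs inside 'result', "
                            "not at the top level of the run")
        if "pass_rate" not in run.get("result", {}):
            problems.append(f"runs[{i}]: missing 'result.pass_rate'")
    return problems

# A run using 'config' instead of 'configuration' and a top-level pass_rate
bad = {"runs": [{"eval_id": 1, "config": "with_skill", "pass_rate": 1.0}]}
```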
---
## comparison.json