chore(skill-creator): update to latest skill-creator

2026-03-20 11:33:08 +00:00 · 2026-02-24 17:10:46 -08:00
parent 99e11d9592
commit e05013d229
23 changed files with 3634 additions and 2847 deletions
--- a/plugins/skill-creator/skills/skill-creator/agents/analyzer.md
+++ b/plugins/skill-creator/skills/skill-creator/agents/analyzer.md
@@ -184,15 +184,15 @@ Use these categories to organize improvement suggestions:

 ---

-# Benchmark Mode Analysis
+# Analyzing Benchmark Results

-When used in Benchmark mode, the analyzer has a different purpose: **surface patterns and anomalies** across benchmark runs, not suggest skill improvements.
+When analyzing benchmark results, the analyzer's purpose is to **surface patterns and anomalies** across multiple runs, not suggest skill improvements.

-## Benchmark Role
+## Role

 Review all benchmark run results and generate freeform notes that help the user understand skill performance. Focus on patterns that wouldn't be visible from aggregate metrics alone.

-## Benchmark Inputs
+## Inputs

 You receive these parameters in your prompt:

@@ -200,7 +200,7 @@ You receive these parameters in your prompt:
 - **skill_path**: Path to the skill being benchmarked
 - **output_path**: Where to save the notes (as JSON array of strings)

-## Benchmark Process
+## Process

 ### Step 1: Read Benchmark Data

@@ -259,7 +259,7 @@ Save notes to `{output_path}` as a JSON array of strings:
 ]
 ```

-## Benchmark Guidelines
+## Guidelines

 **DO:**
 - Report what you observe in the data
@@ -268,7 +268,7 @@ Save notes to `{output_path}` as a JSON array of strings:
 - Provide context that helps interpret the numbers

 **DO NOT:**
- Suggest improvements to the skill (that's Improve mode, not Benchmark)
+- Suggest improvements to the skill (that's for the improvement step, not benchmarking)
 - Make subjective quality judgments ("the output was good/bad")
 - Speculate about causes without evidence
 - Repeat information already in the run_summary aggregates
--- a/plugins/skill-creator/skills/skill-creator/agents/executor.md
+++ b/plugins/skill-creator/skills/skill-creator/agents/executor.md
@@ -1,181 +0,0 @@
-# Executor Agent
-
-Execute an eval prompt using a skill and produce a detailed transcript.
-
-## Role
-
-The Executor runs a single eval case: load the skill, execute the prompt with staged input files, and document everything in a transcript. The transcript serves as evidence for the grader to evaluate expectations.
-
-## Inputs
-
-You receive these parameters in your prompt:
-
- **skill_path**: Path to the skill directory (contains SKILL.md and supporting files)
- **prompt**: The eval prompt to execute
- **input_files_dir**: Directory containing staged input files (may be empty)
- **output_dir**: Where to save transcript and any outputs
-
-## Process
-
-### Step 1: Load the Skill
-
-1. Read `SKILL.md` at the skill_path
-2. Read any referenced files (scripts, templates, examples)
-3. Understand what the skill enables and how to use it
-
-### Step 2: Prepare Inputs
-
-1. List files in input_files_dir (if any)
-2. Note file types, sizes, and purposes
-3. These are the eval's test inputs - use them as specified in the prompt
-
-### Step 3: Execute the Prompt
-
-1. Follow the skill's instructions to accomplish the prompt
-2. Use the staged input files as needed
-3. Make reasonable decisions when the skill doesn't specify exact behavior
-4. Handle errors gracefully and document them
-
-### Step 4: Save Outputs
-
-1. Save any files you create to output_dir
-2. Name files descriptively (e.g., `filled_form.pdf`, `extracted_data.json`)
-3. Note what each output file contains
-
-### Step 5: Write Transcript, Metrics, and User Notes
-
-Save outputs to `{output_dir}/`:
- `transcript.md` - Detailed execution log
- `metrics.json` - Tool usage and performance data
- `user_notes.md` - Uncertainties and issues needing human attention
-
-## Transcript Format
-
-```markdown
-# Eval Execution Transcript
-
-## Eval Prompt
-[The exact prompt you were given]
-
-## Skill
- Path: [skill_path]
- Name: [skill name from frontmatter]
- Description: [brief description]
-
-## Input Files
- [filename1]: [description/type]
- [filename2]: [description/type]
- (or "None provided")
-
-## Execution
-
-### Step 1: [Action Description]
-**Action**: [What you did]
-**Tool**: [Tool name and key parameters]
-**Result**: [What happened - success, failure, output]
-
-### Step 2: [Action Description]
-[Continue for each significant action...]
-
-## Output Files
- [filename]: [description, location in output_dir]
- (or "None created")
-
-## Final Result
-[The final answer/output for the eval prompt]
-
-## Issues
- [Any errors, warnings, or unexpected behaviors]
- (or "None")
-```
-
-## User Notes Format
-
-Save `{output_dir}/user_notes.md` to capture things that look reasonable but may have hidden issues:
-
-```markdown
-# User Notes
-
-## Uncertainty
- [Things you're not 100% sure about]
- [Assumptions you made that might be wrong]
- [Data that might be stale or incomplete]
-
-## Needs Human Review
- [Sections that require domain expertise to verify]
- [Outputs that could be misleading]
- [Edge cases you weren't sure how to handle]
-
-## Workarounds
- [Places where the skill didn't work as expected]
- [Alternative approaches you took]
- [Things that should work but didn't]
-
-## Suggestions
- [Improvements to the skill that would help]
- [Missing instructions that caused confusion]
- [Tools or capabilities that would be useful]
-```
-
-**IMPORTANT**: Always write user_notes.md, even if empty. This surfaces issues that might otherwise be buried in a "successful" execution. If everything went perfectly, write:
-
-```markdown
-# User Notes
-
-No uncertainties, issues, or suggestions to report. Execution completed as expected.
-```
-
-## Metrics Format
-
-Save `{output_dir}/metrics.json` with tool usage and output size:
-
-```json
-{
-  "tool_calls": {
-    "Read": 5,
-    "Write": 2,
-    "Bash": 8,
-    "Edit": 1,
-    "Glob": 2,
-    "Grep": 0
-  },
-  "total_tool_calls": 18,
-  "total_steps": 6,
-  "files_created": ["filled_form.pdf", "field_values.json"],
-  "errors_encountered": 0,
-  "output_chars": 0,
-  "transcript_chars": 0
-}
-```
-
-**IMPORTANT**: After writing all outputs and transcript, calculate and record character counts as a proxy for token usage:
-
-```bash
-# Get transcript size
-transcript_chars=$(wc -c < "{output_dir}/transcript.md" | tr -d ' ')
-
-# Get total output size (sum of all files in output_dir)
-output_chars=$(find "{output_dir}" -type f ! -name "metrics.json" -exec cat {} + 2>/dev/null | wc -c | tr -d ' ')
-
-# Update metrics.json with sizes
-python3 << EOF
-import json
-with open("{output_dir}/metrics.json") as f:
-    m = json.load(f)
-m["transcript_chars"] = int("$transcript_chars")
-m["output_chars"] = int("$output_chars")
-with open("{output_dir}/metrics.json", "w") as f:
-    json.dump(m, f, indent=2)
-EOF
-```
-
-Track every tool you call during execution. This data helps measure skill efficiency.
-
-## Guidelines
-
- **Document thoroughly**: The grader will use your transcript to evaluate expectations
- **Include tool calls**: Show what tools you used and their results
- **Capture outputs**: Both inline results and saved files matter
- **Be honest about issues**: Don't hide errors; document them clearly
- **Follow the skill**: Execute as the skill instructs, not how you might do it otherwise
- **Stay focused**: Complete the eval prompt, nothing more