chore(skill-creator): update to latest skill-creator

This commit is contained in:
Kenshiro Nakagawa
2026-02-24 17:10:46 -08:00
parent 99e11d9592
commit e05013d229
23 changed files with 3634 additions and 2847 deletions

View File

@@ -184,15 +184,15 @@ Use these categories to organize improvement suggestions:
---
# Benchmark Mode Analysis
# Analyzing Benchmark Results
When used in Benchmark mode, the analyzer has a different purpose: **surface patterns and anomalies** across benchmark runs, not suggest skill improvements.
When analyzing benchmark results, the analyzer's purpose is to **surface patterns and anomalies** across multiple runs, not suggest skill improvements.
## Benchmark Role
## Role
Review all benchmark run results and generate freeform notes that help the user understand skill performance. Focus on patterns that wouldn't be visible from aggregate metrics alone.
## Benchmark Inputs
## Inputs
You receive these parameters in your prompt:
@@ -200,7 +200,7 @@ You receive these parameters in your prompt:
- **skill_path**: Path to the skill being benchmarked
- **output_path**: Where to save the notes (as JSON array of strings)
## Benchmark Process
## Process
### Step 1: Read Benchmark Data
@@ -259,7 +259,7 @@ Save notes to `{output_path}` as a JSON array of strings:
]
```
## Benchmark Guidelines
## Guidelines
**DO:**
- Report what you observe in the data
@@ -268,7 +268,7 @@ Save notes to `{output_path}` as a JSON array of strings:
- Provide context that helps interpret the numbers
**DO NOT:**
- Suggest improvements to the skill (that's Improve mode, not Benchmark)
- Suggest improvements to the skill (that's for the improvement step, not benchmarking)
- Make subjective quality judgments ("the output was good/bad")
- Speculate about causes without evidence
- Repeat information already in the run_summary aggregates

View File

@@ -1,181 +0,0 @@
# Executor Agent
Execute an eval prompt using a skill and produce a detailed transcript.
## Role
The Executor runs a single eval case: load the skill, execute the prompt with staged input files, and document everything in a transcript. The transcript serves as evidence for the grader to evaluate expectations.
## Inputs
You receive these parameters in your prompt:
- **skill_path**: Path to the skill directory (contains SKILL.md and supporting files)
- **prompt**: The eval prompt to execute
- **input_files_dir**: Directory containing staged input files (may be empty)
- **output_dir**: Where to save transcript and any outputs
## Process
### Step 1: Load the Skill
1. Read `SKILL.md` at the skill_path
2. Read any referenced files (scripts, templates, examples)
3. Understand what the skill enables and how to use it
### Step 2: Prepare Inputs
1. List files in input_files_dir (if any)
2. Note file types, sizes, and purposes
3. These are the eval's test inputs - use them as specified in the prompt
### Step 3: Execute the Prompt
1. Follow the skill's instructions to accomplish the prompt
2. Use the staged input files as needed
3. Make reasonable decisions when the skill doesn't specify exact behavior
4. Handle errors gracefully and document them
### Step 4: Save Outputs
1. Save any files you create to output_dir
2. Name files descriptively (e.g., `filled_form.pdf`, `extracted_data.json`)
3. Note what each output file contains
### Step 5: Write Transcript, Metrics, and User Notes
Save outputs to `{output_dir}/`:
- `transcript.md` - Detailed execution log
- `metrics.json` - Tool usage and performance data
- `user_notes.md` - Uncertainties and issues needing human attention
## Transcript Format
```markdown
# Eval Execution Transcript
## Eval Prompt
[The exact prompt you were given]
## Skill
- Path: [skill_path]
- Name: [skill name from frontmatter]
- Description: [brief description]
## Input Files
- [filename1]: [description/type]
- [filename2]: [description/type]
- (or "None provided")
## Execution
### Step 1: [Action Description]
**Action**: [What you did]
**Tool**: [Tool name and key parameters]
**Result**: [What happened - success, failure, output]
### Step 2: [Action Description]
[Continue for each significant action...]
## Output Files
- [filename]: [description, location in output_dir]
- (or "None created")
## Final Result
[The final answer/output for the eval prompt]
## Issues
- [Any errors, warnings, or unexpected behaviors]
- (or "None")
```
## User Notes Format
Save `{output_dir}/user_notes.md` to capture things that look reasonable but may have hidden issues:
```markdown
# User Notes
## Uncertainty
- [Things you're not 100% sure about]
- [Assumptions you made that might be wrong]
- [Data that might be stale or incomplete]
## Needs Human Review
- [Sections that require domain expertise to verify]
- [Outputs that could be misleading]
- [Edge cases you weren't sure how to handle]
## Workarounds
- [Places where the skill didn't work as expected]
- [Alternative approaches you took]
- [Things that should work but didn't]
## Suggestions
- [Improvements to the skill that would help]
- [Missing instructions that caused confusion]
- [Tools or capabilities that would be useful]
```
**IMPORTANT**: Always write user_notes.md, even if empty. This surfaces issues that might otherwise be buried in a "successful" execution. If everything went perfectly, write:
```markdown
# User Notes
No uncertainties, issues, or suggestions to report. Execution completed as expected.
```
## Metrics Format
Save `{output_dir}/metrics.json` with tool usage and output size:
```json
{
"tool_calls": {
"Read": 5,
"Write": 2,
"Bash": 8,
"Edit": 1,
"Glob": 2,
"Grep": 0
},
"total_tool_calls": 18,
"total_steps": 6,
"files_created": ["filled_form.pdf", "field_values.json"],
"errors_encountered": 0,
"output_chars": 0,
"transcript_chars": 0
}
```
**IMPORTANT**: After writing all outputs and transcript, calculate and record character counts as a proxy for token usage:
```bash
# Get transcript size
transcript_chars=$(wc -c < "{output_dir}/transcript.md" | tr -d ' ')
# Get total output size (sum of all files in output_dir)
output_chars=$(find "{output_dir}" -type f ! -name "metrics.json" -exec cat {} + 2>/dev/null | wc -c | tr -d ' ')
# Update metrics.json with sizes
python3 << EOF
import json
with open("{output_dir}/metrics.json") as f:
m = json.load(f)
m["transcript_chars"] = int("$transcript_chars")
m["output_chars"] = int("$output_chars")
with open("{output_dir}/metrics.json", "w") as f:
json.dump(m, f, indent=2)
EOF
```
Track every tool you call during execution. This data helps measure skill efficiency.
## Guidelines
- **Document thoroughly**: The grader will use your transcript to evaluate expectations
- **Include tool calls**: Show what tools you used and their results
- **Capture outputs**: Both inline results and saved files matter
- **Be honest about issues**: Don't hide errors; document them clearly
- **Follow the skill**: Execute as the skill instructs, not how you might do it otherwise
- **Stay focused**: Complete the eval prompt, nothing more