mirror of
https://github.com/anthropics/claude-plugins-official.git
synced 2026-03-20 11:33:08 +00:00
chore(skill-creator): update to latest skill-creator
This commit is contained in:
@@ -184,15 +184,15 @@ Use these categories to organize improvement suggestions:
|
||||
|
||||
---
|
||||
|
||||
# Benchmark Mode Analysis
|
||||
# Analyzing Benchmark Results
|
||||
|
||||
When used in Benchmark mode, the analyzer has a different purpose: **surface patterns and anomalies** across benchmark runs, not suggest skill improvements.
|
||||
When analyzing benchmark results, the analyzer's purpose is to **surface patterns and anomalies** across multiple runs, not suggest skill improvements.
|
||||
|
||||
## Benchmark Role
|
||||
## Role
|
||||
|
||||
Review all benchmark run results and generate freeform notes that help the user understand skill performance. Focus on patterns that wouldn't be visible from aggregate metrics alone.
|
||||
|
||||
## Benchmark Inputs
|
||||
## Inputs
|
||||
|
||||
You receive these parameters in your prompt:
|
||||
|
||||
@@ -200,7 +200,7 @@ You receive these parameters in your prompt:
|
||||
- **skill_path**: Path to the skill being benchmarked
|
||||
- **output_path**: Where to save the notes (as JSON array of strings)
|
||||
|
||||
## Benchmark Process
|
||||
## Process
|
||||
|
||||
### Step 1: Read Benchmark Data
|
||||
|
||||
@@ -259,7 +259,7 @@ Save notes to `{output_path}` as a JSON array of strings:
|
||||
]
|
||||
```
|
||||
|
||||
## Benchmark Guidelines
|
||||
## Guidelines
|
||||
|
||||
**DO:**
|
||||
- Report what you observe in the data
|
||||
@@ -268,7 +268,7 @@ Save notes to `{output_path}` as a JSON array of strings:
|
||||
- Provide context that helps interpret the numbers
|
||||
|
||||
**DO NOT:**
|
||||
- Suggest improvements to the skill (that's Improve mode, not Benchmark)
|
||||
- Suggest improvements to the skill (that's for the improvement step, not benchmarking)
|
||||
- Make subjective quality judgments ("the output was good/bad")
|
||||
- Speculate about causes without evidence
|
||||
- Repeat information already in the run_summary aggregates
|
||||
|
||||
@@ -1,181 +0,0 @@
|
||||
# Executor Agent
|
||||
|
||||
Execute an eval prompt using a skill and produce a detailed transcript.
|
||||
|
||||
## Role
|
||||
|
||||
The Executor runs a single eval case: load the skill, execute the prompt with staged input files, and document everything in a transcript. The transcript serves as evidence for the grader to evaluate expectations.
|
||||
|
||||
## Inputs
|
||||
|
||||
You receive these parameters in your prompt:
|
||||
|
||||
- **skill_path**: Path to the skill directory (contains SKILL.md and supporting files)
|
||||
- **prompt**: The eval prompt to execute
|
||||
- **input_files_dir**: Directory containing staged input files (may be empty)
|
||||
- **output_dir**: Where to save transcript and any outputs
|
||||
|
||||
## Process
|
||||
|
||||
### Step 1: Load the Skill
|
||||
|
||||
1. Read `SKILL.md` at the skill_path
|
||||
2. Read any referenced files (scripts, templates, examples)
|
||||
3. Understand what the skill enables and how to use it
|
||||
|
||||
### Step 2: Prepare Inputs
|
||||
|
||||
1. List files in input_files_dir (if any)
|
||||
2. Note file types, sizes, and purposes
|
||||
3. These are the eval's test inputs - use them as specified in the prompt
|
||||
|
||||
### Step 3: Execute the Prompt
|
||||
|
||||
1. Follow the skill's instructions to accomplish the prompt
|
||||
2. Use the staged input files as needed
|
||||
3. Make reasonable decisions when the skill doesn't specify exact behavior
|
||||
4. Handle errors gracefully and document them
|
||||
|
||||
### Step 4: Save Outputs
|
||||
|
||||
1. Save any files you create to output_dir
|
||||
2. Name files descriptively (e.g., `filled_form.pdf`, `extracted_data.json`)
|
||||
3. Note what each output file contains
|
||||
|
||||
### Step 5: Write Transcript, Metrics, and User Notes
|
||||
|
||||
Save outputs to `{output_dir}/`:
|
||||
- `transcript.md` - Detailed execution log
|
||||
- `metrics.json` - Tool usage and performance data
|
||||
- `user_notes.md` - Uncertainties and issues needing human attention
|
||||
|
||||
## Transcript Format
|
||||
|
||||
```markdown
|
||||
# Eval Execution Transcript
|
||||
|
||||
## Eval Prompt
|
||||
[The exact prompt you were given]
|
||||
|
||||
## Skill
|
||||
- Path: [skill_path]
|
||||
- Name: [skill name from frontmatter]
|
||||
- Description: [brief description]
|
||||
|
||||
## Input Files
|
||||
- [filename1]: [description/type]
|
||||
- [filename2]: [description/type]
|
||||
- (or "None provided")
|
||||
|
||||
## Execution
|
||||
|
||||
### Step 1: [Action Description]
|
||||
**Action**: [What you did]
|
||||
**Tool**: [Tool name and key parameters]
|
||||
**Result**: [What happened - success, failure, output]
|
||||
|
||||
### Step 2: [Action Description]
|
||||
[Continue for each significant action...]
|
||||
|
||||
## Output Files
|
||||
- [filename]: [description, location in output_dir]
|
||||
- (or "None created")
|
||||
|
||||
## Final Result
|
||||
[The final answer/output for the eval prompt]
|
||||
|
||||
## Issues
|
||||
- [Any errors, warnings, or unexpected behaviors]
|
||||
- (or "None")
|
||||
```
|
||||
|
||||
## User Notes Format
|
||||
|
||||
Save `{output_dir}/user_notes.md` to capture things that look reasonable but may have hidden issues:
|
||||
|
||||
```markdown
|
||||
# User Notes
|
||||
|
||||
## Uncertainty
|
||||
- [Things you're not 100% sure about]
|
||||
- [Assumptions you made that might be wrong]
|
||||
- [Data that might be stale or incomplete]
|
||||
|
||||
## Needs Human Review
|
||||
- [Sections that require domain expertise to verify]
|
||||
- [Outputs that could be misleading]
|
||||
- [Edge cases you weren't sure how to handle]
|
||||
|
||||
## Workarounds
|
||||
- [Places where the skill didn't work as expected]
|
||||
- [Alternative approaches you took]
|
||||
- [Things that should work but didn't]
|
||||
|
||||
## Suggestions
|
||||
- [Improvements to the skill that would help]
|
||||
- [Missing instructions that caused confusion]
|
||||
- [Tools or capabilities that would be useful]
|
||||
```
|
||||
|
||||
**IMPORTANT**: Always write user_notes.md, even if empty. This surfaces issues that might otherwise be buried in a "successful" execution. If everything went perfectly, write:
|
||||
|
||||
```markdown
|
||||
# User Notes
|
||||
|
||||
No uncertainties, issues, or suggestions to report. Execution completed as expected.
|
||||
```
|
||||
|
||||
## Metrics Format
|
||||
|
||||
Save `{output_dir}/metrics.json` with tool usage and output size:
|
||||
|
||||
```json
|
||||
{
|
||||
"tool_calls": {
|
||||
"Read": 5,
|
||||
"Write": 2,
|
||||
"Bash": 8,
|
||||
"Edit": 1,
|
||||
"Glob": 2,
|
||||
"Grep": 0
|
||||
},
|
||||
"total_tool_calls": 18,
|
||||
"total_steps": 6,
|
||||
"files_created": ["filled_form.pdf", "field_values.json"],
|
||||
"errors_encountered": 0,
|
||||
"output_chars": 0,
|
||||
"transcript_chars": 0
|
||||
}
|
||||
```
|
||||
|
||||
**IMPORTANT**: After writing all outputs and transcript, calculate and record character counts as a proxy for token usage:
|
||||
|
||||
```bash
|
||||
# Get transcript size
|
||||
transcript_chars=$(wc -c < "{output_dir}/transcript.md" | tr -d ' ')
|
||||
|
||||
# Get total output size (sum of all files in output_dir)
|
||||
output_chars=$(find "{output_dir}" -type f ! -name "metrics.json" -exec cat {} + 2>/dev/null | wc -c | tr -d ' ')
|
||||
|
||||
# Update metrics.json with sizes
|
||||
python3 << EOF
|
||||
import json
|
||||
with open("{output_dir}/metrics.json") as f:
|
||||
m = json.load(f)
|
||||
m["transcript_chars"] = int("$transcript_chars")
|
||||
m["output_chars"] = int("$output_chars")
|
||||
with open("{output_dir}/metrics.json", "w") as f:
|
||||
json.dump(m, f, indent=2)
|
||||
EOF
|
||||
```
|
||||
|
||||
Track every tool you call during execution. This data helps measure skill efficiency.
|
||||
|
||||
## Guidelines
|
||||
|
||||
- **Document thoroughly**: The grader will use your transcript to evaluate expectations
|
||||
- **Include tool calls**: Show what tools you used and their results
|
||||
- **Capture outputs**: Both inline results and saved files matter
|
||||
- **Be honest about issues**: Don't hide errors; document them clearly
|
||||
- **Follow the skill**: Execute as the skill instructs, not how you might do it otherwise
|
||||
- **Stay focused**: Complete the eval prompt, nothing more
|
||||
Reference in New Issue
Block a user