Add skill-creator plugin

Kenshiro Nakagawa
2026-02-17 17:02:51 -08:00
parent 261ce4fba4
commit 30975e61e3
20 changed files with 4879 additions and 0 deletions


@@ -0,0 +1,149 @@
# Benchmark Mode Reference
**Requires subagents.** Benchmark mode relies on parallel execution of many independent runs. Without subagents, use Eval mode for individual eval testing instead.
Benchmark mode runs a standardized, opinionated evaluation of a skill. It answers: **"How well does this skill perform?"**
Unlike Eval mode (which runs individual evals), Benchmark mode:
- Runs **all evals** (or a user-specified subset)
- Runs each eval **3 times per configuration** (with_skill and without_skill) for variance
- Captures **metrics with variance**: pass_rate, time_seconds, tokens
- Uses the **most capable model** as analyzer to surface patterns and anomalies
- Produces **persistent, structured output** for cross-run analysis
## When to Use
- **Understanding performance**: "How does my skill perform?"
- **Cross-model comparison**: "How does this skill perform across different models?"
- **Regression detection**: Compare benchmark results over time
- **Skill validation**: Does the skill actually add value over no-skill baseline?
## Defaults
**Always include no-skill baseline.** Every benchmark runs both `with_skill` and `without_skill` configurations. This measures the value the skill adds - without a baseline, you can't know if the skill helps.
**Suggest models for comparison.** If the user wants to compare across models, suggest a couple of commonly available models in your environment. Don't hardcode model names - just recommend what's commonly used and available.
**Run on current model by default.** If the user doesn't specify, run on whatever model is currently active. For cross-model comparison, ask which models they'd like to test.
## Terminology
| Term | Definition |
|------|------------|
| **Run** | A single execution of a skill on an eval prompt |
| **Configuration** | The experimental condition: `with_skill` or `without_skill` |
| **RunResult** | Graded output of a run: expectations, metrics, notes |
| **Run Summary** | Statistical aggregates across runs: mean, stddev, min, max |
| **Notes** | Freeform observations from the analyzer |
## Workflow
```
1. Setup
→ Choose workspace location (ask user, suggest <skill>-workspace/)
→ Verify evals exist
→ Determine which evals to run (all by default, or user subset)
2. Execute runs (parallel where possible)
→ For each eval:
→ 3 runs with_skill configuration
→ 3 runs without_skill configuration
→ Each run captures: transcript, outputs, metrics
→ Coordinator extracts from Task result: total_tokens, tool_uses, duration_ms
3. Grade runs (parallel)
→ Spawn grader for each run
→ Produces: expectations with pass/fail, notes
4. Aggregate results
→ Calculate run_summary per configuration:
→ pass_rate: mean, stddev, min, max
→ time_seconds: mean, stddev, min, max
→ tokens: mean, stddev, min, max
→ Calculate delta between configurations
5. Analyze (most capable model)
→ Review all results
→ Surface patterns, anomalies, observations as freeform notes
→ Examples:
- "Assertion X passes 100% in both configurations - may not differentiate skill value"
- "Eval 3 shows high variance (50% ± 40%) - may be flaky"
- "Skill adds 13s average time but improves pass rate by 50%"
6. Generate benchmark
→ benchmark.json - Structured data for analysis
→ benchmark.md - Human-readable summary
```
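The aggregation in step 4 can be sketched with the standard library (a minimal illustration, not the actual `aggregate_benchmark.py`; the metric names mirror the `run_summary` schema and the sample values are made up):

```python
import statistics

def summarize(values):
    """Compute the mean/stddev/min/max aggregate used in run_summary."""
    return {
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }

def run_summary(runs):
    """Aggregate per-run metrics for one configuration."""
    return {
        metric: summarize([r[metric] for r in runs])
        for metric in ("pass_rate", "time_seconds", "tokens")
    }

# Three hypothetical with_skill runs of one eval
with_skill = [
    {"pass_rate": 0.80, "time_seconds": 40.0, "tokens": 3600},
    {"pass_rate": 0.90, "time_seconds": 50.0, "tokens": 4000},
    {"pass_rate": 0.85, "time_seconds": 45.0, "tokens": 3800},
]
summary = run_summary(with_skill)
```

The same function runs over the `without_skill` runs, and the two summaries feed the delta calculation in step 4.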
## Spawning Executors
Run executor subagents in the background for parallelism. When each agent completes, capture the execution metrics (tokens consumed, tool calls, duration) from the completion notification.
For example, in Claude Code, background subagents deliver a `<task-notification>` with a `<usage>` block:
```xml
<!-- Example: Claude Code task notification format -->
<task-notification>
<task-id>...</task-id>
<status>completed</status>
<result>agent's actual text output</result>
<usage>total_tokens: 3700
tool_uses: 2
duration_ms: 32400</usage>
</task-notification>
```
Extract from each completed executor's metrics:
- **total_tokens** = total tokens consumed (input + output combined)
- **tool_uses** = number of tool calls made
- **duration_ms** / 1000 = execution time in seconds
The exact format of completion notifications varies by environment — look for token counts, tool call counts, and duration in whatever format your environment provides.
Record these per-run metrics alongside the grading results. The aggregate script can then compute mean/stddev/min/max across runs for each configuration.
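Extraction from a `key: value` usage block can be sketched as follows (hedged: the notification format varies by environment, so the regex here assumes the layout shown in the example above):

```python
import re

def parse_usage(usage_text):
    """Pull known metric fields out of a 'key: value' usage block."""
    metrics = {}
    for key in ("total_tokens", "tool_uses", "duration_ms"):
        m = re.search(rf"{key}:\s*(\d+)", usage_text)
        if m:
            metrics[key] = int(m.group(1))
    # Convert duration to seconds for the per-run record
    if "duration_ms" in metrics:
        metrics["time_seconds"] = metrics["duration_ms"] / 1000
    return metrics

usage = "total_tokens: 3700\ntool_uses: 2\nduration_ms: 32400"
parsed = parse_usage(usage)
```

Missing keys are simply omitted rather than defaulted, so downstream aggregation can distinguish "not reported" from zero.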
## Scripts
Use these scripts at specific points in the workflow:
### After Grading (Step 4: Aggregate)
```bash
# Aggregate all grading.json files into benchmark summary
scripts/aggregate_benchmark.py <benchmark-dir> --skill-name <name> --skill-path <path>
```
This reads `grading.json` from each run directory and produces:
- `benchmark.json` - Structured results with run_summary statistics
- `benchmark.md` - Human-readable summary table
### Validation
```bash
# Validate benchmark output
scripts/validate_json.py <benchmark-dir>/benchmark.json
# Validate individual grading files
scripts/validate_json.py <run-dir>/grading.json --type grading
```
### Initialize Templates
```bash
# Create empty benchmark.json with correct structure (if not using aggregate script)
scripts/init_json.py benchmark <benchmark-dir>/benchmark.json
```
## Analyzer Instructions
The analyzer (always most capable) reviews all results and generates freeform notes. See `agents/analyzer.md` for the full prompt, but key responsibilities:
1. **Compare configurations**: Which performed better? By how much?
2. **Identify patterns**: Assertions that always pass/fail, high variance, etc.
3. **Surface anomalies**: Unexpected results, broken evals, regressions
4. **Provide context**: Why might these patterns exist?
The analyzer should NOT:
- Suggest improvements to the skill (that's Improve mode)
- Make subjective quality judgments beyond the data
- Speculate without evidence


@@ -0,0 +1,144 @@
# Eval Mode Reference
Eval mode runs skill evals and grades expectations. It enables measuring skill performance, comparing with-skill and without-skill runs, and validating that skills add value.
## Purpose
Evals serve to:
1. **Set a floor** - Prove the skill helps Claude do something it couldn't by default
2. **Raise the ceiling** - Enable iterating on skills to improve performance
3. **Measure holistically** - Capture metrics beyond pass/fail (time, tokens)
4. **Understand cross-model behavior** - Test skills across different models
## Eval Workflow
```
0. Choose Workspace Location
→ Ask user where to put workspace, suggest sensible default
1. Check Dependencies
→ Scan skill for dependencies, confirm availability with user
2. Prepare (scripts/prepare_eval.py)
→ Creates task, copies skill, stages files
3. Execute (agents/executor.md)
→ Update task to implementing, spawn executor sub-agent
→ Executor reads skill, runs prompt, saves transcript
4. Grade (agents/grader.md)
→ Update task to reviewing, spawn grader sub-agent
→ Grader reads transcript + outputs, evaluates expectations
5. Complete task, display results
→ Pass/fail per expectation, overall pass rate, metrics
```
## Step 0: Setup
**Before running any evals, read the output schemas:**
```bash
# Read to understand the JSON structures you'll produce
Read references/schemas.md
```
This ensures you know the expected structure for:
- `grading.json` - What the grader produces
- `metrics.json` - What the executor produces
- `timing.json` - Wall clock timing format
**Choose workspace location:**
1. **Suggest default**: `<skill-name>-workspace/` as a sibling to the skill directory
2. **Ask the user** using AskUserQuestion — if the workspace is inside a git repo, suggest adding it to `.gitignore`
3. **Create the workspace directory** once confirmed
## Step 1: Check Dependencies
Before running evals, scan the skill for dependencies:
1. Read SKILL.md (including `compatibility` frontmatter field)
2. Check referenced scripts for required tools
3. Present to user and confirm availability
## Step 2: Prepare and Create Task
Run prepare script and create task:
```bash
scripts/prepare_eval.py <skill-path> <eval-id> --output-dir <workspace>/eval-<id>/
```
```python
task = TaskCreate(
subject=f"Eval {eval_id}"
)
TaskUpdate(task, status="planning")
```
## Step 3: Execute
Update task to `implementing` and run the executor:
```bash
echo "{\"executor_start\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"}" > <run-dir>/timing.json
```
**With subagents**: Spawn an executor subagent with these instructions:
```
Read agents/executor.md at: <skill-creator-path>/agents/executor.md
Execute this eval:
- Skill path: <workspace>/skill/
- Prompt: <eval prompt from eval_metadata.json>
- Input files: <workspace>/eval-<id>/inputs/
- Save transcript to: <workspace>/eval-<id>/transcript.md
- Save outputs to: <workspace>/eval-<id>/outputs/
```
**Without subagents**: Read `agents/executor.md` and follow the procedure directly — execute the eval, save the transcript, and produce outputs inline.
After execution completes, update timing.json with executor_end and duration.
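The timing update can be sketched as below (an illustrative helper, not a provided script; it assumes the ISO-8601 timestamp format shown in the `echo` command above, and the throwaway temp file stands in for `<run-dir>/timing.json`):

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def finish_executor_timing(timing_path):
    """Record executor_end and compute executor_duration_seconds."""
    timing = json.loads(Path(timing_path).read_text())
    start = datetime.fromisoformat(timing["executor_start"].replace("Z", "+00:00"))
    end = datetime.now(timezone.utc)
    timing["executor_end"] = end.strftime("%Y-%m-%dT%H:%M:%SZ")
    timing["executor_duration_seconds"] = round((end - start).total_seconds(), 1)
    Path(timing_path).write_text(json.dumps(timing, indent=2))
    return timing

# Demo with a throwaway timing file
start = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"executor_start": start}, f)
timing = finish_executor_timing(f.name)
```

The same pattern applies after grading: add `grader_start`/`grader_end` and sum the two durations into `total_duration_seconds`.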
## Step 4: Grade
Update task to `reviewing` and run the grader:
**With subagents**: Spawn a grader subagent with these instructions:
```
Read agents/grader.md at: <skill-creator-path>/agents/grader.md
Grade these expectations:
- Assertions: <list from eval_metadata.json>
- Transcript: <workspace>/eval-<id>/transcript.md
- Outputs: <workspace>/eval-<id>/outputs/
- Save grading to: <workspace>/eval-<id>/grading.json
```
**Without subagents**: Read `agents/grader.md` and follow the procedure directly — evaluate expectations against the transcript and outputs, then save grading.json.
After grading completes, finalize timing.json.
## Step 5: Display Results
Update task to `completed`. Display:
- Pass/fail status for each expectation with evidence
- Overall pass rate
- Execution metrics from grading.json
- Wall clock time from timing.json
- **User notes summary**: Uncertainties, workarounds, and suggestions from the executor (may reveal issues even when expectations pass)
## Comparison Workflow
To compare skill-enabled vs no-skill performance:
```
1. Prepare both runs (with --no-skill flag for baseline)
2. Execute both (parallel executors)
3. Grade both (parallel graders)
4. Blind-compare outputs
5. Report winner with analysis
```


@@ -0,0 +1,190 @@
# Mode Workflow Diagrams
Visual representations of how each mode orchestrates building blocks.
## Quality Assessment (Eval Mode)
Measures how well a skill performs on its evals.
```
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Executor │────▶│ Grader │────▶│ Aggregate │
│ (N runs) │ │ (N runs) │ │ Results │
└─────────────┘ └─────────────┘ └─────────────┘
│ │ │
▼ ▼ ▼
transcript.md grading.json summary.json
user_notes.md claims[]
metrics.json user_notes_summary
```
Use for: Testing skill performance, validating skill value.
## Skill Improvement (Improve Mode)
Iteratively improves a skill through blind comparison.
```
┌─────────────────────────────────────────────────────────────────┐
│ ITERATION LOOP │
│ │
│ ┌─────────┐ ┌─────────┐ ┌────────────┐ ┌──────────┐ │
│ │Executor │──▶│ Grader │──▶│ Comparator │──▶│ Analyzer │ │
│ │(3 runs) │ │(3 runs) │ │ (blind) │ │(post-hoc)│ │
│ └─────────┘ └─────────┘ └────────────┘ └──────────┘ │
│ │ │ │ │ │
│ │ │ │ ▼ │
│ │ │ │ suggestions │
│ │ │ ▼ │ │
│ │ │ winner A/B │ │
│ │ ▼ │ │
│ │ pass_rate │ │
│ ▼ ▼ │
│ transcript ┌──────────────────┐ │
│ user_notes │ Apply to v(N+1) │ │
│ └────────┬─────────┘ │
│ │ │
└───────────────────────────────────────────────────┼─────────────┘
(repeat until
goal or timeout)
```
Use for: Optimizing skill performance, iterating on skill instructions.
## A/B Testing (with vs without skill)
Compares skill-enabled vs no-skill performance.
```
┌─────────────────┐
│ Executor │──┐
│ (with skill) │ │ ┌────────────┐ ┌─────────┐
└─────────────────┘ ├────▶│ Comparator │────▶│ Report │
┌─────────────────┐ │ │ (blind) │ │ winner │
│ Executor │──┘ └────────────┘ └─────────┘
│ (without skill) │ │
└─────────────────┘ ▼
rubric scores
expectation results
```
Use for: Proving skill adds value, measuring skill impact.
## Skill Creation (Create Mode)
Interactive skill development with user feedback.
```
┌───────────────────────────────────────────────────────────────┐
│ USER FEEDBACK LOOP │
│ │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌──────────┐ │
│ │ Interview │──▶│ Research │──▶│ Draft │──▶│ Run │ │
│ │ User │ │ via MCPs │ │ SKILL.md │ │ Example │ │
│ └───────────┘ └───────────┘ └───────────┘ └──────────┘ │
│ ▲ │ │
│ │ ▼ │
│ │ user sees │
│ │ transcript │
│ │ │ │
│ └───────────────────────────────────────────────┘ │
│ refine │
└───────────────────────────────────────────────────────────────┘
(once criteria defined,
transition to Improve)
```
Use for: Creating new skills with tight user feedback.
## Skill Benchmark (Benchmark Mode)
Standardized performance measurement with variance analysis.
```
┌──────────────────────────────────────────────────────────────────────┐
│ FOR EACH EVAL │
│ │
│ ┌─────────────────────────────────┐ ┌─────────────────────────────┐│
│ │ WITH SKILL (3x) │ │ WITHOUT SKILL (3x) ││
│ │ ┌─────────┐ ┌─────────┐ │ │ ┌─────────┐ ┌─────────┐ ││
│ │ │Executor │ │ Grader │ ───┐ │ │ │Executor │ │ Grader │───┐││
│ │ │ run 1 │→│ run 1 │ │ │ │ │ run 1 │→│ run 1 │ │││
│ │ └─────────┘ └─────────┘ │ │ │ └─────────┘ └─────────┘ │││
│ │ ┌─────────┐ ┌─────────┐ │ │ │ ┌─────────┐ ┌─────────┐ │││
│ │ │Executor │ │ Grader │ ───┼───│──│──│Executor │ │ Grader │───┼││
│ │ │ run 2 │→│ run 2 │ │ │ │ │ run 2 │→│ run 2 │ │││
│ │ └─────────┘ └─────────┘ │ │ │ └─────────┘ └─────────┘ │││
│ │ ┌─────────┐ ┌─────────┐ │ │ │ ┌─────────┐ ┌─────────┐ │││
│ │ │Executor │ │ Grader │ ───┘ │ │ │Executor │ │ Grader │───┘││
│ │ │ run 3 │→│ run 3 │ │ │ │ run 3 │→│ run 3 │ ││
│ │ └─────────┘ └─────────┘ │ │ └─────────┘ └─────────┘ ││
│ └─────────────────────────────────┘ └─────────────────────────────┘│
│ │ │ │
│ └────────┬───────────┘ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Analyzer │ │
│ │ (most capable) │ │
│ └────────┬────────┘ │
│ │ │
└───────────────────────────────────────┼──────────────────────────────┘
┌─────────────────┐
│ benchmark.json │
│ benchmark.md │
└─────────────────┘
```
Captures per-run: pass_rate, time_seconds, tokens, tool_calls, notes
Aggregates: mean, stddev, min, max for each metric across configurations
Analyzer surfaces: patterns, anomalies, and freeform observations
Use for: Understanding skill performance, comparing across models, tracking regressions.
---
## Inline Workflows (Without Subagents)
When subagents aren't available, the same building blocks execute sequentially in the main loop.
### Eval Mode (Inline)
```
┌───────────────────────────────────────────────────────┐
│ MAIN LOOP │
│ │
│ Read executor.md → Execute eval → Save outputs │
│ │ │
│ ▼ │
│ Read grader.md → Grade expectations → Save grading │
│ │ │
│ ▼ │
│ Display results │
└───────────────────────────────────────────────────────┘
```
### Improve Mode (Inline)
```
┌───────────────────────────────────────────────────────┐
│ ITERATION LOOP │
│ │
│ Read executor.md → Execute (1 run) → Save outputs │
│ │ │
│ ▼ │
│ Read grader.md → Grade expectations → Save grading │
│ │ │
│ ▼ │
│ Compare with previous best (inline, not blind) │
│ │ │
│ ▼ │
│ Analyze differences → Apply improvements to v(N+1) │
│ │ │
│ (repeat until goal or timeout) │
└───────────────────────────────────────────────────────┘
```
Benchmark mode requires subagents and has no inline equivalent.


@@ -0,0 +1,438 @@
# JSON Schemas
This document defines the JSON schemas used by skill-creator.
## Working with JSON Files
### Initialize a new file with correct structure
```bash
scripts/init_json.py <type> <output-path>
# Examples:
scripts/init_json.py evals evals/evals.json
scripts/init_json.py grading run-1/grading.json
scripts/init_json.py benchmark benchmarks/2026-01-15/benchmark.json
scripts/init_json.py metrics run-1/outputs/metrics.json
```
### Validate an existing file
```bash
scripts/validate_json.py <file-path> [--type <type>]
# Examples:
scripts/validate_json.py evals/evals.json
scripts/validate_json.py run-1/grading.json --type grading
```
The validator infers the type from the filename when possible.
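The inference might work like this (a sketch of the behavior described above, not the actual `validate_json.py` logic; the set of recognized types is an assumption based on the schemas in this document):

```python
from pathlib import Path

# Assumed set of recognized schema types, matched against the filename stem
KNOWN_TYPES = ("evals", "grading", "benchmark", "metrics", "timing")

def infer_type(path):
    """Return the schema type implied by the filename, or None."""
    stem = Path(path).stem
    return stem if stem in KNOWN_TYPES else None

inferred = infer_type("run-1/grading.json")
```

When the filename doesn't match a known type, pass `--type` explicitly.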
---
## evals.json
Defines the evals for a skill. Located at `evals/evals.json` within the skill directory.
```json
{
"skill_name": "example-skill",
"evals": [
{
"id": 1,
"prompt": "User's example prompt",
"expected_output": "Description of expected result",
"files": ["evals/files/sample1.pdf"],
"expectations": [
"The output includes X",
"The skill used script Y"
]
}
]
}
```
**Fields:**
- `skill_name`: Name matching the skill's frontmatter
- `evals[].id`: Unique integer identifier
- `evals[].prompt`: The task to execute
- `evals[].expected_output`: Human-readable description of success
- `evals[].files`: Optional list of input file paths (relative to skill root)
- `evals[].expectations`: List of verifiable statements
---
## history.json
Tracks version progression in Improve mode. Located at workspace root.
```json
{
"started_at": "2026-01-15T10:30:00Z",
"skill_name": "pdf",
"current_best": "v2",
"iterations": [
{
"version": "v0",
"parent": null,
"expectation_pass_rate": 0.65,
"grading_result": "baseline",
"is_current_best": false
},
{
"version": "v1",
"parent": "v0",
"expectation_pass_rate": 0.75,
"grading_result": "won",
"is_current_best": false
},
{
"version": "v2",
"parent": "v1",
"expectation_pass_rate": 0.85,
"grading_result": "won",
"is_current_best": true
}
]
}
```
**Fields:**
- `started_at`: ISO timestamp of when improvement started
- `skill_name`: Name of the skill being improved
- `current_best`: Version identifier of the best performer
- `iterations[].version`: Version identifier (v0, v1, ...)
- `iterations[].parent`: Parent version this was derived from
- `iterations[].expectation_pass_rate`: Pass rate from grading
- `iterations[].grading_result`: "baseline", "won", "lost", or "tie"
- `iterations[].is_current_best`: Whether this is the current best version
---
## grading.json
Output from the grader agent. Located at `<run-dir>/grading.json`.
```json
{
"expectations": [
{
"text": "The output includes the name 'John Smith'",
"passed": true,
"evidence": "Found in transcript Step 3: 'Extracted names: John Smith, Sarah Johnson'"
},
{
"text": "The spreadsheet has a SUM formula in cell B10",
"passed": false,
"evidence": "No spreadsheet was created. The output was a text file."
}
],
"summary": {
"passed": 2,
"failed": 1,
"total": 3,
"pass_rate": 0.67
},
"execution_metrics": {
"tool_calls": {
"Read": 5,
"Write": 2,
"Bash": 8
},
"total_tool_calls": 15,
"total_steps": 6,
"errors_encountered": 0,
"output_chars": 12450,
"transcript_chars": 3200
},
"timing": {
"executor_duration_seconds": 165.0,
"grader_duration_seconds": 26.0,
"total_duration_seconds": 191.0
},
"claims": [
{
"claim": "The form has 12 fillable fields",
"type": "factual",
"verified": true,
"evidence": "Counted 12 fields in field_info.json"
}
],
"user_notes_summary": {
"uncertainties": ["Used 2023 data, may be stale"],
"needs_review": [],
"workarounds": ["Fell back to text overlay for non-fillable fields"]
},
"eval_feedback": {
"suggestions": [
{
"assertion": "The output includes the name 'John Smith'",
"reason": "A hallucinated document that mentions the name would also pass"
}
],
"overall": "Assertions check presence but not correctness."
}
}
```
**Fields:**
- `expectations[]`: Graded expectations with evidence
- `summary`: Aggregate pass/fail counts
- `execution_metrics`: Tool usage and output size (from executor's metrics.json)
- `timing`: Wall clock timing (from timing.json)
- `claims`: Extracted and verified claims from the output
- `user_notes_summary`: Issues flagged by the executor
- `eval_feedback`: (optional) Improvement suggestions for the evals, only present when the grader identifies issues worth raising
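The `summary` block is derivable from the graded expectations; a minimal sketch (the grader's own implementation may differ, but the arithmetic matches the example above):

```python
def summarize_expectations(expectations):
    """Derive the summary block from a list of graded expectations."""
    passed = sum(1 for e in expectations if e["passed"])
    total = len(expectations)
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        "pass_rate": round(passed / total, 2) if total else 0.0,
    }

# Hypothetical graded expectations for one run
graded = [
    {"text": "Output includes the name 'John Smith'", "passed": True},
    {"text": "The spreadsheet has a SUM formula in cell B10", "passed": False},
    {"text": "The form has 12 fillable fields", "passed": True},
]
summary = summarize_expectations(graded)
```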
---
## metrics.json
Output from the executor agent. Located at `<run-dir>/outputs/metrics.json`.
```json
{
"tool_calls": {
"Read": 5,
"Write": 2,
"Bash": 8,
"Edit": 1,
"Glob": 2,
"Grep": 0
},
"total_tool_calls": 18,
"total_steps": 6,
"files_created": ["filled_form.pdf", "field_values.json"],
"errors_encountered": 0,
"output_chars": 12450,
"transcript_chars": 3200
}
```
**Fields:**
- `tool_calls`: Count per tool type
- `total_tool_calls`: Sum of all tool calls
- `total_steps`: Number of major execution steps
- `files_created`: List of output files created
- `errors_encountered`: Number of errors during execution
- `output_chars`: Total character count of output files
- `transcript_chars`: Character count of transcript
---
## timing.json
Wall clock timing for a run. Located at `<run-dir>/timing.json`.
```json
{
"executor_start": "2026-01-15T10:30:00Z",
"executor_end": "2026-01-15T10:32:45Z",
"executor_duration_seconds": 165.0,
"grader_start": "2026-01-15T10:32:46Z",
"grader_end": "2026-01-15T10:33:12Z",
"grader_duration_seconds": 26.0,
"total_duration_seconds": 191.0
}
```
---
## benchmark.json
Output from Benchmark mode. Located at `benchmarks/<timestamp>/benchmark.json`.
```json
{
"metadata": {
"skill_name": "pdf",
"skill_path": "/path/to/pdf",
"executor_model": "claude-sonnet-4-20250514",
"analyzer_model": "most-capable-model",
"timestamp": "2026-01-15T10:30:00Z",
"evals_run": [1, 2, 3],
"runs_per_configuration": 3
},
"runs": [
{
"eval_id": 1,
"configuration": "with_skill",
"run_number": 1,
"result": {
"pass_rate": 0.85,
"passed": 6,
"failed": 1,
"total": 7,
"time_seconds": 42.5,
"tokens": 3800,
"tool_calls": 18,
"errors": 0
},
"expectations": [
{"text": "...", "passed": true, "evidence": "..."}
],
"notes": [
"Used 2023 data, may be stale",
"Fell back to text overlay for non-fillable fields"
]
}
],
"run_summary": {
"with_skill": {
"pass_rate": {"mean": 0.85, "stddev": 0.05, "min": 0.80, "max": 0.90},
"time_seconds": {"mean": 45.0, "stddev": 12.0, "min": 32.0, "max": 58.0},
"tokens": {"mean": 3800, "stddev": 400, "min": 3200, "max": 4100}
},
"without_skill": {
"pass_rate": {"mean": 0.35, "stddev": 0.08, "min": 0.28, "max": 0.45},
"time_seconds": {"mean": 32.0, "stddev": 8.0, "min": 24.0, "max": 42.0},
"tokens": {"mean": 2100, "stddev": 300, "min": 1800, "max": 2500}
},
"delta": {
"pass_rate": "+0.50",
"time_seconds": "+13.0",
"tokens": "+1700"
}
},
"notes": [
"Assertion 'Output is a PDF file' passes 100% in both configurations - may not differentiate skill value",
"Eval 3 shows high variance (50% ± 40%) - may be flaky or model-dependent",
"Without-skill runs consistently fail on table extraction expectations",
"Skill adds 13s average execution time but improves pass rate by 50%"
]
}
```
**Fields:**
- `metadata`: Information about the benchmark run
- `runs[]`: Individual run results with expectations and notes
- `run_summary`: Statistical aggregates per configuration
- `notes`: Freeform observations from the analyzer
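The `delta` block is the difference in configuration means, formatted with an explicit sign. A sketch consistent with the example above (the decimal precision per metric is an assumption inferred from the sample values):

```python
def signed(value, decimals):
    """Format a delta with an explicit leading sign."""
    return f"{value:+.{decimals}f}" if decimals else f"{value:+d}"

def compute_delta(with_skill, without_skill):
    """Difference of per-metric means: with_skill minus without_skill."""
    return {
        "pass_rate": signed(
            with_skill["pass_rate"]["mean"] - without_skill["pass_rate"]["mean"], 2),
        "time_seconds": signed(
            with_skill["time_seconds"]["mean"] - without_skill["time_seconds"]["mean"], 1),
        "tokens": signed(
            with_skill["tokens"]["mean"] - without_skill["tokens"]["mean"], 0),
    }

# Means taken from the run_summary example above
ws = {"pass_rate": {"mean": 0.85}, "time_seconds": {"mean": 45.0}, "tokens": {"mean": 3800}}
wo = {"pass_rate": {"mean": 0.35}, "time_seconds": {"mean": 32.0}, "tokens": {"mean": 2100}}
delta = compute_delta(ws, wo)
```

A positive `pass_rate` delta with a positive `time_seconds` delta is the common trade-off the analyzer should call out, as in the notes above.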
---
## comparison.json
Output from blind comparator. Located at `<grading-dir>/comparison-N.json`.
```json
{
"winner": "A",
"reasoning": "Output A provides a complete solution with proper formatting and all required fields. Output B is missing the date field and has formatting inconsistencies.",
"rubric": {
"A": {
"content": {
"correctness": 5,
"completeness": 5,
"accuracy": 4
},
"structure": {
"organization": 4,
"formatting": 5,
"usability": 4
},
"content_score": 4.7,
"structure_score": 4.3,
"overall_score": 9.0
},
"B": {
"content": {
"correctness": 3,
"completeness": 2,
"accuracy": 3
},
"structure": {
"organization": 3,
"formatting": 2,
"usability": 3
},
"content_score": 2.7,
"structure_score": 2.7,
"overall_score": 5.4
}
},
"output_quality": {
"A": {
"score": 9,
"strengths": ["Complete solution", "Well-formatted", "All fields present"],
"weaknesses": ["Minor style inconsistency in header"]
},
"B": {
"score": 5,
"strengths": ["Readable output", "Correct basic structure"],
"weaknesses": ["Missing date field", "Formatting inconsistencies", "Partial data extraction"]
}
},
"expectation_results": {
"A": {
"passed": 4,
"total": 5,
"pass_rate": 0.80,
"details": [
{"text": "Output includes name", "passed": true}
]
},
"B": {
"passed": 3,
"total": 5,
"pass_rate": 0.60,
"details": [
{"text": "Output includes name", "passed": true}
]
}
}
}
```
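The aggregate scores in the rubric follow from the subscores: each category score is the rounded mean of its three subscores, and `overall_score` is their sum. A sketch matching the example's arithmetic (the comparator's exact formula is not specified, so treat this as an inferred convention):

```python
def rubric_scores(content, structure):
    """Derive aggregate scores from 1-5 rubric subscores."""
    content_score = round(sum(content.values()) / len(content), 1)
    structure_score = round(sum(structure.values()) / len(structure), 1)
    return {
        "content_score": content_score,
        "structure_score": structure_score,
        "overall_score": round(content_score + structure_score, 1),
    }

# Output A's subscores from the example above
scores_a = rubric_scores(
    {"correctness": 5, "completeness": 5, "accuracy": 4},
    {"organization": 4, "formatting": 5, "usability": 4},
)
```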
---
## analysis.json
Output from post-hoc analyzer. Located at `<grading-dir>/analysis.json`.
```json
{
"comparison_summary": {
"winner": "A",
"winner_skill": "path/to/winner/skill",
"loser_skill": "path/to/loser/skill",
"comparator_reasoning": "Brief summary of why comparator chose winner"
},
"winner_strengths": [
"Clear step-by-step instructions for handling multi-page documents",
"Included validation script that caught formatting errors"
],
"loser_weaknesses": [
"Vague instruction 'process the document appropriately' led to inconsistent behavior",
"No script for validation, agent had to improvise"
],
"instruction_following": {
"winner": {
"score": 9,
"issues": ["Minor: skipped optional logging step"]
},
"loser": {
"score": 6,
"issues": [
"Did not use the skill's formatting template",
"Invented own approach instead of following step 3"
]
}
},
"improvement_suggestions": [
{
"priority": "high",
"category": "instructions",
"suggestion": "Replace 'process the document appropriately' with explicit steps",
"expected_impact": "Would eliminate ambiguity that caused inconsistent behavior"
}
],
"transcript_insights": {
"winner_execution_pattern": "Read skill -> Followed 5-step process -> Used validation script",
"loser_execution_pattern": "Read skill -> Unclear on approach -> Tried 3 different methods"
}
}
```