README clarifications from docs walkthrough testing

- Drop /reload-plugins (redundant, you restart with --channels next) - Fix token save path: .claude/channels/ not ~/.claude/channels/ - Clarify bot only responds once channel is running (pairing step)
Merge pull request #757 from anthropics/kenneth/add-bun-prereq
2026-03-20 11:33:08 +00:00 · 2026-03-19 16:58:17 -07:00 · 2026-03-19 16:31:36 -07:00 · 2026-03-19 16:28:02 -07:00
15 changed files with 11 additions and 1051 deletions
--- a/external_plugins/discord/README.md
+++ b/external_plugins/discord/README.md
@@ -4,6 +4,9 @@ Connect a Discord bot to your Claude Code with an MCP server.
 When the bot receives a message, the MCP server forwards it to Claude and provides tools to reply, react, and edit messages.
 ## Prerequisites
 - [Bun](https://bun.sh) — the MCP server runs on Bun. Install with `curl -fsSL https://bun.sh/install | bash`.
 ## Quick Setup
 > Default pairing flow for a single-user DM bot. See [ACCESS.md](./ACCESS.md) for groups and multi-user setups.
@@ -44,18 +47,15 @@ These are Claude Code commands — run `claude` to start a session first.
 Install the plugin:
 ```
 /plugin install discord@claude-plugins-official
 /reload-plugins
 ```
 Check that `/discord:configure` tab-completes. If not, restart your session.
 **5. Give the server the token.**
 ```
 /discord:configure MTIz...
 ```
-Writes `DISCORD_BOT_TOKEN=...` to `~/.claude/channels/discord/.env`. You can also write that file by hand, or set the variable in your shell environment — shell takes precedence.
+Writes `DISCORD_BOT_TOKEN=...` to `.claude/channels/discord/.env` in your project. You can also write that file by hand, or set the variable in your shell environment — shell takes precedence.
 **6. Relaunch with the channel flag.**
@@ -67,7 +67,7 @@ claude --channels plugin:discord@claude-plugins-official
 **7. Pair.**
-DM your bot on Discord — it replies with a pairing code. In your assistant session:
+With Claude Code running from the previous step, DM your bot on Discord — it replies with a pairing code. If the bot doesn't respond, make sure your session is running with `--channels`. In your Claude Code session:
 ```
 /discord:access pair <code>
--- a/external_plugins/telegram/README.md
+++ b/external_plugins/telegram/README.md
@@ -4,6 +4,10 @@ Connect a Telegram bot to your Claude Code with an MCP server.
 The MCP server logs into Telegram as a bot and provides tools to Claude to reply, react, or edit messages. When you message the bot, the server forwards the message to your Claude Code session.
 ## Prerequisites
 - [Bun](https://bun.sh) — the MCP server runs on Bun. Install with `curl -fsSL https://bun.sh/install | bash`.
 ## Quick Setup
 > Default pairing flow for a single-user DM bot. See [ACCESS.md](./ACCESS.md) for groups and multi-user setups.
@@ -23,18 +27,15 @@ These are Claude Code commands — run `claude` to start a session first.
 Install the plugin:
 ```
 /plugin install telegram@claude-plugins-official
 /reload-plugins
 ```
 Check that `/telegram:configure` tab-completes. If not, restart your session.
 **3. Give the server the token.**
 ```
 /telegram:configure 123456789:AAHfiqksKZ8...
 ```
-Writes `TELEGRAM_BOT_TOKEN=...` to `~/.claude/channels/telegram/.env`. You can also write that file by hand, or set the variable in your shell environment — shell takes precedence.
+Writes `TELEGRAM_BOT_TOKEN=...` to `.claude/channels/telegram/.env` in your project. You can also write that file by hand, or set the variable in your shell environment — shell takes precedence.
 **4. Relaunch with the channel flag.**
@@ -46,7 +47,7 @@ claude --channels plugin:telegram@claude-plugins-official
 **5. Pair.**
-DM your bot on Telegram — it replies with a 6-character pairing code. In your assistant session:
+With Claude Code running from the previous step, DM your bot on Telegram — it replies with a 6-character pairing code. If the bot doesn't respond, make sure your session is running with `--channels`. In your Claude Code session:
 ```
 /telegram:access pair <code>
--- a/plugins/math-olympiad/.claude-plugin/plugin.json
+++ b/plugins/math-olympiad/.claude-plugin/plugin.json
@@ -1,8 +0,0 @@
 {
  "name": "math-olympiad",
  "description": "Solve competition math (IMO, Putnam, USAMO) with adversarial verification that catches what self-verification misses. Fresh-context verifiers attack proofs with specific failure patterns. Calibrated abstention over bluffing.",
  "author": {
    "name": "Anthropic",
    "email": "support@anthropic.com"
  }
 }
--- a/plugins/math-olympiad/README.md
+++ b/plugins/math-olympiad/README.md
@@ -1,32 +0,0 @@
 # math-olympiad
 Competition math solver with adversarial verification.
 ## The problem
 Self-verification gets fooled. A verifier that sees the reasoning is biased toward agreement. arXiv:2503.21934 ("Proof or Bluff") showed 85.7% self-verified IMO success drops to <5% under human grading.
 ## The approach
 - **Context-isolated verification**: verifier sees only the clean proof, never the reasoning trace
 - **Pattern-armed adversarial checks**: not "is this correct?" but "does this accidentally prove RH?" / "extract the general lemma, find a 2×2 counterexample"
 - **Calibrated abstention**: says "no confident solution" rather than bluff
 - **Presentation pass**: produces clean LaTeX/PDF after verification passes
 ## Validation
 17/18 IMO+Putnam 2025 problems solved, 0 false positives, 2 novel proofs found. See the skill's eval data in the [anthropic monorepo](https://github.com/anthropics/anthropic/tree/staging/sandbox/sandbox/ralph/math_skills/eval_harness).
 ## Install
 ```
 /plugin install math-olympiad@claude-plugins-official
 ```
 ## Use
 ```
 > Solve this IMO problem: [statement]
 ```
 The skill auto-triggers on "IMO", "Putnam", "olympiad", "verify this proof", etc.
--- a/plugins/math-olympiad/skills/math-olympiad/SKILL.md
+++ b/plugins/math-olympiad/skills/math-olympiad/SKILL.md
@@ -1,282 +0,0 @@
 ---
 name: math-olympiad
 description: "Solve competition math problems (IMO, Putnam, USAMO, AIME) with adversarial verification that catches the errors self-verification misses. Activates when asked to 'solve this IMO problem', 'prove this olympiad inequality', 'verify this competition proof', 'find a counterexample', 'is this proof correct', or for any problem with 'IMO', 'Putnam', 'USAMO', 'olympiad', or 'competition math' in it. Uses pure reasoning (no tools) — then a fresh-context adversarial verifier attacks the proof using specific failure patterns, not generic 'check logic'. Outputs calibrated confidence — will say 'no confident solution' rather than bluff. If LaTeX is available, produces a clean PDF after verification passes."
 version: 0.1.0
 ---
 # Math Olympiad Solver
 ## The five things that change outcomes
 1. **Strip thinking before verifying** — a verifier that sees the reasoning is biased toward agreement. Fresh context, cleaned proof only.
 2. **"Does this prove RH?"** — if your theorem's specialization to ζ is a famous open problem, you have a gap. Most reliable red flag.
 3. **Short proof → extract the general lemma** — try 2×2 counterexamples. If general form is false, find what's special about THIS instance.
 4. **Same gap twice → step back** — the case split may be obscuring a unified argument. Three lines sometimes does what twelve pages couldn't.
 5. **Say "no confident solution"** — wrong-and-confident is worse than honest abstain.
 ---
 **Tool policy**: Solvers and verifiers use THINKING ONLY in the tight-budget workflow. Competition math is reasoning. Computation is for deep mode (§6c), and even then bounded — a recurrence that's doubly-exponential can't be computed past n~30, work mod 2^m instead.
 ---
 ## When to use which approach
 | Problem | Approach | Verification |
 |---|---|---|
 | AIME numeric answer | Best-of-N → majority vote | Answer check only |
 | Olympiad proof (IMO/Putnam/USAMO) | Full workflow below | 5-pass adversarial |
 | "Is this proof correct?" | Skip to verification (step 4) | Adversarial + spec-gaming |
 | **Full problem set** (e.g. all 6 from a competition) | Sequential: one full workflow per problem, collect results, compile single PDF | Per-problem adversarial |
 **Batch in one Workflow**: Set `opts.label` on every `agent()` call to include the problem ID (e.g., `label: "P3:solver:2"`). Without labels, 36 results come back with no problem association. Run problems in parallel — the label is what matters, not ordering.
 ### For a full problem set
 Launch one solver workflow per problem (same VERBATIM prompt, different statement). Run them in parallel. When all return, run adversarial verification per problem. Problems that pass get their proof in the PDF; problems that abstain get "No confident solution" with partial notes.
 Don't try to solve all N problems in one agent's context — each problem needs its own thinking budget and its own fresh-context verifier. The composition is mechanical: collect the per-problem outputs, fill in LaTeX sections, compile once.
 | "Simplify this proof" | Skip to presentation (step 8) | — |
 ---
 ## The Workflow
 ### 1. Interpretation check (30 seconds, catches 50/63 of one class of errors)
 Before solving anything, identify the interpretation.
 > Read the problem statement. List 2-3 ways it could be interpreted. For each: is this reading TRIVIAL? If one reading makes the problem easy and another makes it hard, the hard one is almost certainly intended. State which interpretation you're solving and WHY you believe it's the intended one.
 The Aletheia case study found 50 of 63 "technically correct" solutions were for the wrong interpretation. Olympiad problems often have a trap easy reading.
 ### 2. Generate candidates with internal refinement (parallel, thinking only)
 Launch 8-12 attempt agents in parallel. **Each agent internally iterates** — solve → self-improve → self-verify → correct → repeat. This is the Yang-Huang structure that achieves 85.7% on IMO: one-shot solving isn't enough; per-attempt refinement matters.
 **The Agent tool cannot enforce tool restriction.** Subagents get the full tool set. The only mechanism is the prompt. Use this prompt VERBATIM — do not summarize, do not synthesize your own:
 ```
 NO COMPUTATION. Do not use Bash, Python, WebSearch, Read, Write, or any tool that runs code or fetches data. Numerical verification is not a proof step. "I computed n=1..10 and the pattern holds" is not a proof.
 (If your agent harness requires a StructuredOutput or similar return-mechanism tool call, that is NOT a computation tool — call it to return your answer. The restriction is on tools that DO work, not tools that REPORT work.)
 Your internal process (iterate until done):
 - Solve: Complete rigorous solution.
 - Self-improve: Reread. Fix gaps before a grader sees it.
 - Self-verify: Strict grader mode. Every step justified?
 - Correct: Fix and re-verify. Up to 5 rounds.
 - Stop: Self-verify passes twice clean, OR 5 rounds, OR approach fundamentally wrong.
 A correct answer from flawed reasoning is a failure. If incomplete, say so honestly. Never hide gaps.
 PROBLEM: <insert the problem statement here>
 ANGLE: <insert one starting angle here>
 ```
 The first two paragraphs are load-bearing. A session that writes its own prompt and omits them will produce subagents that grind Python for 30 iterations and confidently get wrong answers — a pattern that fits n≤10 but fails at n=100 is not a proof.
 Starting angles (vary across agents — see `references/solver_heuristics.md`):
 - Work out small cases (test past n=3)
 - Look for an invariant or monovariant
 - Consider the extremal case
 - Try induction
 - What symmetries?
 - Work backwards
 - Drop a condition — where does it become trivially false?
 - Generalize (inventor's paradox — more structure is sometimes easier)
 Each returns its FINAL state (not intermediate rounds):
 ```
 **Verdict**: complete solution | partial result | no progress
 **Rounds**: [how many verify→correct cycles]
 **Method**: [key idea, one paragraph]
 **Detailed Solution**: [full step-by-step, every step justified]
 **Answer**: [if applicable]
 **Self-verification notes**: [what you caught and fixed; remaining concerns]
 ```
 **Retry policy**: If an agent fails or times out, retry once. Transient failures happen.
 ### 3. Clean the solution (context isolation — the #1 lever)
 The thinking trace biases the verifier toward agreement — a long chain of reasoning reads as supporting evidence even when the conclusion is wrong. Before any verification, strip:
 - All thinking-block content
 - All "Let me try..." / "Actually wait..." / "Hmm" prose
 - All false starts and backtracking
 What remains: problem statement + clean final argument only.
 Extract only the **Method** + **Proof** + **Answer** sections from each solver's output. The verifier never sees how the solver got there.
 ### 4. Adversarial verify (fresh context, pattern-armed)
 For each cleaned solution, launch a fresh verifier agent. **Fresh context**: it sees only (problem statement + cleaned solution). **No tools.**
 The verifier's job is to ATTACK, not grade. Load `references/adversarial_prompts.md` for the prompts. The key patterns it runs:
 | Pattern | The check |
 |---|---|
 | **#4** | Does this theorem specialize to a famous object (ζ, quadratic reciprocity, etc.) and prove something open about it? → gap |
 | **#18** | Substitute the proof's own intermediate identities into any "remaining gap." Recover the original claim? → tautological |
 | **#40** | Is any step a "one-line lemma"? Extract the GENERAL form. Find a 2×2 counterexample. If the general form is false, find what special structure saves THIS instance |
 | **#5** | For each invoked theorem: re-check hypotheses FROM SCRATCH. "Continuous on [0,1]" ≠ "continuous on ℝ" |
 | **#6** | Any infinite sum "bounded" via a regularized value? Check the boundary — if there's a pole there, the sum diverges |
 Full pattern list: `references/verifier_patterns.md`
 Verifier returns:
 ```
 **Verdict**: HOLDS | HOLE FOUND | UNCLEAR
 **If HOLE FOUND**:
 - Location: [quote the problematic step]
 - Pattern: [which check fired, or "other"]
 - Why it breaks: [specific]
 - Fixable?: [yes with X / no, fundamental]
 ```
 ### 5. Rank and vote-verify (asymmetric + early exit)
 Rank solutions by (verdict, verifier confidence). Take the top one. Run up to 5 fresh verifier agents.
 **Asymmetric thresholds**: 4 HOLDS to confirm, 2 HOLE FOUND to refute. Why asymmetric: one flaky verifier shouldn't kill a correct proof; but two independent dissents is a real signal.
 **Pigeonhole early exit**: stop launching verifiers once the outcome is decided.
 - 2 say HOLE FOUND → refuted, stop (save the remaining 3 calls)
 - 4 say HOLDS → confirmed, stop (save the 5th)
 - After 3 verifiers: if 2 HOLDS + 1 HOLE, launch 2 more (outcome undecided). If 3 HOLDS + 0 HOLE, launch 1 more (could still hit 4-1).
 **Dual context-isolation**: each verifier is blind to (a) the solver's thinking trace — already stripped in step 3 — AND (b) other verifiers' verdicts. Each verifier thinks it's the first. No "3 agents already confirmed this" social proof.
 **A solver cannot verify its own solution.** Different agent, fresh context.
 ### 5b. When one case won't close — step back before grinding
 If a proof splits into cases and one case proves easily but the other resists: **before grinding through the hard case, ask whether there's a route that makes the split disappear.**
 The pattern that saves you: the hard case's very hypothesis often implies something strong about an *intermediate object* you haven't looked at. Use that implication directly instead of the original chain.
 Concrete shape: proving f(n) ≤ cn for a constrained function f, with a case split on a prime p dividing f(n). One branch closes by index arguments in (ℤ/p^e)*. The other branch resists — same group structure, but the arithmetic doesn't contradict. The fix: the hypothesis "p | f(n)" plugged back into the governing equation implies **f(p) = p itself**. Once you have that, a Fermat+Dirichlet argument kills both branches in three lines. The case split was a detour — it was splitting on a variable that, under the hypothesis, takes a known value.
 Check when stuck on case B:
 - What does case B's hypothesis imply about f at *other* inputs?
 - Is there a different pair (a,b) to plug into the governing equation?
 - Are you proving too much? (A cleaner contradiction needs less machinery.)
 This is also a presentation-pass win: the split-free proof is shorter AND more general.
 ### 6. Revise (if needed)
 If verification finds a hole: launch a reviser agent. It gets (cleaned solution + verifier's hole report). STILL no access to the original thinking — the reviser works from the hole, not by rereading how you got there.
 ```
 A verifier found this issue in the proof:
 [hole report]
 Fix the proof. If the hole is fundamental (the approach doesn't work), say so and return **Verdict: no confident solution** with what partial progress remains.
 For any step you cannot fully close, mark it inline: [GAP: specific description of what remains]. Gaps in the proof text, not in a separate list — they're greppable and the next reviser knows exactly where to look.
 ```
 Up to 3 revise cycles. Then re-run the vote on the revised proof.
 **If pattern #40 fired** (one-line-proof-too-clean), the reviser gets a stronger brief — the Adversarial Brief template from `references/adversarial_prompts.md` §7. It forces a binary: "the general lemma is obviously false (here's a 2×2 counterexample) — so either find what's special about THIS case, or find where the proof breaks." Can't return "looks fine."
 ### 6c. Deep mode (when tight-budget abstains)
 The standard workflow is tight-budget: 8 solvers, ~15 min, pure reasoning. When it abstains, the problem may need more time, not more capability.
 **Deep mode** is a single focused agent with:
 - **Unlimited time** — no wall-clock pressure
 - **Targeted computation allowed** — modular arithmetic checks, small-case enumeration, symbolic verification of identities. NOT exploratory brute force or unbounded recursion.
 - **The abstention reason as starting point** — if verifiers found a specific gap, start there. If solvers never claimed complete, start from what they partially proved.
 The archetype: a focused agent that gets the proven-so-far state plus "one case of Lemma 5 is open" — and finds a 3-line argument the case split was obscuring. Often under 10 minutes with almost no computation. Deep mode is about giving the problem sustained attention, not throwing compute at it.
 **What deep mode is NOT**: open-ended exploration, literature search, looking up solutions, multi-day investigation. That's a different workflow (`math-research`). Deep mode is still "solve THIS problem yourself" — just without the clock.
 **NO WEB. NO LOOKUP.** Deep mode may use Bash/Python for bounded computation, but NEVER WebFetch, WebSearch, or any network access. Finding the solution on AoPS or a blog is not solving the problem — it's cheating on an olympiad, and it teaches us nothing about the skill's actual capability. Put this at the TOP of the deep-mode prompt:
 ```
 NO WEB ACCESS. Do not use WebFetch, WebSearch, or any tool that touches the internet. Do not look up this problem, its solution, or related problems. You are solving this yourself — the only allowed computation is local (Bash/Python for mod-k arithmetic, small-case enumeration n≤10, symbolic identity checks). If you invoke a web tool, the proof is void.
 ```
 **Computation bounds in deep mode** (bug #8 lesson): A6's b_{n+1}=2b_n²+b_n+1 is doubly-exponential; b_99 has ~10^{2^98} digits. Never compute such objects exactly — work in ℤ/2^m, or track only v_p(·), or prove the recursion mod the quantity you care about. If a computation is running longer than 60 seconds, it's probably unbounded. Kill it and work symbolically.
 **Step 6d (not optional)**: After any ABSTAIN at the verify stage, automatically launch one deep-mode agent before writing the abstention into the output. Give it:
 - The problem statement
 - The best partial proof from tight-budget solvers
 - The verifier gap descriptions (what specifically didn't close)
 - The instruction: "NO WEB ACCESS — do not look up this problem or its solution. Bounded local computation allowed (mod 2^k, small cases n≤10, symbolic identity checks via Bash/Python only). 60-second computation limit. If n≤10 brute force reveals a pattern the tight-budget solvers missed, that pattern IS the proof structure."
 The deep agent may find the construction the pure-reasoning solvers couldn't see. If it also abstains, THEN write the abstention. Do not skip this step — problems with √n or log n answers are often invisible to pure reasoning because the optimal structure is the asymmetric one.
 **Orchestrator self-restraint**: The orchestrator itself must not web-search the problem "to help" the deep agent. If you're tempted to Fetch an AoPS thread "just to check the answer," don't — that contaminates the skill's output and misrepresents its capability.
 ### 7. Calibrated abstention
 If 3 revise cycles all fail: **stop and admit it.**
 ```
 **Verdict**: no confident solution
 **What was tried**: [approaches]
 **What WAS proven**: [any lemma or partial result that survived verification]
 **Where it breaks**: [the unfixed hole]
 ```
 Do NOT guess. A wrong confident answer is worse than an honest "couldn't solve it." The metric that matters is CONDITIONAL accuracy — when you say "solved," are you right?
 ### 8. Presentation pass (after correctness is established)
 A VERIFIED-CORRECT proof is often not a BEAUTIFUL proof. The order you discovered it is rarely the best order to present it. Launch a fresh presentation agent with the verified proof.
 Load `references/presentation_prompts.md`. The agent asks:
 - What's the simplest way to say this?
 - Which lemmas should be inlined? Which deserve to stand alone?
 - Is anything OVERKILL? (constructing a double exponential when linear suffices)
 - Now that we know the answer, is there a 3-line hindsight proof?
 Output: LaTeX-formatted proof. If `pdflatex` is available (`scripts/check_latex.sh` returns 0), also compile to PDF via `scripts/compile_pdf.sh`.
 ---
 ## Model tier defaults
 Read `references/model_tier_defaults.md` for full details. Summary:
 | Model | Solvers | Verify passes | Abstain after | Presentation |
 |---|---|---|---|---|
 | Haiku 4.5 | 8 | 3 | 2 revise fails | skip |
 | Sonnet 4.6 | 4 | 5 | 3 revise fails | yes |
 | Opus 4.6 / Capybara | 3 | 5 + full pattern sweep | 4 revise fails | 2 drafts, pick cleaner |
 Weaker models: more parallel attempts, faster abstention. Stronger models: deeper verification, more presentation effort.
 ---
 ## For numeric-answer problems (AIME-style)
 Skip the proof machinery. Run 5-7 solvers with varied approaches, take majority vote on the numeric answer. If no majority: verify the top 2 candidates by substitution.
 ---
 ## Key references
 - `references/verifier_patterns.md` — the 12 adversarial checks
 - `references/adversarial_prompts.md` — ready-to-use verifier prompts
 - `references/presentation_prompts.md` — beautification prompts + LaTeX template
 - `references/model_tier_defaults.md` — per-model configuration
 ---
 ## What makes this different from generic verify-and-refine
 1. **Dual context isolation**: verifier is blind to (a) the solver's thinking trace — which biases toward agreement — and (b) other verifiers' verdicts — social proof also biases. Each verifier thinks it's first.
 2. **Pattern-specific attacks**: not "is this correct?" but "does this make the #40 mistake? the #4 mistake?" Specific beats generic. The 7-category refutation taxonomy gives the verifier a checklist.
 3. **Asymmetric vote + pigeonhole exit**: 4-to-confirm, 2-to-refute. One flaky verifier doesn't kill a correct proof; two dissents does. Stop launching verifiers once the outcome is decided — saves ~30% of verification cost on clear cases.
 4. **Specification-gaming check first**: explicitly asks "is this the intended interpretation?" before solving. The #1 failure mode in prior work (50/63 "correct" answers solved the wrong reading).
 5. **Calibrated abstention**: will say "no confident solution" with partial results. Optimizes conditional accuracy, not coverage.
 6. **Presentation pass**: correctness and elegance are separate steps. The presentation agent gets the VERIFIED proof and finds the cleanest way to say it.
--- a/plugins/math-olympiad/skills/math-olympiad/evals/trigger_eval.json
+++ b/plugins/math-olympiad/skills/math-olympiad/evals/trigger_eval.json
@@ -1,23 +0,0 @@
 [
  {"query": "Solve this IMO problem: Let n ≥ 2 be an integer. Prove that...", "should_trigger": true},
  {"query": "Is this Putnam proof correct? Here's my attempt at B3...", "should_trigger": true},
  {"query": "Find a counterexample to: every continuous function on [0,1] is uniformly continuous", "should_trigger": true},
  {"query": "Prove this olympiad inequality: for positive reals a,b,c with a+b+c=1...", "should_trigger": true},
  {"query": "Help me with this USAMO geometry problem", "should_trigger": true},
  {"query": "Verify my solution to AIME 2024 problem 12", "should_trigger": true},
  {"query": "I think there's a gap in this competition proof, can you find it?", "should_trigger": true},
  {"query": "Simplify this proof — it feels overly complicated", "should_trigger": true},
  {"query": "Here's a conjecture from a math competition. Is it true?", "should_trigger": true},
  {"query": "What's the cleanest way to present this olympiad solution?", "should_trigger": true},
  {"query": "Help me verify the time complexity of this sorting algorithm", "should_trigger": false},
  {"query": "Write a Python function that checks if a number is prime", "should_trigger": false},
  {"query": "I'm doing research on the Riemann Hypothesis, where should I start reading?", "should_trigger": false},
  {"query": "Debug this proof assistant code — my Lean tactic isn't working", "should_trigger": false},
  {"query": "Explain the proof of the fundamental theorem of calculus to a high schooler", "should_trigger": false},
  {"query": "What's a good textbook for learning competition math?", "should_trigger": false},
  {"query": "Generate 10 practice problems similar to AIME level", "should_trigger": false},
  {"query": "Compute the integral of x^2 sin(x) dx", "should_trigger": false},
  {"query": "Review my research paper draft on analytic number theory", "should_trigger": false},
  {"query": "What's the difference between IMO and Putnam in difficulty?", "should_trigger": false}
 ]
--- a/plugins/math-olympiad/skills/math-olympiad/references/adversarial_prompts.md
+++ b/plugins/math-olympiad/skills/math-olympiad/references/adversarial_prompts.md
@@ -1,192 +0,0 @@
 # Adversarial Verifier Prompts — Math Olympiad
 Prompt bank for the verifier subagent. Fresh context: problem statement + cleaned solution, NO thinking trace. Agent has NO tools — pure reasoning only.
 **Source**: `shared/verifier_patterns_source.md`. Background: arXiv:2503.21934 showed self-verified 85.7% IMO success drops to <5% under human grading. These prompts are the human grader.
 **Verifier isolation**: You do NOT know how other verifiers voted. You are not told if this proof has been confirmed or refuted by anyone else. Assume you're the first and only reviewer. (Social proof — "3 others confirmed" — biases toward agreement.)
 ---
 ## Reasons to REFUTE (the taxonomy — look for ANY one of these)
 Your goal is to find ANY reason to refute. These are the seven categories a hole falls into:
 1. **Step doesn't follow** — The conclusion of some step is not implied by its premises. (Includes direction errors: A>B and C>D does NOT give A−C>B−D.)
 2. **Hypothesis not satisfied** — An invoked theorem needs a condition the proof never verified. (Pattern #5: "entire" ≠ "analytic in a disk".)
 3. **Claim false in small case** — A stated identity or bound fails at n=2, n=3, or the first nontrivial block. Mentally test it.
 4. **Tautological** — The "gap" at the end is the original problem in disguise. (Pattern #18: substitute the proof's own identities back in.)
 5. **Proves too much** — The argument's skeleton applies to a famous object and proves something open or false about it. (Pattern #4.)
 6. **Wrong interpretation** — Solves an easier reading of the problem than the intended one. (Pattern #60.)
 7. **Hand-wave at the crux** — "iterating and optimizing gives the result", "by standard methods", "the details are routine" — at exactly the step that ISN'T routine.
 If none of these fire after a genuine attempt, CONFIRM. Do not confirm because the proof *sounds* confident.
 ---
 ## 1. General Adversarial (default)
 You are an adversarial verifier. Below is a problem and a proposed solution.
 **You are NOT grading this. You are trying to BREAK it.** Assume the author is a strong student who made one subtle error that a sympathetic reader would gloss over. Your job is to find that error. If you cannot find one after genuinely trying, say so — but do not say so just because the solution is confidently written.
 Attack each step:
 - Is the claimed inequality actually in the claimed direction? Reason through a small case mentally.
 - Is every "clearly" / "obviously" / "it follows that" actually clear? These words often mark the exact spot where the author convinced themselves of something false.
 - Does every cited theorem's hypothesis actually hold? Check quantifiers: "for all" vs "there exists", pointwise vs average.
 - At each "WLOG": is generality actually preserved, or does the reduction discard the hard case?
 - Does the argument use a property that's true for the *generic* object but not the *specific* one in the problem?
 You have no tools. Reason about small cases in your head — do not claim to have "computed" anything.
 **Output format:**
 ```
 VERDICT: CORRECT | INCORRECT | GAP
 CONFIDENCE: high | medium | low
 ISSUE: [if INCORRECT/GAP: one-sentence location, then one-paragraph explanation. If CORRECT: the step you tried hardest to break and why it held.]
 ```
 ---
 ## 2. Pattern #4 — Would It Prove Too Much?
 You are an adversarial verifier running a single check: **does this argument prove something famously open or famously false?**
 Read the proposed solution. Ignore whether the proof is locally valid. Instead:
 1. Strip the argument down to its skeleton: what properties of the given objects does it *actually use*?
 2. Find the most famous object that shares exactly those properties. (If it bounds a sum using only "positive decreasing terms" — does the harmonic series have positive decreasing terms? If it uses only "multiplicative and bounded by 1" — does the Möbius function qualify?)
 3. Mentally rerun the argument on that substitute. What does it now prove?
 If the substitute conclusion is a known open problem or a known falsehood, the original proof has a gap. The gap is at the step where the argument stops working for the substitute — find that step. That step is silently using a property the author never stated.
 If the argument genuinely uses a property specific to the problem's object that the famous substitute lacks, say which property and where it's used.
 **Output format:**
 ```
 VERDICT: CORRECT | INCORRECT
 CONFIDENCE: high | medium | low
 SUBSTITUTE_TESTED: [what object you substituted]
 ISSUE: [if it proves too much: which step fails for the substitute, and what unstated property is needed. If not: which step uses the specific property and why the substitute fails there.]
 ```
 ---
 ## 3. Pattern #40 — One-Line-Proof-Too-Clean
 You are an adversarial verifier targeting short proofs. The solution below contains at least one step that is suspiciously brief — one line doing a lot of work.
 For the shortest load-bearing step in the solution:
 1. **Extract the general lemma.** Write down the most general claim the step is implicitly using. Not "for this sum" but "for any sum of this shape." Not "for the determinant" but "for any function of the matrix entries with this property."
 2. **Try to break the general lemma with a 2×2 case.** Two elements, two terms, a 2×2 matrix — the smallest nontrivial instance. Reason it through in your head. Can you find values where the general lemma fails?
 3. **Judge:**
   - If the general lemma survives your 2×2 attack: the step is probably fine.
   - If the general lemma FAILS at 2×2 but the specific instance in the proof still seems to work: the step is **INCORRECT as written**. There is special structure in the problem that makes it true, and the proof does not invoke that structure. The author got the right answer for the wrong reason.
 The classic failure: "rank depends only on support" — but [[1,1],[1,1]] has rank 1 and [[1,1],[1,−1]] has rank 2, same support. General lemma false; a specific instance was true because of a sign-factorization the proof never mentioned.
 **Output format:**
 ```
 VERDICT: CORRECT | INCORRECT | GAP
 CONFIDENCE: high | medium | low
 GENERAL_LEMMA: [the extracted general claim]
 2x2_TEST: [the instance you tried, and what it showed]
 ISSUE: [if the general lemma is false: what special structure the proof failed to invoke]
 ```
 ---
 ## 4. Pattern #18 — Tautological Reduction
 You are an adversarial verifier checking one thing: **did the solution argue itself in a circle?**
 The solution likely proceeds through a chain of reductions or equivalent reformulations, ending at a "final estimate" or "key inequality" that it then proves directly. Your task:
 1. List every identity, equality, or substitution the solution establishes along the way. (Things like "A = B + C", "the sum splits as X + Y", "by the earlier lemma, P = Q".)
 2. Take the FINAL claim — the one the solution presents as "and this is now easy" or "this follows from [standard fact]".
 3. Substitute the chain's OWN identities (from step 1) back into that final claim. Expand. Simplify.
 4. What do you get? If you recover the ORIGINAL problem — or something trivially equivalent to it — then the "reduction" is a tautology. The proof has done nothing; it renamed the problem and declared it solved.
 The trap: long chains feel like progress. "We've reduced it to bounding X!" is only progress if X is actually different from what you started with. Sometimes X is just the original, wearing a hat.
 **Output format:**
 ```
 VERDICT: CORRECT | INCORRECT | GAP
 CONFIDENCE: high | medium | low
 FINAL_CLAIM: [the claim the solution treats as the easy endpoint]
 SUBSTITUTED_BACK: [what it becomes after expanding the chain's own identities]
 ISSUE: [is it the original problem? trivially equivalent? genuinely simpler? say which and why]
 ```
 ---
 ## 5. Pattern #60 — Specification-Gaming
 You are an adversarial verifier checking one thing: **did the solution answer the easiest interpretation of the question instead of the intended one?**
 Read the problem statement alone. Before looking at the solution in detail:
 1. Write down 2–3 plausible readings of what the problem is asking. Pay attention to: scope of quantifiers ("find all" vs "find one"), what "determine" means (a formula? a characterization? an existence proof?), boundary cases (does n=0 or n=1 count? is the empty set allowed? are degenerate configurations included?).
 2. Rank them by how hard they would be to solve.
 3. Which reading did the solution actually address?
 If the solution addresses the EASIEST reading — and especially if the problem under that reading would be trivially short for its stated source (an IMO problem that becomes a two-liner is a red flag) — then be suspicious. Olympiad problems are calibrated to their point values. A final-problem that falls in three lines means you're probably not solving the final problem.
 Also check: did the solution prove something about *an* object when the problem asked about *all* such objects? Did it show *possibility* when the problem wanted *necessity*?
 **Output format:**
 ```
 VERDICT: CORRECT | INCORRECT | GAP
 CONFIDENCE: high | medium | low
 READING_SOLVED: [which interpretation the solution addresses]
 READING_INTENDED: [which interpretation you believe was intended, and why]
 ISSUE: [if they differ: what the solution is missing. If they match: why the easy reading is genuinely the intended one.]
 ```
 ---
 ## 6. Consecutive-Verify (5-pass loop)
 You are verifier pass {K} of 5. A solution passes only if all five independent verifiers agree.
 **Verify INDEPENDENTLY.** You have not seen — and must not imagine — what any other verifier said. Do not reason "this probably already got checked." Your vote is the only vote you control. If you wave something through on the assumption that another pass will catch it, and the other four passes reason the same way, a wrong solution ships.
 Read the problem. Read the solution. Trace every step yourself, from scratch.
 One bias to actively resist: when a solution is well-written, confident, and uses standard machinery correctly in *most* places, you will be inclined to trust the one place you can't quite follow. **Invert this.** Well-written and confident is exactly what a subtly wrong solution looks like — the author convinced themselves before they convinced the math. The place you can't quite follow is the place to press hardest.
 You have no tools. Reason through small cases mentally; do not claim numerical verification.
 **Output format:**
 ```
 VERDICT: CORRECT | INCORRECT | GAP
 CONFIDENCE: high | medium | low
 PASS_NUMBER: {K}
 ISSUE: [if INCORRECT/GAP: exact step and why. If CORRECT: the step you found hardest to verify, and the reasoning that convinced you it holds.]
 ```
 ---
 ## 7. Adversarial Brief (for the reviser when pattern #40 fires)
 Use this instead of a general "fix the hole" prompt when a verifier flagged a one-line lemma whose general form is false. This framing forces a binary — the reviser cannot return "looks fine."
 > **Adversarial brief**: The principle "[extracted general lemma]" is obviously false in general — [trivial counterexample, e.g., [[1,1],[1,1]] has rank 1 and [[1,1],[1,−1]] has rank 2, same support].
 >
 > So exactly one of these is true, and your job is to determine which:
 >
 > **(A)** The conclusion holds for a DIFFERENT reason specific to this case. Find that reason. What structure does [the specific object in the problem] have that [the counterexample] lacks? That structure is the real proof.
 >
 > **(B)** The proof is wrong and the conclusion fails at [concrete prediction of where it diverges — e.g., "the first case where the block is ≥2×2, which is m=4"].
 >
 > Return (A) with the special structure identified, or (B) with the failure point. "The original proof is actually fine" is not an available answer — the general lemma is false, so either something saves this instance or nothing does.
 The best outcome is (A) — the thesis survives AND you learn why. The corrected proof is more informative than the false one.
 **Output format:**
 ```
 RESOLUTION: (A) SPECIAL_STRUCTURE | (B) CONCLUSION_FALSE
 IF (A): The structure [specific object] has that [counterexample] lacks: [...]. Revised proof: [...]
 IF (B): Fails at [parameter/case]. Reason: [...]
 ```
--- a/plugins/math-olympiad/skills/math-olympiad/references/attempt_agent.md
+++ b/plugins/math-olympiad/skills/math-olympiad/references/attempt_agent.md
@@ -1,53 +0,0 @@
 # Solver-Refiner Agent Prompt
 You are solving a competition math problem. You have NO tools — pure reasoning only.
 ## Your process (iterate internally until done)
 **Round 1: Solve**
 Think deeply. Produce a complete solution.
 **Round 2: Self-improve**
 Reread your solution. Fix any errors or gaps you find. This is your chance to catch your own mistakes before a grader does.
 **Round 3: Self-verify**
 Switch roles. You are now a strict IMO grader. Check every step. Classify each issue as:
 - **Critical Error**: breaks the logical chain (e.g., claiming A>B and C>D implies A-C>B-D)
 - **Justification Gap**: conclusion may be correct but argument incomplete
 If you find issues: note them, go back to your solver role, correct the solution, verify again. Repeat up to 5 times.
 **Stop when**: Either your self-verification passes cleanly 2 times in a row, OR you've done 5 correction rounds, OR you're certain the approach is fundamentally wrong.
 ## Core principles (from Yang-Huang IMO25)
 - **Rigor is paramount**: A correct final answer from flawed reasoning is a failure.
 - **Honesty about completeness**: If you cannot find a complete solution, say so. Present significant partial results (key lemma proven, one case resolved, a bound without achievability). Do NOT guess or hide gaps.
 - **Use TeX**: All mathematics in `$...$` or `$$...$$`.
 ## Output format (ONLY your FINAL state after all rounds — not the intermediate iterations)
 ```
 **Verdict**: complete solution | partial result | no progress
 **Rounds**: [how many self-verify→correct cycles you ran]
 **Method**: [one paragraph: the key idea]
 **Detailed Solution**:
 [Full step-by-step proof. Every step justified. No "clearly" or "obviously" — justify everything.]
 **Answer**: [if the problem asks for a specific value/set/characterization]
 **Self-verification notes**: [what you caught and fixed; any remaining concerns]
 ```
 ---
 PROBLEM:
 {statement}
 HINT: {angle}
--- a/plugins/math-olympiad/skills/math-olympiad/references/known_constructions.md
+++ b/plugins/math-olympiad/skills/math-olympiad/references/known_constructions.md
@@ -1,25 +0,0 @@
 # Construction Patterns
 Methodological patterns for finding optimal constructions. No specific problem answers.
 ## Spread vs cluster
 For optimization problems over permutations/configurations: the **symmetric choice (identity, diagonal, regular spacing) is often the worst case, not the best**. The intuition "symmetric = optimal" fails when the objective rewards *large substructures* that symmetry prevents.
 **When to suspect this**: The problem asks to maximize the size of something (tiles, intervals, independent sets) subject to a one-per-row/one-per-column constraint. The symmetric placement makes the forbidden region a contiguous band, leaving only thin slivers. Spreading the forbidden positions leaves fat windows.
 **What to try**: Partition into √n groups, assign each group to a residue class mod √n. Within a group, place in reverse order. This makes any contiguous block of √n rows/columns have its forbidden positions spread across all residue classes.
 ## Moment curve for distinctness
 When you need n objects in ℝ^k where "any k are independent" (or similar genericity), the moment curve `(1, t, t², ..., t^{k-1})` at n distinct parameter values gives this for free. Vandermonde determinants are nonzero, so any k of the vectors are linearly independent.
 **Rank-1 from vectors**: If you need matrices instead of vectors, rank-1 idempotents `A_i = v_i w_i^T` (projection onto `span(v_i)` along a complementary hyperplane) turn vector genericity into commutator conditions. `[A_i, A_j] = 0` iff a specific determinant vanishes.
 ## When brute-force reveals √n
 If brute-forcing n=2..8 gives a sequence that fits `an + b√n + c` better than `an + b`, the optimal structure has √n-sized blocks. Look for a construction parameterized by k where k=√n balances two competing costs (e.g., k things each of size n/k).
 ## Avoid: storing specific answers here
 This file is for construction *techniques*, not solutions. If you find yourself writing "the answer to Problem X is Y," delete it.
--- a/plugins/math-olympiad/skills/math-olympiad/references/model_tier_defaults.md
+++ b/plugins/math-olympiad/skills/math-olympiad/references/model_tier_defaults.md
@@ -1,44 +0,0 @@
 # Model Tier Defaults
 Parameters scale with model capability. Budget is not the constraint — the constraints are diminishing returns (more voters stop helping past a point) and the asymmetric noise floor (Haiku verifiers are individually less reliable, so the right response is width not depth).
 ## Haiku 4.5
 Width compensates for per-sample noise. Scaffolding is where the leverage is.
 - **Parallel solvers**: 12 (wide fan — each individual solve is weaker, so cast a wider net)
 - **Vote budget**: 7 verifiers, need 5-confirm / 3-refute (pigeonhole exit: stop when outcome decided)
 - **Abstain threshold**: 3 consecutive revise cycles fail
 - **Pattern sweep**: all 12 patterns — Haiku can follow a checklist, the patterns are the scaffold
 - **Presentation pass**: yes, 3 drafts, comparator picks cleanest. Haiku's raw output is rougher, so this matters MORE not less.
 - **Rationale**: The skill's value is highest where the base model is weakest. Give Haiku the full harness. The 3-refute threshold (higher than Sonnet's 2) accounts for Haiku verifiers being individually noisier — don't let 2 confused Haikus kill a correct proof.
 ## Sonnet 4.6
 Balanced.
 - **Parallel solvers**: 6
 - **Vote budget**: 5 verifiers, need 4-confirm / 2-refute
 - **Abstain threshold**: 3 consecutive revise cycles fail
 - **Pattern sweep**: all 12
 - **Presentation pass**: 2 drafts, comparator picks cleaner
 - **Rationale**: 4-of-5 tolerates one flake. 2 dissents is signal.
 ## Opus 4.6 / Capybara
 Depth. Each sample is strong, so invest in making the adversarial pass harder.
 - **Parallel solvers**: 4
 - **Vote budget**: 5 general verifiers (4-confirm / 2-refute) PLUS one dedicated verifier per pattern in `verifier_patterns.md` (12 targeted attacks). Any pattern-specific HOLE FOUND counts toward refute.
 - **Abstain threshold**: 5 consecutive revise cycles fail (trust the model's ability to eventually fix)
 - **Pattern sweep**: all 12, each with its own dedicated agent
 - **Presentation pass**: 3 drafts with different instructions ("most elegant," "most elementary," "shortest"), comparator picks the best. Strong models can genuinely produce different *styles* of proof.
 - **Rationale**: Opus/Capybara can execute the deep patterns (#19 base-vs-derived, #22 mean-first) that need real mathematical judgment. The 12 dedicated pattern passes are where the model's capability is best spent — it's the difference between "be skeptical" and "check THIS specific thing."
 ## On the pigeonhole exit
 Kept at all tiers — not because of cost, but because once `inflight >= confirm_needed + refute_needed - 1`, the remaining votes carry no information regardless of how they land. Launching them anyway is pure latency.
 ## Identifying the tier
 If the orchestrating session doesn't know which model it is, default to Sonnet configuration. A reasonable heuristic: ask the model to self-identify in its first response and match against `haiku`/`sonnet`/`opus`/`capybara` in the output.
--- a/plugins/math-olympiad/skills/math-olympiad/references/presentation_prompts.md
+++ b/plugins/math-olympiad/skills/math-olympiad/references/presentation_prompts.md
@@ -1,109 +0,0 @@
 # Presentation Pass — Prompts and Templates
 **Premise**: Aletheia's PDFs are beautiful; raw IMO output is not. The difference is a *presentation pass*: after a proof is **verified correct**, a fresh agent — one who didn't sweat through the discovery — finds the cleanest way to say it. The discoverer is too attached to the scaffolding.
 The Erdős paper even criticizes Aletheia's *own* output: *"somewhat overkill; any f whose inverse is at most [X] would suffice, no need to take the double exponential."* The presentation pass is where overkill goes to die.
 ---
 ## 1. The Presentation Pass Prompt
 Paste this to a **fresh subagent** along with the verified proof. The agent must not have discovery-context; that's the point.
 > You are given a **verified, correct proof**. Your job is not to check it — it is correct. Your job is to find the **cleanest presentation**. The order it was discovered in is almost never the order it should be read in.
 >
 > Work through these questions in order:
 >
 > **Hindsight shortcuts.** Now that you know the answer, is there a 3-line argument? The discoverer built machinery to *find* the key step; you already *have* the key step. Can the machinery be discarded? (Classic: a long case-bash that, in hindsight, collapses once you spot the invariant.)
 >
 > **Overkill.** Is any bound stronger than needed? Any construction more general than the problem requires? If a double exponential works but a linear function also works, use the linear one — the reader will wonder what the double exponential is hiding. Match the strength of each tool to the strength of what it's proving.
 >
 > **What to cut.** Which steps *verify* without *illuminating*? Discovery leaves a debris field: sanity checks, dead ends backed out of, "note that X (we won't use this)". Delete them. If a paragraph can be removed and the proof still compiles in the reader's head, remove it.
 >
 > **Lemma granularity.** Inline a lemma if it's used once and the proof is ≤3 lines. Keep it standalone if it's used twice, or if its *statement alone* clarifies the structure (even with a 1-line proof). Name standalone lemmas descriptively — "Combinatorial dimension bound", not "Lemma 2".
 >
 > **Order.** Lead with the main statement. Then the one idea that makes it work. Then the details. Isolate the one genuinely clever step — there's almost always exactly one — and let everything else be obviously routine *by contrast*.
 >
 > **Step names.** Number steps *and* name them: "**Step 3: Fourier inversion and translation invariance.**" The name is a promise to the reader about what this block accomplishes. Signpost reductions explicitly: "We are reduced to showing that…"
 >
 > Output clean LaTeX using the template below. Aim for: a strong grad student could reconstruct every suppressed detail, a professor could skim the step names alone and nod.
 ---
 ## 2. LaTeX Output Template
 Minimal preamble — Aletheia's environments, none of its ornament. No `tcolorbox`, no custom colors.
 ```latex
 \documentclass[11pt]{article}
 \usepackage[margin=1.25in]{geometry}
 \usepackage{amsmath, amssymb, amsthm, mathtools}
 \usepackage[shortlabels]{enumitem}
 \usepackage{hyperref}
 \theoremstyle{plain}
 \newtheorem{theorem}{Theorem}
 \newtheorem{lemma}{Lemma}
 \newtheorem{claim}{Claim}
 \newtheorem{proposition}[theorem]{Proposition}
 \theoremstyle{definition}
 \newtheorem{definition}[theorem]{Definition}
 \newtheorem*{remark}{Remark}
 \begin{document}
 \section*{Problem}
 % Restate the problem exactly. No paraphrase.
 \section*{Solution}
 \begin{theorem}
 % State what you will prove, in full. If the answer is "yes" or "no"
 % or a specific value, state it here so the reader isn't kept in suspense.
 \end{theorem}
 % If a lemma is reused or structurally load-bearing, state it before
 % the main proof. One-shot verifications get inlined below.
 % \begin{lemma}\label{lem:key}
 %   ...
 % \end{lemma}
 % \begin{proof} ... \end{proof}
 \begin{proof}[Proof of Theorem]
 \textbf{Step 1: [Descriptive name — what this step accomplishes].}
 % e.g. "Reduction to the compact case." / "The key invariant."
 % Display important equations; inline routine ones.
 % End a reduction step with: "We are reduced to showing that ..."
 \textbf{Step 2: [Name].}
 % ...
 \textbf{Step $n$: Conclusion.}
 % One or two sentences. Make the contradiction / induction close / final
 % computation land visibly.
 \end{proof}
 \end{document}
 ```
 **Style conventions lifted from the Aletheia samples:**
 - Display math for the equation a step *produces*; inline math for the algebra getting there.
 - Cite precisely when invoking a named result: *(Jacquet–Piatetski-Shapiro–Shalika, 1981)* — not "by a well-known theorem".
 - In contradiction proofs: state the false assumption plainly ("Suppose, for contradiction, that…"), and flag the collision plainly ("We are led to the contradiction $0 > 0$.").
 - Integer bounds earn the ceiling: if $d \ge n/k$ and $d \in \mathbb{Z}$, write $d \ge \lceil n/k \rceil$. Free sharpness.
 ---
 ## 3. Anti-Patterns to Catch
 The presentation agent should flag and fix these:
 - **Discovery-order exposition.** "First I tried X, which led me to notice Y…" — the reader doesn't care. State Y.
 - **Overkill constructions.** The tell: the bound you prove is parametrically stronger than what the next line consumes. Weaken it until it's tight.
 - **Proof by intimidation.** *"It is trivial to see that…"*, *"Obviously…"*, *"A standard argument shows…"* — if it's trivial, one sentence suffices. Write the sentence.
 - **Unnecessary generality.** Proving it for all $n$ when the problem asks about $n=3$ and the general case adds no insight, only indices.
 - **Orphan lemmas.** Stated, proved, cited once, three lines long. Inline it.
 - **Unlabeled case splits.** Five cases, no indication of why five or what distinguishes them. Name the cases; say upfront which one carries the content.
 - **Missing signposts.** A page of computation with no "we are reduced to" / "it suffices to show" markers. The reader shouldn't have to reverse-engineer your strategy.
--- a/plugins/math-olympiad/skills/math-olympiad/references/solver_heuristics.md
+++ b/plugins/math-olympiad/skills/math-olympiad/references/solver_heuristics.md
@@ -1,80 +0,0 @@
 # Solver Heuristics (Pólya + Olympiad Practice)
 For solver subagents. These are the moves to try when the direct approach stalls.
 ## Pólya's core moves (from "How to Solve It")
 **Have you seen a related problem?** Not the same problem — one with the same UNKNOWN, or the same STRUCTURE. A problem about covering points with lines has the same shape as one about covering lattice points with arithmetic progressions.
 **Specialize.** If you can't solve the given problem, solve n=3, n=4, n=5 by hand. The pattern is often the proof. (But: test past the first nontrivial case — n≤3 may be degenerate.)
 **Generalize (inventor's paradox).** The more ambitious problem sometimes has MORE structure and is easier. "Prove for all primes" might be harder than "prove for all integers" if the integer case has a clean induction.
 **Drop a condition.** What happens if you relax one hypothesis? Does the result become trivially false? Where? That WHERE is often the key step — the point where the condition is load-bearing.
 **Work backwards.** Start from what you want to prove. What would imply it? What would imply THAT? If this chain meets something you can prove directly, you have the proof (reversed).
 **Auxiliary element.** Introduce something not in the problem — a new variable, a reflection, a well-chosen function. Olympiad geometry lives on this (auxiliary points, circles).
 ## Olympiad-specific moves
 **Find the invariant.** If there's a process (game, transformation, iteration), what quantity is preserved? Parity, sum, product modulo something.
 **Find the extremal.** Take the LARGEST, or SMALLEST, or LEFTMOST object. Extremal choices often have extra properties that generic choices don't.
 **Double count.** Count the same thing two ways. Incidences, edges, sums over pairs.
 **Coloring / parity.** Can you 2-color the objects so the claim becomes a parity statement?
 **Smoothing / adjusting.** For inequalities: if you perturb two variables closer together (or further apart), does the expression increase or decrease? Extremize.
 **Symmetry → WLOG.** If the problem is symmetric in x,y,z, you can assume x≤y≤z. But only if the conclusion is ALSO symmetric.
 ## Geometry-specific moves
 Standard angles (induction, invariants, extremal) are often wrong-shaped for olympiad geometry. Use these instead:
 **Coordinate bash.** Place the configuration in coordinates. Choose them to kill degrees of freedom (origin at a center, axis along a line). Grind out the algebra. Ugly but reliable.
 **Auxiliary point.** Introduce a point not in the problem — a reflection, a second intersection, the point where two lines "should" meet. Often the key construction is finding the right extra point.
 **Power of a point.** For any point P and circle ω, PA·PB is the same for every line through P meeting ω at A, B. Use it to turn ratios into equalities.
 **Spiral similarity / rotation.** Two directly similar triangles are related by a spiral similarity (rotation + scaling about a fixed point). Find that point — it often lies on a circle you already have.
 **Inversion.** When there are many circles or tangencies, invert about a well-chosen center. Circles through the center become lines; tangencies become simpler tangencies.
 **Angle chase.** Cyclic quadrilaterals give equal angles. Tangent-chord gives an angle equal to the inscribed angle. Chase around the figure.
 ## Geometry-specific moves (these are DIFFERENT)
 The standard angles (invariant, extremal, induction) don't fit circles/circumcenters/orthocenters. Geometry needs:
 **Coordinate bash.** Place one point at origin, another on the x-axis. Compute everything explicitly. The algebra is heavy but mechanical. For two circles with centers M, N and radii r, R: set M=(0,0), N=(d,0), then the intersection points have x-coordinate (r²+d²−R²)/2d and everything follows.
 **Auxiliary point.** Introduce a point not in the problem — the reflection, the foot of a perpendicular, the second intersection. Olympiad geometry lives on finding the right extra point.
 **Power of a point.** For point P and circle Γ: PA·PB is constant for any line through P meeting Γ at A,B. This converts circles to products.
 **Inversion.** Circles through the center become lines. Sometimes the inverted problem is trivial.
 **Angle chasing / cyclic quads.** Four points are concyclic iff opposite angles sum to π. Chase angles until enough equalities force concyclicity.
 ## Recurrence-specific trap
 For recurrences like b_{n+1} = P(b_n) where P is polynomial degree ≥ 2: **b_n grows doubly-exponentially**. You cannot compute b_30 exactly — it has trillions of digits. Work in ℤ/2^m (or ℤ/p^m) from the start. Prove b_n ≡ r_n (mod 2^m) by induction on n, NOT by computing b_n.
 ## When the answer involves √n or log n
 These answers often come from a structure that is NOT the obvious/symmetric one. The diagonal, the identity, the "natural" choice frequently gives the WORST case, not the best — it clusters the constraint in a way that prevents large substructures.
 **For pure-reasoning solvers**: Before claiming the symmetric choice is optimal, ask "what if I deliberately break the symmetry?" For grid/covering problems: what if the gaps are SPREAD OUT instead of clustered? For sequences: what if the extremal sequence is NOT constant or linear?
 **For deep-mode agents**: Brute-force n=3..8 before theorizing. If the formula that fits is n+c√n instead of cn, the structure has √n-sized blocks.
 ## The Look Back phase (after you have a proof)
 - **Can you check it?** Plug in small cases. Does n=3 give what your formula says?
 - **Can you prove it differently?** A second proof is a verification. And often shorter.
 - **Is your bound tight?** If you proved ≤ N and the answer is exactly N, find the extremal case. If you can't, your bound might be loose.
 - **What did you actually use?** Sometimes you used less than all the hypotheses — the real theorem is stronger.
--- a/plugins/math-olympiad/skills/math-olympiad/references/verifier_patterns.md
+++ b/plugins/math-olympiad/skills/math-olympiad/references/verifier_patterns.md
@@ -1,135 +0,0 @@
 # Verifier Patterns — Olympiad Subset
 For a verifier with **no tools, only reasoning**. Each pattern is a mental check you can run on a candidate proof. These are the specific ways proofs go wrong that self-verification misses. (Source: 59 patterns from real research sessions; these 13 need no grep/fetch/compute.)
 Run #18 and #19 after any positive finding. Run #40 on any proof that feels too short.
 ---
 ### Pattern 4: Would it prove a famous open problem?
 **The check**: Specialize the claimed theorem to the most famous object in its class (ζ(s), the Ramsey number, the Collatz map). Does the specialization settle a known open problem?
 **What it catches**: A bound "for all Dirichlet series with property P" that, applied to ζ, would prove Lindelöf — the proof treated arithmetic input as generic.
 **How to run it**: Find the step where the argument uses a generic property. Ask: does ζ (or the canonical hard instance) actually have this property? The gap is always where it doesn't.
 ---
 ### Pattern 5: Outside the hypothesis class
 **The check**: For each example claimed to satisfy a theorem, re-derive the hypotheses from the definition — don't trust the label.
 **What it catches**: "f is entire of order ≤1, so by Thm 3.1…" — but Thm 3.1 needs f analytic in a *full disk* around 0; a natural boundary on the imaginary axis blocks it.
 **How to run it**: Write out the theorem's hypothesis verbatim. For each claimed instance, check inclusion from scratch. Watch for near-synonyms ("bounded" vs "bounded on the line"; "entire" vs "analytic on a domain").
 ---
 ### Pattern 6: Divergent sum behind analytic continuation
 **The check**: When a divergent-looking sum is "bounded by ζ(s)" or similar, evaluate the bounding function at the boundary of the claimed range.
 **What it catches**: "Σ 1/n ≤ ζ(1)" — but ζ(1) is a pole. The analytic continuation of a sum is not the sum.
 **How to run it**: Mentally substitute the boundary value of the parameter into the bounding expression. A pole or ∞ there means the original sum diverges, regardless of what the continued function says elsewhere.
 ---
 ### Pattern 10: Same keywords, different theorem
 **The check**: When a cited theorem has the right *words* but the fit feels off — check pointwise vs averaged, uniform vs a.e., finite vs asymptotic.
 **What it catches**: Invoking "Fourier decay ⇒ bound" for a pointwise estimate, when the cited decay theorem needs curvature and you only have it on average.
 **How to run it**: State precisely what the proof *needs* (pointwise? for all x? with what uniformity?). State what the cited theorem *gives*. Sometimes the weaker version is enough and this *closes* a gap; sometimes the gap is real.
 ---
 ### Pattern 17: Test past the first nontrivial block
 **The check**: Before accepting a pattern from small cases, identify where the structure first becomes nontrivial. Confirm the pattern holds *past* that threshold.
 **What it catches**: "Checked m = 1, 2, 3: all blocks have rank 1." But m ≤ 3 gives only 1×2 blocks — rank 1 is forced. First 2×2 appears at m = 4, and there the claim fails.
 **How to run it**: Ask "what makes the small cases easy?" Find the parameter value where that degeneracy disappears. The claim must survive at least one case beyond it.
 ---
 ### Pattern 18: Tautological reduction
 **The check**: When a reduction chain ends at "estimate X would finish it," substitute the chain's own already-proven identities into X.
 **What it catches**: "Suffices to show ∫|P|² ≤ C·H." But the chain itself proved ∫|P|² = H + 2Re(OD') *exactly*. So X is just the original conjecture plus a cosmetic shift — not a reduction.
 **How to run it**: Take each identity the chain proved along the way and plug it into the "final gap." If you recover the starting conjecture (or something at least as strong), the chain went in a circle.
 ---
 ### Pattern 19: Derived obstruction vs base obstruction
 **The check**: When the same obstruction kills 3+ independent approaches, compute the disputed property on the *original* object — before any reduction.
 **What it catches**: "det(Hessian) = 0, ruled surface, decoupling fails" — for the phase log(2πm−θ). But the *base* phase is nθ − t·log(n), and *its* Hessian has det = −1. The obstruction lived in the proxy.
 **How to run it**: Name the object the obstruction is *about*. Is it the thing you started with, or something a reduction produced? Go back to the start and check directly.
 ---
 ### Pattern 22: Absolute-sum gives O(K); compute the mean first
 **The check**: Before accepting that Σₖ Xₖ = O(1) is "too hard because |Xₖ| summed gives O(K)," compute the mean of Xₖ over the varying parameter.
 **What it catches**: Weyl equidistribution gives mean(Xₖ) = 0 *exactly*. So Σ Xₖ is a fluctuation sum — the target is Var = O(1), and half the conjecture falls in one line.
 **How to run it**: Separate Xₖ into mean + fluctuation. If orthogonality/equidistribution forces the mean to zero, you were never fighting K terms of size 1 — you were fighting √K terms (or better). Rewrite the target.
 ---
 ### Pattern 23: Formula's scope never stated
 **The check**: For any identity used in the proof, ask: was this proved for the general case, or for a special case that the author silently generalized?
 **What it catches**: "κ₄ = 3d − 1" was derived for 2-piece Cantor sets. The proof applies it to an m-piece set, where the real formula involves additive energy and can differ by a constant factor.
 **How to run it**: Trace the identity to where it was first introduced. What were the standing assumptions *there*? Check that those assumptions still hold at the point of use.
 ---
 ### Pattern 35: Count quantifiers before diagonalizing
 **The check**: Before "diagonalize against class C using property P," ask whether *certifying* P is an ∃-statement or a ∀-statement.
 **What it catches**: "Find an x not computed by any small circuit" — but verifying "no small circuit computes x" is a ∀ over circuits. Your diagonalizer is in Σ₂, not NP. (This is *why* Kannan gives Σ₂ᴾ ⊄ SIZE, not NP ⊄ SIZE.)
 **How to run it**: Write the diagonalization as a formula. Count alternations. If you need ∀∃ to describe the witness, you've jumped a level in the hierarchy.
 ---
 ### Pattern 40: One-line-proof-too-clean
 **The check**: Extract the proof's key step as a lemma in *full generality* — not specialized to the objects at hand. Try a 2×2 counterexample to the general lemma.
 **What it catches**: "rank depends only on monomial support" — but [[1,1],[1,1]] has rank 1 and [[1,1],[1,−1]] has rank 2 with the same support. The general lemma is false; the specific case holds because sgn(π) = f(S)·g(T) factors. *That's* the real proof.
 **How to run it**: If the general lemma dies but the specific conclusion survives numerically, there's hidden structure. Find it. The real proof goes through *that*, not the false lemma.
 ---
 ### Pattern 58: Quantifier direction on domain size
 **The check**: Before claiming one statement is "strictly stronger" than another because its domain is smaller — check whether the quantifier is ∀ or ∃.
 **What it catches**: "∀ S ∈ D, φ(S)" over a *smaller* D is *weaker* (fewer obligations). "∃ S ∈ D, φ(S)" over smaller D is *stronger* (fewer candidates). Backwards strength claims swap these.
 **How to run it**: Say the statement out loud with the quantifier explicit. Shrinking the domain under ∀ drops requirements. Shrinking under ∃ drops witnesses. Only one direction is "harder."
 ---
 ### Pattern 60: Easiest-interpretation trap
 **The check**: Before solving, write down 2–3 readings of the problem statement. Flag whichever one makes the problem trivial.
 **What it catches**: 63 "technically correct" solutions; only 13 "meaningfully correct." The gap: solving the easiest grammatically-valid reading instead of the intended one. Olympiad problems often *plant* an easy misreading.
 **How to run it**: Ask "under which reading is this a real problem?" If your interpretation makes it a one-liner and the problem is worth 7 points, you've probably chosen wrong. Solve the hard reading; note the easy one only as a remark.
--- a/plugins/math-olympiad/skills/math-olympiad/scripts/check_latex.sh
+++ b/plugins/math-olympiad/skills/math-olympiad/scripts/check_latex.sh
@@ -1,4 +0,0 @@
 #!/bin/bash
 # Exit 0 if a LaTeX compiler is available, 1 otherwise.
 # Used by SKILL.md to decide whether to offer PDF compilation.
 command -v pdflatex >/dev/null 2>&1 || command -v xelatex >/dev/null 2>&1
--- a/plugins/math-olympiad/skills/math-olympiad/scripts/compile_pdf.sh
+++ b/plugins/math-olympiad/skills/math-olympiad/scripts/compile_pdf.sh
@@ -1,54 +0,0 @@
 #!/bin/bash
 # Compile a LaTeX proof body into a standalone PDF.
 # Usage: compile_pdf.sh <body.tex> <output_dir>
 # The body.tex should contain just the \begin{document}...\end{document} contents
 # (theorem, proof, lemmas). This script wraps it in a minimal preamble.
 set -euo pipefail
 BODY="$1"
 OUTDIR="${2:-.}"
 BASENAME=$(basename "$BODY" .tex)
 FULL="$OUTDIR/${BASENAME}_full.tex"
 cat > "$FULL" <<'PREAMBLE'
 \documentclass[11pt]{article}
 \usepackage[margin=1.25in]{geometry}
 \usepackage{amsmath, amssymb, amsthm, mathtools}
 \usepackage[shortlabels]{enumitem}
 \usepackage{enumitem}
 \usepackage[colorlinks=true, linkcolor=blue, citecolor=blue]{hyperref}
 \theoremstyle{plain}
 \newtheorem{theorem}{Theorem}
 \newtheorem{lemma}[theorem]{Lemma}
 \newtheorem{claim}[theorem]{Claim}
 \newtheorem{proposition}[theorem]{Proposition}
 \newtheorem{corollary}[theorem]{Corollary}
 \theoremstyle{definition}
 \newtheorem{definition}[theorem]{Definition}
 \newtheorem{remark}[theorem]{Remark}
 \begin{document}
 PREAMBLE
 cat "$BODY" >> "$FULL"
 cat >> "$FULL" <<'CLOSE'
 \end{document}
 CLOSE
 if command -v pdflatex >/dev/null 2>&1; then
    COMPILER=pdflatex
 elif command -v xelatex >/dev/null 2>&1; then
    COMPILER=xelatex
 else
    echo "No LaTeX compiler found" >&2
    exit 1
 fi
 cd "$OUTDIR"
 $COMPILER -interaction=nonstopmode -halt-on-error "${BASENAME}_full.tex" >/dev/null
 $COMPILER -interaction=nonstopmode -halt-on-error "${BASENAME}_full.tex" >/dev/null
 echo "$OUTDIR/${BASENAME}_full.pdf"