Unquoted colon in 'calibrated confidence: will say' broke strict YAML parsing. CC's parser is lenient but fix for robustness.
18 KiB
name, description, version
| name | description | version |
|---|---|---|
| math-olympiad | Solve competition math problems (IMO, Putnam, USAMO, AIME) with adversarial verification that catches the errors self-verification misses. Activates when asked to 'solve this IMO problem', 'prove this olympiad inequality', 'verify this competition proof', 'find a counterexample', 'is this proof correct', or for any problem with 'IMO', 'Putnam', 'USAMO', 'olympiad', or 'competition math' in it. Uses pure reasoning (no tools) — then a fresh-context adversarial verifier attacks the proof using specific failure patterns, not generic 'check logic'. Outputs calibrated confidence — will say 'no confident solution' rather than bluff. If LaTeX is available, produces a clean PDF after verification passes. | 0.1.0 |
Math Olympiad Solver
The five things that change outcomes
- Strip thinking before verifying — a verifier that sees the reasoning is biased toward agreement. Fresh context, cleaned proof only.
- "Does this prove RH?" — if your theorem's specialization to ζ is a famous open problem, you have a gap. Most reliable red flag.
- Short proof → extract the general lemma — try 2×2 counterexamples. If general form is false, find what's special about THIS instance.
- Same gap twice → step back — the case split may be obscuring a unified argument. Three lines sometimes does what twelve pages couldn't.
- Say "no confident solution" — wrong-and-confident is worse than honest abstain.
Tool policy: Solvers and verifiers use THINKING ONLY in the tight-budget workflow. Competition math is reasoning. Computation is for deep mode (§6c), and even then bounded — a recurrence that's doubly-exponential can't be computed past n~30, work mod 2^m instead.
When to use which approach
| Problem | Approach | Verification |
|---|---|---|
| AIME numeric answer | Best-of-N → majority vote | Answer check only |
| Olympiad proof (IMO/Putnam/USAMO) | Full workflow below | 5-pass adversarial |
| "Is this proof correct?" | Skip to verification (step 4) | Adversarial + spec-gaming |
| Full problem set (e.g. all 6 from a competition) | Sequential: one full workflow per problem, collect results, compile single PDF | Per-problem adversarial |
Batch in one Workflow: Set opts.label on every agent() call to include the problem ID (e.g., label: "P3:solver:2"). Without labels, 36 results come back with no problem association. Run problems in parallel — the label is what matters, not ordering.
For a full problem set
Launch one solver workflow per problem (same VERBATIM prompt, different statement). Run them in parallel. When all return, run adversarial verification per problem. Problems that pass get their proof in the PDF; problems that abstain get "No confident solution" with partial notes.
Don't try to solve all N problems in one agent's context — each problem needs its own thinking budget and its own fresh-context verifier. The composition is mechanical: collect the per-problem outputs, fill in LaTeX sections, compile once. | "Simplify this proof" | Skip to presentation (step 8) | — |
The Workflow
1. Interpretation check (30 seconds, catches 50/63 of one class of errors)
Before solving anything, identify the interpretation.
Read the problem statement. List 2-3 ways it could be interpreted. For each: is this reading TRIVIAL? If one reading makes the problem easy and another makes it hard, the hard one is almost certainly intended. State which interpretation you're solving and WHY you believe it's the intended one.
The Aletheia case study found 50 of 63 "technically correct" solutions were for the wrong interpretation. Olympiad problems often have a trap easy reading.
2. Generate candidates with internal refinement (parallel, thinking only)
Launch 8-12 attempt agents in parallel. Each agent internally iterates — solve → self-improve → self-verify → correct → repeat. This is the Yang-Huang structure that achieves 85.7% on IMO: one-shot solving isn't enough; per-attempt refinement matters.
The Agent tool cannot enforce tool restriction. Subagents get the full tool set. The only mechanism is the prompt. Use this prompt VERBATIM — do not summarize, do not synthesize your own:
NO COMPUTATION. Do not use Bash, Python, WebSearch, Read, Write, or any tool that runs code or fetches data. Numerical verification is not a proof step. "I computed n=1..10 and the pattern holds" is not a proof.
(If your agent harness requires a StructuredOutput or similar return-mechanism tool call, that is NOT a computation tool — call it to return your answer. The restriction is on tools that DO work, not tools that REPORT work.)
Your internal process (iterate until done):
- Solve: Complete rigorous solution.
- Self-improve: Reread. Fix gaps before a grader sees it.
- Self-verify: Strict grader mode. Every step justified?
- Correct: Fix and re-verify. Up to 5 rounds.
- Stop: Self-verify passes twice clean, OR 5 rounds, OR approach fundamentally wrong.
A correct answer from flawed reasoning is a failure. If incomplete, say so honestly. Never hide gaps.
PROBLEM: <insert the problem statement here>
ANGLE: <insert one starting angle here>
The first two paragraphs are load-bearing. A session that writes its own prompt and omits them will produce subagents that grind Python for 30 iterations and confidently get wrong answers — a pattern that fits n≤10 but fails at n=100 is not a proof.
Starting angles (vary across agents — see references/solver_heuristics.md):
- Work out small cases (test past n=3)
- Look for an invariant or monovariant
- Consider the extremal case
- Try induction
- What symmetries?
- Work backwards
- Drop a condition — where does it become trivially false?
- Generalize (inventor's paradox — more structure is sometimes easier)
Each returns its FINAL state (not intermediate rounds):
**Verdict**: complete solution | partial result | no progress
**Rounds**: [how many verify→correct cycles]
**Method**: [key idea, one paragraph]
**Detailed Solution**: [full step-by-step, every step justified]
**Answer**: [if applicable]
**Self-verification notes**: [what you caught and fixed; remaining concerns]
Retry policy: If an agent fails or times out, retry once. Transient failures happen.
3. Clean the solution (context isolation — the #1 lever)
The thinking trace biases the verifier toward agreement — a long chain of reasoning reads as supporting evidence even when the conclusion is wrong. Before any verification, strip:
- All thinking-block content
- All "Let me try..." / "Actually wait..." / "Hmm" prose
- All false starts and backtracking
What remains: problem statement + clean final argument only.
Extract only the Method + Proof + Answer sections from each solver's output. The verifier never sees how the solver got there.
4. Adversarial verify (fresh context, pattern-armed)
For each cleaned solution, launch a fresh verifier agent. Fresh context: it sees only (problem statement + cleaned solution). No tools.
The verifier's job is to ATTACK, not grade. Load references/adversarial_prompts.md for the prompts. The key patterns it runs:
| Pattern | The check |
|---|---|
| #4 | Does this theorem specialize to a famous object (ζ, quadratic reciprocity, etc.) and prove something open about it? → gap |
| #18 | Substitute the proof's own intermediate identities into any "remaining gap." Recover the original claim? → tautological |
| #40 | Is any step a "one-line lemma"? Extract the GENERAL form. Find a 2×2 counterexample. If the general form is false, find what special structure saves THIS instance |
| #5 | For each invoked theorem: re-check hypotheses FROM SCRATCH. "Continuous on [0,1]" ≠ "continuous on ℝ" |
| #6 | Any infinite sum "bounded" via a regularized value? Check the boundary — if there's a pole there, the sum diverges |
Full pattern list: references/verifier_patterns.md
Verifier returns:
**Verdict**: HOLDS | HOLE FOUND | UNCLEAR
**If HOLE FOUND**:
- Location: [quote the problematic step]
- Pattern: [which check fired, or "other"]
- Why it breaks: [specific]
- Fixable?: [yes with X / no, fundamental]
5. Rank and vote-verify (asymmetric + early exit)
Rank solutions by (verdict, verifier confidence). Take the top one. Run up to 5 fresh verifier agents.
Asymmetric thresholds: 4 HOLDS to confirm, 2 HOLE FOUND to refute. Why asymmetric: one flaky verifier shouldn't kill a correct proof; but two independent dissents is a real signal.
Pigeonhole early exit: stop launching verifiers once the outcome is decided.
- 2 say HOLE FOUND → refuted, stop (save the remaining 3 calls)
- 4 say HOLDS → confirmed, stop (save the 5th)
- After 3 verifiers: if 2 HOLDS + 1 HOLE, launch 2 more (outcome undecided). If 3 HOLDS + 0 HOLE, launch 1 more (could still hit 4-1).
Dual context-isolation: each verifier is blind to (a) the solver's thinking trace — already stripped in step 3 — AND (b) other verifiers' verdicts. Each verifier thinks it's the first. No "3 agents already confirmed this" social proof.
A solver cannot verify its own solution. Different agent, fresh context.
5b. When one case won't close — step back before grinding
If a proof splits into cases and one case proves easily but the other resists: before grinding through the hard case, ask whether there's a route that makes the split disappear.
The pattern that saves you: the hard case's very hypothesis often implies something strong about an intermediate object you haven't looked at. Use that implication directly instead of the original chain.
Concrete shape: proving f(n) ≤ cn for a constrained function f, with a case split on a prime p dividing f(n). One branch closes by index arguments in (ℤ/p^e)*. The other branch resists — same group structure, but the arithmetic doesn't contradict. The fix: the hypothesis "p | f(n)" plugged back into the governing equation implies f(p) = p itself. Once you have that, a Fermat+Dirichlet argument kills both branches in three lines. The case split was a detour — it was splitting on a variable that, under the hypothesis, takes a known value.
Check when stuck on case B:
- What does case B's hypothesis imply about f at other inputs?
- Is there a different pair (a,b) to plug into the governing equation?
- Are you proving too much? (A cleaner contradiction needs less machinery.)
This is also a presentation-pass win: the split-free proof is shorter AND more general.
6. Revise (if needed)
If verification finds a hole: launch a reviser agent. It gets (cleaned solution + verifier's hole report). STILL no access to the original thinking — the reviser works from the hole, not by rereading how you got there.
A verifier found this issue in the proof:
[hole report]
Fix the proof. If the hole is fundamental (the approach doesn't work), say so and return **Verdict: no confident solution** with what partial progress remains.
For any step you cannot fully close, mark it inline: [GAP: specific description of what remains]. Gaps in the proof text, not in a separate list — they're greppable and the next reviser knows exactly where to look.
Up to 3 revise cycles. Then re-run the vote on the revised proof.
If pattern #40 fired (one-line-proof-too-clean), the reviser gets a stronger brief — the Adversarial Brief template from references/adversarial_prompts.md §7. It forces a binary: "the general lemma is obviously false (here's a 2×2 counterexample) — so either find what's special about THIS case, or find where the proof breaks." Can't return "looks fine."
6c. Deep mode (when tight-budget abstains)
The standard workflow is tight-budget: 8 solvers, ~15 min, pure reasoning. When it abstains, the problem may need more time, not more capability.
Deep mode is a single focused agent with:
- Unlimited time — no wall-clock pressure
- Targeted computation allowed — modular arithmetic checks, small-case enumeration, symbolic verification of identities. NOT exploratory brute force or unbounded recursion.
- The abstention reason as starting point — if verifiers found a specific gap, start there. If solvers never claimed complete, start from what they partially proved.
The archetype: a focused agent that gets the proven-so-far state plus "one case of Lemma 5 is open" — and finds a 3-line argument the case split was obscuring. Often under 10 minutes with almost no computation. Deep mode is about giving the problem sustained attention, not throwing compute at it.
What deep mode is NOT: open-ended exploration, literature search, looking up solutions, multi-day investigation. That's a different workflow (math-research). Deep mode is still "solve THIS problem yourself" — just without the clock.
NO WEB. NO LOOKUP. Deep mode may use Bash/Python for bounded computation, but NEVER WebFetch, WebSearch, or any network access. Finding the solution on AoPS or a blog is not solving the problem — it's cheating on an olympiad, and it teaches us nothing about the skill's actual capability. Put this at the TOP of the deep-mode prompt:
NO WEB ACCESS. Do not use WebFetch, WebSearch, or any tool that touches the internet. Do not look up this problem, its solution, or related problems. You are solving this yourself — the only allowed computation is local (Bash/Python for mod-k arithmetic, small-case enumeration n≤10, symbolic identity checks). If you invoke a web tool, the proof is void.
Computation bounds in deep mode (bug #8 lesson): A6's b_{n+1}=2b_n²+b_n+1 is doubly-exponential; b_99 has ~10^{2^98} digits. Never compute such objects exactly — work in ℤ/2^m, or track only v_p(·), or prove the recursion mod the quantity you care about. If a computation is running longer than 60 seconds, it's probably unbounded. Kill it and work symbolically.
Step 6d (not optional): After any ABSTAIN at the verify stage, automatically launch one deep-mode agent before writing the abstention into the output. Give it:
- The problem statement
- The best partial proof from tight-budget solvers
- The verifier gap descriptions (what specifically didn't close)
- The instruction: "NO WEB ACCESS — do not look up this problem or its solution. Bounded local computation allowed (mod 2^k, small cases n≤10, symbolic identity checks via Bash/Python only). 60-second computation limit. If n≤10 brute force reveals a pattern the tight-budget solvers missed, that pattern IS the proof structure."
The deep agent may find the construction the pure-reasoning solvers couldn't see. If it also abstains, THEN write the abstention. Do not skip this step — problems with √n or log n answers are often invisible to pure reasoning because the optimal structure is the asymmetric one.
Orchestrator self-restraint: The orchestrator itself must not web-search the problem "to help" the deep agent. If you're tempted to Fetch an AoPS thread "just to check the answer," don't — that contaminates the skill's output and misrepresents its capability.
7. Calibrated abstention
If 3 revise cycles all fail: stop and admit it.
**Verdict**: no confident solution
**What was tried**: [approaches]
**What WAS proven**: [any lemma or partial result that survived verification]
**Where it breaks**: [the unfixed hole]
Do NOT guess. A wrong confident answer is worse than an honest "couldn't solve it." The metric that matters is CONDITIONAL accuracy — when you say "solved," are you right?
8. Presentation pass (after correctness is established)
A VERIFIED-CORRECT proof is often not a BEAUTIFUL proof. The order you discovered it is rarely the best order to present it. Launch a fresh presentation agent with the verified proof.
Load references/presentation_prompts.md. The agent asks:
- What's the simplest way to say this?
- Which lemmas should be inlined? Which deserve to stand alone?
- Is anything OVERKILL? (constructing a double exponential when linear suffices)
- Now that we know the answer, is there a 3-line hindsight proof?
Output: LaTeX-formatted proof. If pdflatex is available (scripts/check_latex.sh returns 0), also compile to PDF via scripts/compile_pdf.sh.
Model tier defaults
Read references/model_tier_defaults.md for full details. Summary:
| Model | Solvers | Verify passes | Abstain after | Presentation |
|---|---|---|---|---|
| Haiku 4.5 | 8 | 3 | 2 revise fails | skip |
| Sonnet 4.6 | 4 | 5 | 3 revise fails | yes |
| Opus 4.6 / Capybara | 3 | 5 + full pattern sweep | 4 revise fails | 2 drafts, pick cleaner |
Weaker models: more parallel attempts, faster abstention. Stronger models: deeper verification, more presentation effort.
For numeric-answer problems (AIME-style)
Skip the proof machinery. Run 5-7 solvers with varied approaches, take majority vote on the numeric answer. If no majority: verify the top 2 candidates by substitution.
Key references
references/verifier_patterns.md— the 12 adversarial checksreferences/adversarial_prompts.md— ready-to-use verifier promptsreferences/presentation_prompts.md— beautification prompts + LaTeX templatereferences/model_tier_defaults.md— per-model configuration
What makes this different from generic verify-and-refine
- Dual context isolation: verifier is blind to (a) the solver's thinking trace — which biases toward agreement — and (b) other verifiers' verdicts — social proof also biases. Each verifier thinks it's first.
- Pattern-specific attacks: not "is this correct?" but "does this make the #40 mistake? the #4 mistake?" Specific beats generic. The 7-category refutation taxonomy gives the verifier a checklist.
- Asymmetric vote + pigeonhole exit: 4-to-confirm, 2-to-refute. One flaky verifier doesn't kill a correct proof; two dissents does. Stop launching verifiers once the outcome is decided — saves ~30% of verification cost on clear cases.
- Specification-gaming check first: explicitly asks "is this the intended interpretation?" before solving. The #1 failure mode in prior work (50/63 "correct" answers solved the wrong reading).
- Calibrated abstention: will say "no confident solution" with partial results. Optimizes conditional accuracy, not coverage.
- Presentation pass: correctness and elegance are separate steps. The presentation agent gets the VERIFIED proof and finds the cleanest way to say it.