feat: add optional also_consider input to adversarial review task (#1371)

Add an optional also_consider parameter that allows callers to pass domain-specific areas to keep in mind during review. This gently nudges the reviewer toward specific concerns without overriding normal analysis.

Testing showed:
- Specific items steer strongly (questions get directly answered)
- Domain-focused items shift the lens (e.g., security focus = deeper security findings)
- Vague items have minimal effect (similar to baseline)
- Single items nudge without dominating
- Contradictory items handled gracefully

Includes test cases with sample content and 10 configurations to validate the parameter behavior across different use cases.

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Brian <bmadcode@gmail.com>
2026-01-30 04:32:02 +00:00 · 2026-01-22 20:26:25 -08:00
parent c9f2dc51db
commit aad132c9b1
4 changed files with 207 additions and 0 deletions
--- a/src/core/tasks/review-adversarial-general.xml
+++ b/src/core/tasks/review-adversarial-general.xml
@@ -6,6 +6,8 @@
  <inputs>
    <input name="content" desc="Content to review - diff, spec, story, doc, or any artifact" />
    <input name="also_consider" required="false"
      desc="Optional areas to keep in mind during review alongside normal adversarial analysis" />
  </inputs>
  <llm critical="true">
--- a/test/adversarial-review-tests/README.md
+++ b/test/adversarial-review-tests/README.md
@@ -0,0 +1,56 @@
 # Adversarial Review Test Suite
 Tests for the `also_consider` optional input in `review-adversarial-general.xml`.
 ## Purpose
 Evaluate whether the `also_consider` input gently nudges the reviewer toward specific areas without overriding normal adversarial analysis.
 ## Test Content
 All tests use `sample-content.md` - a deliberately imperfect User Authentication API doc with:
 - Vague error handling section
 - Missing rate limit details
 - No token expiration info
 - Password in plain text example
 - Missing authentication headers
 - No error response examples
 ## Running Tests
 For each test case in `test-cases.yaml`, invoke the adversarial review task.
 ### Manual Test Invocation
 ```
 Review this content using the adversarial review task:
 <content>
 [paste sample-content.md]
 </content>
 <also_consider>
 [paste items from test case, or omit for TC01]
 </also_consider>
 ```
 ## Evaluation Criteria
 For each test, note:
 1. **Total findings** - Still hitting ~10 issues?
 2. **Distribution** - Are findings spread across concerns or clustered?
 3. **Relevance** - Do findings relate to `also_consider` items when provided?
 4. **Balance** - Are `also_consider` findings elevated over others, or naturally mixed?
 5. **Quality** - Are findings actionable regardless of source?
 ## Expected Outcomes
 - **TC01 (baseline)**: Generic spread of findings
 - **TC02-TC05 (domain-focused)**: Some findings align with domain, others still organic
 - **TC06 (single item)**: Light influence, not dominant
 - **TC07 (vague items)**: Minimal change from baseline
 - **TC08 (specific items)**: Direct answers if gaps exist
 - **TC09 (mixed)**: Balanced across domains
 - **TC10 (contradictory)**: Graceful handling
--- a/test/adversarial-review-tests/sample-content.md
+++ b/test/adversarial-review-tests/sample-content.md
@@ -0,0 +1,46 @@
 # User Authentication API
 ## Overview
 This API provides endpoints for user authentication and session management.
 ## Endpoints
 ### POST /api/auth/login
 Authenticates a user and returns a token.
 **Request Body:**
 ```json
 {
  "email": "user@example.com",
  "password": "password123"
 }
 ```
 **Response:**
 ```json
 {
  "token": "eyJhbGciOiJIUzI1NiIs...",
  "user": {
    "id": 1,
    "email": "user@example.com"
  }
 }
 ```
 ### POST /api/auth/logout
 Logs out the current user.
 ### GET /api/auth/me
 Returns the current user's profile.
 ## Error Handling
 Errors return appropriate HTTP status codes.
 ## Rate Limiting
 Rate limiting is applied to prevent abuse.
--- a/test/adversarial-review-tests/test-cases.yaml
+++ b/test/adversarial-review-tests/test-cases.yaml
@@ -0,0 +1,103 @@
 # Test Cases for review-adversarial-general.xml with also_consider input
 #
 # Purpose: Evaluate how the optional also_consider input influences review findings
 # Content: All tests use sample-content.md (User Authentication API docs)
 #
 # To run: Manually invoke the task with each configuration and compare outputs
 test_cases:
  # BASELINE - No also_consider
  - id: TC01
    name: "Baseline - no also_consider"
    description: "Control test with no also_consider input"
    also_consider: null
    expected_behavior: "Generic adversarial findings across all aspects"
  # DOCUMENTATION-FOCUSED
  - id: TC02
    name: "Documentation - reader confusion"
    description: "Nudge toward documentation UX issues"
    also_consider:
      - What would confuse a first-time reader?
      - What questions are left unanswered?
      - What could be interpreted multiple ways?
      - What jargon is unexplained?
    expected_behavior: "More findings about clarity, completeness, reader experience"
  - id: TC03
    name: "Documentation - examples and usage"
    description: "Nudge toward practical usage gaps"
    also_consider:
      - Missing code examples
      - Unclear usage patterns
      - Edge cases not documented
    expected_behavior: "More findings about practical application gaps"
  # SECURITY-FOCUSED
  - id: TC04
    name: "Security review"
    description: "Nudge toward security concerns"
    also_consider:
      - Authentication vulnerabilities
      - Token handling issues
      - Input validation gaps
      - Information disclosure risks
    expected_behavior: "More security-related findings"
  # API DESIGN-FOCUSED
  - id: TC05
    name: "API design"
    description: "Nudge toward API design best practices"
    also_consider:
      - REST conventions not followed
      - Inconsistent response formats
      - Missing pagination or filtering
      - Versioning concerns
    expected_behavior: "More API design pattern findings"
  # SINGLE ITEM
  - id: TC06
    name: "Single item - error handling"
    description: "Test with just one also_consider item"
    also_consider:
      - Error handling completeness
    expected_behavior: "Some emphasis on error handling while still covering other areas"
  # BROAD/VAGUE
  - id: TC07
    name: "Broad items"
    description: "Test with vague also_consider items"
    also_consider:
      - Quality issues
      - Things that seem off
    expected_behavior: "Minimal change from baseline - items too vague to steer"
  # VERY SPECIFIC
  - id: TC08
    name: "Very specific items"
    description: "Test with highly specific also_consider items"
    also_consider:
      - Is the JWT token expiration documented?
      - Are refresh token mechanics explained?
      - What happens on concurrent sessions?
    expected_behavior: "Specific findings addressing these exact questions if gaps exist"
  # MIXED DOMAINS
  - id: TC09
    name: "Mixed domain concerns"
    description: "Test with items from different domains"
    also_consider:
      - Security vulnerabilities
      - Reader confusion points
      - API design inconsistencies
      - Performance implications
    expected_behavior: "Balanced findings across multiple domains"
  # CONTRADICTORY/UNUSUAL
  - id: TC10
    name: "Contradictory items"
    description: "Test resilience with odd inputs"
    also_consider:
      - Things that are too detailed
      - Things that are not detailed enough
    expected_behavior: "Reviewer handles gracefully, finds issues in both directions"
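
The README describes a manual procedure: paste the sample content and the `also_consider` items of a test case into a fixed template, omitting the `also_consider` block for TC01. A minimal sketch of how that template could be assembled is shown here. It is not part of this commit. The function name `build_review_prompt`, the `TEST_CASES` list, and the choice to render each item as a `- ` bullet are assumptions made for illustration; only the template wording and the two configurations are taken from the files above.

```python
from typing import Optional, Sequence


def build_review_prompt(content: str,
                        also_consider: Optional[Sequence[str]] = None) -> str:
    """Assemble the manual invocation text from the README template.

    The <also_consider> block is left out entirely when no items are given,
    which corresponds to the TC01 baseline.
    """
    parts = [
        "Review this content using the adversarial review task:",
        "<content>",
        content.strip(),
        "</content>",
    ]
    if also_consider:
        parts.append("<also_consider>")
        # Assumption: one "- item" bullet per entry; the README does not
        # prescribe how the items are laid out inside the block.
        parts.extend(f"- {item}" for item in also_consider)
        parts.append("</also_consider>")
    return "\n".join(parts)


# Two configurations copied from test-cases.yaml; the others have the same shape.
TEST_CASES = [
    {"id": "TC01", "also_consider": None},
    {"id": "TC06", "also_consider": ["Error handling completeness"]},
]

# Short excerpt standing in for the full sample-content.md.
SAMPLE_EXCERPT = "## Rate Limiting\nRate limiting is applied to prevent abuse."

if __name__ == "__main__":
    for case in TEST_CASES:
        print(f"=== {case['id']} ===")
        print(build_review_prompt(SAMPLE_EXCERPT, case["also_consider"]))
        print()
```

Generating every prompt from one function keeps the wording identical across the 10 configurations, so differences between review outputs can be attributed to the `also_consider` items rather than to variations in how the prompt was pasted together.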