feat: add optional also_consider input to adversarial review task (#1371)

Add an optional also_consider parameter that allows callers to pass domain-specific areas to keep in mind during review. This gently nudges the reviewer toward specific concerns without overriding normal analysis.

Testing showed:
- Specific items steer strongly (questions get directly answered)
- Domain-focused items shift the lens (e.g., security focus = deeper security findings)
- Vague items have minimal effect (similar to baseline)
- Single items nudge without dominating
- Contradictory items handled gracefully

Includes test cases with sample content and 10 configurations to validate the parameter behavior across different use cases.

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Brian <bmadcode@gmail.com>
2026-01-30 04:32:02 +00:00 · 2026-01-22 20:26:25 -08:00
parent c9f2dc51db
commit aad132c9b1
4 changed files with 207 additions and 0 deletions
--- a/test/adversarial-review-tests/README.md
+++ b/test/adversarial-review-tests/README.md
@@ -0,0 +1,56 @@
+# Adversarial Review Test Suite
+
+Tests for the optional `also_consider` input in `review-adversarial-general.xml`.
+
+## Purpose
+
+Evaluate whether the `also_consider` input gently nudges the reviewer toward specific areas without overriding normal adversarial analysis.
+
+## Test Content
+
+All tests use `sample-content.md` - a deliberately imperfect User Authentication API doc with:
+
+- Vague error handling section
+- Missing rate limit details
+- No token expiration info
+- Plain-text password in the example request
+- No authentication headers shown for `/logout` or `/me`
+- No error response examples
+
+## Running Tests
+
+For each test case in `test-cases.yaml`, manually invoke the adversarial review task with that case's `also_consider` items and compare the output against the TC01 baseline.
+
+### Manual Test Invocation
+
+```
+Review this content using the adversarial review task:
+
+<content>
+[paste sample-content.md]
+</content>
+
+<also_consider>
+[paste items from test case, or omit for TC01]
+</also_consider>
+```
+
+## Evaluation Criteria
+
+For each test, note:
+
+1. **Total findings** - Does the review still produce roughly 10 findings?
+2. **Distribution** - Are findings spread across concerns or clustered?
+3. **Relevance** - Do findings relate to `also_consider` items when provided?
+4. **Balance** - Are `also_consider` findings elevated over others, or naturally mixed?
+5. **Quality** - Are findings actionable regardless of source?
+
+## Expected Outcomes
+
+- **TC01 (baseline)**: Generic spread of findings
+- **TC02-TC05 (domain-focused)**: Some findings align with domain, others still organic
+- **TC06 (single item)**: Light influence, not dominant
+- **TC07 (vague items)**: Minimal change from baseline
+- **TC08 (specific items)**: Direct answers if gaps exist
+- **TC09 (mixed)**: Balanced across domains
+- **TC10 (contradictory)**: Graceful handling
--- a/test/adversarial-review-tests/sample-content.md
+++ b/test/adversarial-review-tests/sample-content.md
@@ -0,0 +1,46 @@
+# User Authentication API
+
+## Overview
+
+This API provides endpoints for user authentication and session management.
+
+## Endpoints
+
+### POST /api/auth/login
+
+Authenticates a user and returns a token.
+
+**Request Body:**
+```json
+{
+  "email": "user@example.com",
+  "password": "password123"
+}
+```
+
+**Response:**
+```json
+{
+  "token": "eyJhbGciOiJIUzI1NiIs...",
+  "user": {
+    "id": 1,
+    "email": "user@example.com"
+  }
+}
+```
+
+### POST /api/auth/logout
+
+Logs out the current user.
+
+### GET /api/auth/me
+
+Returns the current user's profile.
+
+## Error Handling
+
+Errors return appropriate HTTP status codes.
+
+## Rate Limiting
+
+Rate limiting is applied to prevent abuse.
--- a/test/adversarial-review-tests/test-cases.yaml
+++ b/test/adversarial-review-tests/test-cases.yaml
@@ -0,0 +1,103 @@
+# Test Cases for review-adversarial-general.xml with also_consider input
+#
+# Purpose: Evaluate how the optional also_consider input influences review findings
+# Content: All tests use sample-content.md (User Authentication API docs)
+#
+# To run: Manually invoke the task with each configuration and compare outputs
+
+test_cases:
+  # BASELINE - No also_consider
+  - id: TC01
+    name: "Baseline - no also_consider"
+    description: "Control test with no also_consider input"
+    also_consider: null
+    expected_behavior: "Generic adversarial findings across all aspects"
+
+  # DOCUMENTATION-FOCUSED
+  - id: TC02
+    name: "Documentation - reader confusion"
+    description: "Nudge toward documentation UX issues"
+    also_consider:
+      - What would confuse a first-time reader?
+      - What questions are left unanswered?
+      - What could be interpreted multiple ways?
+      - What jargon is unexplained?
+    expected_behavior: "More findings about clarity, completeness, reader experience"
+
+  - id: TC03
+    name: "Documentation - examples and usage"
+    description: "Nudge toward practical usage gaps"
+    also_consider:
+      - Missing code examples
+      - Unclear usage patterns
+      - Edge cases not documented
+    expected_behavior: "More findings about practical application gaps"
+
+  # SECURITY-FOCUSED
+  - id: TC04
+    name: "Security review"
+    description: "Nudge toward security concerns"
+    also_consider:
+      - Authentication vulnerabilities
+      - Token handling issues
+      - Input validation gaps
+      - Information disclosure risks
+    expected_behavior: "More security-related findings"
+
+  # API DESIGN-FOCUSED
+  - id: TC05
+    name: "API design"
+    description: "Nudge toward API design best practices"
+    also_consider:
+      - REST conventions not followed
+      - Inconsistent response formats
+      - Missing pagination or filtering
+      - Versioning concerns
+    expected_behavior: "More API design pattern findings"
+
+  # SINGLE ITEM
+  - id: TC06
+    name: "Single item - error handling"
+    description: "Test with just one also_consider item"
+    also_consider:
+      - Error handling completeness
+    expected_behavior: "Some emphasis on error handling while still covering other areas"
+
+  # BROAD/VAGUE
+  - id: TC07
+    name: "Broad items"
+    description: "Test with vague also_consider items"
+    also_consider:
+      - Quality issues
+      - Things that seem off
+    expected_behavior: "Minimal change from baseline - items too vague to steer"
+
+  # VERY SPECIFIC
+  - id: TC08
+    name: "Very specific items"
+    description: "Test with highly specific also_consider items"
+    also_consider:
+      - Is the JWT token expiration documented?
+      - Are refresh token mechanics explained?
+      - What happens on concurrent sessions?
+    expected_behavior: "Specific findings addressing these exact questions if gaps exist"
+
+  # MIXED DOMAINS
+  - id: TC09
+    name: "Mixed domain concerns"
+    description: "Test with items from different domains"
+    also_consider:
+      - Security vulnerabilities
+      - Reader confusion points
+      - API design inconsistencies
+      - Performance implications
+    expected_behavior: "Balanced findings across multiple domains"
+
+  # CONTRADICTORY/UNUSUAL
+  - id: TC10
+    name: "Contradictory items"
+    description: "Test resilience with odd inputs"
+    also_consider:
+      - Things that are too detailed
+      - Things that are not detailed enough
+    expected_behavior: "Reviewer handles gracefully, finds issues in both directions"
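
The manual invocation template in the README can be assembled programmatically. The sketch below is not part of this commit: `build_prompt` is a hypothetical helper, and the one-item-per-line `- item` layout inside `<also_consider>` is an assumption, since the README does not specify how items should be pasted. It omits the block entirely when no items are given, matching the TC01 baseline.

```python
from typing import Optional, Sequence


def build_prompt(content: str, also_consider: Optional[Sequence[str]] = None) -> str:
    """Render the README's manual invocation template (hypothetical helper)."""
    parts = [
        "Review this content using the adversarial review task:",
        "",
        "<content>",
        content.strip(),
        "</content>",
    ]
    # TC01 (baseline) has also_consider: null, so the block is left out entirely.
    if also_consider:
        parts += ["", "<also_consider>"]
        # Assumption: one "- item" line per entry; the README does not fix a format.
        parts += [f"- {item}" for item in also_consider]
        parts.append("</also_consider>")
    return "\n".join(parts)


if __name__ == "__main__":
    # Stand-in for the contents of sample-content.md.
    sample = "# User Authentication API"
    # TC06: a single also_consider item.
    print(build_prompt(sample, ["Error handling completeness"]))
```

Passing `None` or an empty list produces the baseline prompt, so the same helper could cover all ten configurations once the items from `test-cases.yaml` are loaded.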