diff --git a/src/core/tasks/review-adversarial-general.xml b/src/core/tasks/review-adversarial-general.xml
index 4e68ff9a..4af77a66 100644
--- a/src/core/tasks/review-adversarial-general.xml
+++ b/src/core/tasks/review-adversarial-general.xml
@@ -6,6 +6,8 @@
+
diff --git a/test/adversarial-review-tests/README.md b/test/adversarial-review-tests/README.md
new file mode 100644
index 00000000..8d2af507
--- /dev/null
+++ b/test/adversarial-review-tests/README.md
@@ -0,0 +1,56 @@
+# Adversarial Review Test Suite
+
+Tests for the `also_consider` optional input in `review-adversarial-general.xml`.
+
+## Purpose
+
+Evaluate whether the `also_consider` input gently nudges the reviewer toward specific areas without overriding normal adversarial analysis.
+
+## Test Content
+
+All tests use `sample-content.md` - a deliberately imperfect User Authentication API doc with:
+
+- Vague error handling section
+- Missing rate limit details
+- No token expiration info
+- Password in plain text example
+- Missing authentication headers
+- No error response examples
+
+## Running Tests
+
+For each test case in `test-cases.yaml`, invoke the adversarial review task.
+
+### Manual Test Invocation
+
+```
+Review this content using the adversarial review task:
+
+
+[paste sample-content.md]
+
+
+
+[paste items from test case, or omit for TC01]
+
+```
+
+## Evaluation Criteria
+
+For each test, note:
+
+1. **Total findings** - Still hitting ~10 issues?
+2. **Distribution** - Are findings spread across concerns or clustered?
+3. **Relevance** - Do findings relate to `also_consider` items when provided?
+4. **Balance** - Are `also_consider` findings elevated over others, or naturally mixed?
+5. **Quality** - Are findings actionable regardless of source?
+
+## Expected Outcomes
+
+- **TC01 (baseline)**: Generic spread of findings
+- **TC02-TC05 (domain-focused)**: Some findings align with domain, others still organic
+- **TC06 (single item)**: Light influence, not dominant
+- **TC07 (vague items)**: Minimal change from baseline
+- **TC08 (specific items)**: Direct answers if gaps exist
+- **TC09 (mixed)**: Balanced across domains
+- **TC10 (contradictory)**: Graceful handling
diff --git a/test/adversarial-review-tests/sample-content.md b/test/adversarial-review-tests/sample-content.md
new file mode 100644
index 00000000..a821096d
--- /dev/null
+++ b/test/adversarial-review-tests/sample-content.md
@@ -0,0 +1,46 @@
+# User Authentication API
+
+## Overview
+
+This API provides endpoints for user authentication and session management.
+
+## Endpoints
+
+### POST /api/auth/login
+
+Authenticates a user and returns a token.
+
+**Request Body:**
+```json
+{
+  "email": "user@example.com",
+  "password": "password123"
+}
+```
+
+**Response:**
+```json
+{
+  "token": "eyJhbGciOiJIUzI1NiIs...",
+  "user": {
+    "id": 1,
+    "email": "user@example.com"
+  }
+}
+```
+
+### POST /api/auth/logout
+
+Logs out the current user.
+
+### GET /api/auth/me
+
+Returns the current user's profile.
+
+## Error Handling
+
+Errors return appropriate HTTP status codes.
+
+## Rate Limiting
+
+Rate limiting is applied to prevent abuse.
diff --git a/test/adversarial-review-tests/test-cases.yaml b/test/adversarial-review-tests/test-cases.yaml
new file mode 100644
index 00000000..7f20e84f
--- /dev/null
+++ b/test/adversarial-review-tests/test-cases.yaml
@@ -0,0 +1,103 @@
+# Test Cases for review-adversarial-general.xml with also_consider input
+#
+# Purpose: Evaluate how the optional also_consider input influences review findings
+# Content: All tests use sample-content.md (User Authentication API docs)
+#
+# To run: Manually invoke the task with each configuration and compare outputs
+
+test_cases:
+  # BASELINE - No also_consider
+  - id: TC01
+    name: "Baseline - no also_consider"
+    description: "Control test with no also_consider input"
+    also_consider: null
+    expected_behavior: "Generic adversarial findings across all aspects"
+
+  # DOCUMENTATION-FOCUSED
+  - id: TC02
+    name: "Documentation - reader confusion"
+    description: "Nudge toward documentation UX issues"
+    also_consider:
+      - What would confuse a first-time reader?
+      - What questions are left unanswered?
+      - What could be interpreted multiple ways?
+      - What jargon is unexplained?
+    expected_behavior: "More findings about clarity, completeness, reader experience"
+
+  - id: TC03
+    name: "Documentation - examples and usage"
+    description: "Nudge toward practical usage gaps"
+    also_consider:
+      - Missing code examples
+      - Unclear usage patterns
+      - Edge cases not documented
+    expected_behavior: "More findings about practical application gaps"
+
+  # SECURITY-FOCUSED
+  - id: TC04
+    name: "Security review"
+    description: "Nudge toward security concerns"
+    also_consider:
+      - Authentication vulnerabilities
+      - Token handling issues
+      - Input validation gaps
+      - Information disclosure risks
+    expected_behavior: "More security-related findings"
+
+  # API DESIGN-FOCUSED
+  - id: TC05
+    name: "API design"
+    description: "Nudge toward API design best practices"
+    also_consider:
+      - REST conventions not followed
+      - Inconsistent response formats
+      - Missing pagination or filtering
+      - Versioning concerns
+    expected_behavior: "More API design pattern findings"
+
+  # SINGLE ITEM
+  - id: TC06
+    name: "Single item - error handling"
+    description: "Test with just one also_consider item"
+    also_consider:
+      - Error handling completeness
+    expected_behavior: "Some emphasis on error handling while still covering other areas"
+
+  # BROAD/VAGUE
+  - id: TC07
+    name: "Broad items"
+    description: "Test with vague also_consider items"
+    also_consider:
+      - Quality issues
+      - Things that seem off
+    expected_behavior: "Minimal change from baseline - items too vague to steer"
+
+  # VERY SPECIFIC
+  - id: TC08
+    name: "Very specific items"
+    description: "Test with highly specific also_consider items"
+    also_consider:
+      - Is the JWT token expiration documented?
+      - Are refresh token mechanics explained?
+      - What happens on concurrent sessions?
+    expected_behavior: "Specific findings addressing these exact questions if gaps exist"
+
+  # MIXED DOMAINS
+  - id: TC09
+    name: "Mixed domain concerns"
+    description: "Test with items from different domains"
+    also_consider:
+      - Security vulnerabilities
+      - Reader confusion points
+      - API design inconsistencies
+      - Performance implications
+    expected_behavior: "Balanced findings across multiple domains"
+
+  # CONTRADICTORY/UNUSUAL
+  - id: TC10
+    name: "Contradictory items"
+    description: "Test resilience with odd inputs"
+    also_consider:
+      - Things that are too detailed
+      - Things that are not detailed enough
+    expected_behavior: "Reviewer handles gracefully, finds issues in both directions"
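
Review note, not part of the diff: the README's Evaluation Criteria (total findings, distribution, balance) are compared by hand across runs. A minimal Python sketch of how one run could be tallied is given here, assuming each finding has already been labeled manually with a category. The function name `summarize_findings`, the `(category, text)` tuple format, and the category labels are all hypothetical and do not appear in this change.

```python
from collections import Counter


def summarize_findings(findings, nudged_categories):
    """Tally one review run against the README's evaluation criteria.

    findings: list of (category, text) tuples, labeled by hand
        (hypothetical format, not produced by the review task itself).
    nudged_categories: set of categories targeted by also_consider
        (empty for the TC01 baseline).
    """
    counts = Counter(category for category, _ in findings)
    total = sum(counts.values())
    nudged = sum(n for category, n in counts.items() if category in nudged_categories)
    return {
        "total": total,                 # criterion 1: total findings
        "distribution": dict(counts),   # criterion 2: spread vs. clustering
        # criteria 3 and 4: share of findings that match also_consider
        "nudged_share": nudged / total if total else 0.0,
    }


# Example with made-up labels for a security-focused run such as TC04.
example = [
    ("security", "Plain-text password in the login example"),
    ("security", "Token expiration is not documented"),
    ("errors", "No error response examples"),
    ("rate-limiting", "Rate limit thresholds are missing"),
]
summary = summarize_findings(example, {"security"})
print(summary["total"], summary["nudged_share"])
```

Comparing `nudged_share` for a nudged run against the TC01 baseline gives a rough number for the "Balance" criterion. A share close to 1.0 would suggest `also_consider` is overriding the normal analysis rather than nudging it.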