mirror of
https://github.com/bmad-code-org/BMAD-METHOD.git
synced 2026-01-30 04:32:02 +00:00
feat: add optional also_consider input to adversarial review task (#1371)
Add an optional also_consider parameter that allows callers to pass domain-specific areas to keep in mind during review. This gently nudges the reviewer toward specific concerns without overriding normal analysis.

Testing showed:
- Specific items steer strongly (questions get directly answered)
- Domain-focused items shift the lens (e.g., security focus = deeper security findings)
- Vague items have minimal effect (similar to baseline)
- Single items nudge without dominating
- Contradictory items handled gracefully

Includes test cases with sample content and 10 configurations to validate the parameter behavior across different use cases.

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Brian <bmadcode@gmail.com>
@@ -6,6 +6,8 @@

  <inputs>
    <input name="content" desc="Content to review - diff, spec, story, doc, or any artifact" />
    <input name="also_consider" required="false"
      desc="Optional areas to keep in mind during review alongside normal adversarial analysis" />
  </inputs>

  <llm critical="true">
56
test/adversarial-review-tests/README.md
Normal file
@@ -0,0 +1,56 @@
# Adversarial Review Test Suite

Tests for the `also_consider` optional input in `review-adversarial-general.xml`.

## Purpose

Evaluate whether the `also_consider` input gently nudges the reviewer toward specific areas without overriding normal adversarial analysis.

## Test Content

All tests use `sample-content.md` - a deliberately imperfect User Authentication API doc with:

- Vague error handling section
- Missing rate limit details
- No token expiration info
- Password in plain text example
- Missing authentication headers
- No error response examples

## Running Tests

For each test case in `test-cases.yaml`, invoke the adversarial review task.

### Manual Test Invocation

```
Review this content using the adversarial review task:

<content>
[paste sample-content.md]
</content>

<also_consider>
[paste items from test case, or omit for TC01]
</also_consider>
```
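
The manual invocation above can also be assembled programmatically. The following is a minimal Python sketch and is not part of this commit: `build_review_prompt` is a hypothetical helper name, and leaving out the `<also_consider>` block when no items are given is an assumption based on the "omit for TC01" note in the template.

```python
from typing import Optional, Sequence


def build_review_prompt(content: str, also_consider: Optional[Sequence[str]] = None) -> str:
    """Render the manual invocation template for one test case.

    Hypothetical helper: the wording mirrors the template above, and the
    <also_consider> block is left out when no items are supplied (TC01).
    """
    parts = [
        "Review this content using the adversarial review task:",
        "",
        "<content>",
        content.strip(),
        "</content>",
    ]
    if also_consider:  # None or an empty list both mean "baseline"
        parts += ["", "<also_consider>"]
        parts += [f"- {item}" for item in also_consider]
        parts.append("</also_consider>")
    return "\n".join(parts)


# Placeholder content; in practice this would be the text of sample-content.md.
baseline = build_review_prompt("# User Authentication API")
nudged = build_review_prompt(
    "# User Authentication API",
    ["Error handling completeness"],  # the single TC06 item
)
print(nudged)
```

Keeping the baseline and nudged prompts identical apart from the optional block makes it easier to attribute differences in findings to `also_consider` alone.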

## Evaluation Criteria

For each test, note:

1. **Total findings** - Still hitting ~10 issues?
2. **Distribution** - Are findings spread across concerns or clustered?
3. **Relevance** - Do findings relate to `also_consider` items when provided?
4. **Balance** - Are `also_consider` findings elevated over others, or naturally mixed?
5. **Quality** - Are findings actionable regardless of source?
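
To make the countable criteria above easier to compare across runs, findings could be tallied by area. The sketch below is only one possible way to record results: the `Finding` type, its hand-assigned `area` labels, and the `summarize` function are hypothetical and are not produced by the review task itself. It covers total, distribution, and the share of findings in nudged areas; balance and quality still need human judgment.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Finding:
    text: str
    area: str  # hand-assigned label such as "security" (hypothetical scheme)


def summarize(findings: List[Finding], nudged_areas: Set[str]) -> Dict[str, object]:
    """Tally total findings, per-area distribution, and the nudged share."""
    distribution = Counter(f.area for f in findings)
    nudged = sum(count for area, count in distribution.items() if area in nudged_areas)
    share = nudged / len(findings) if findings else 0.0
    return {
        "total": len(findings),
        "distribution": dict(distribution),
        "nudged_share": share,
    }


# Made-up findings for illustration only, not real review output.
example = [
    Finding("Token expiration is not documented", "security"),
    Finding("Error section lists no status codes", "errors"),
    Finding("Password shown in plain text", "security"),
    Finding("Rate limits are unspecified", "operations"),
]
report = summarize(example, nudged_areas={"security"})
print(report)
```

Comparing `nudged_share` for a nudged run against the same areas in the TC01 baseline gives a rough sense of how strongly the input steered the review.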

## Expected Outcomes

- **TC01 (baseline)**: Generic spread of findings
- **TC02-TC05 (domain-focused)**: Some findings align with domain, others still organic
- **TC06 (single item)**: Light influence, not dominant
- **TC07 (vague items)**: Minimal change from baseline
- **TC08 (specific items)**: Direct answers if gaps exist
- **TC09 (mixed)**: Balanced across domains
- **TC10 (contradictory)**: Graceful handling
46
test/adversarial-review-tests/sample-content.md
Normal file
@@ -0,0 +1,46 @@
# User Authentication API

## Overview

This API provides endpoints for user authentication and session management.

## Endpoints

### POST /api/auth/login

Authenticates a user and returns a token.

**Request Body:**
```json
{
  "email": "user@example.com",
  "password": "password123"
}
```

**Response:**
```json
{
  "token": "eyJhbGciOiJIUzI1NiIs...",
  "user": {
    "id": 1,
    "email": "user@example.com"
  }
}
```

### POST /api/auth/logout

Logs out the current user.

### GET /api/auth/me

Returns the current user's profile.

## Error Handling

Errors return appropriate HTTP status codes.

## Rate Limiting

Rate limiting is applied to prevent abuse.
103
test/adversarial-review-tests/test-cases.yaml
Normal file
@@ -0,0 +1,103 @@
# Test Cases for review-adversarial-general.xml with also_consider input
#
# Purpose: Evaluate how the optional also_consider input influences review findings
# Content: All tests use sample-content.md (User Authentication API docs)
#
# To run: Manually invoke the task with each configuration and compare outputs

test_cases:
  # BASELINE - No also_consider
  - id: TC01
    name: "Baseline - no also_consider"
    description: "Control test with no also_consider input"
    also_consider: null
    expected_behavior: "Generic adversarial findings across all aspects"

  # DOCUMENTATION-FOCUSED
  - id: TC02
    name: "Documentation - reader confusion"
    description: "Nudge toward documentation UX issues"
    also_consider:
      - What would confuse a first-time reader?
      - What questions are left unanswered?
      - What could be interpreted multiple ways?
      - What jargon is unexplained?
    expected_behavior: "More findings about clarity, completeness, reader experience"

  - id: TC03
    name: "Documentation - examples and usage"
    description: "Nudge toward practical usage gaps"
    also_consider:
      - Missing code examples
      - Unclear usage patterns
      - Edge cases not documented
    expected_behavior: "More findings about practical application gaps"

  # SECURITY-FOCUSED
  - id: TC04
    name: "Security review"
    description: "Nudge toward security concerns"
    also_consider:
      - Authentication vulnerabilities
      - Token handling issues
      - Input validation gaps
      - Information disclosure risks
    expected_behavior: "More security-related findings"

  # API DESIGN-FOCUSED
  - id: TC05
    name: "API design"
    description: "Nudge toward API design best practices"
    also_consider:
      - REST conventions not followed
      - Inconsistent response formats
      - Missing pagination or filtering
      - Versioning concerns
    expected_behavior: "More API design pattern findings"

  # SINGLE ITEM
  - id: TC06
    name: "Single item - error handling"
    description: "Test with just one also_consider item"
    also_consider:
      - Error handling completeness
    expected_behavior: "Some emphasis on error handling while still covering other areas"

  # BROAD/VAGUE
  - id: TC07
    name: "Broad items"
    description: "Test with vague also_consider items"
    also_consider:
      - Quality issues
      - Things that seem off
    expected_behavior: "Minimal change from baseline - items too vague to steer"

  # VERY SPECIFIC
  - id: TC08
    name: "Very specific items"
    description: "Test with highly specific also_consider items"
    also_consider:
      - Is the JWT token expiration documented?
      - Are refresh token mechanics explained?
      - What happens on concurrent sessions?
    expected_behavior: "Specific findings addressing these exact questions if gaps exist"

  # MIXED DOMAINS
  - id: TC09
    name: "Mixed domain concerns"
    description: "Test with items from different domains"
    also_consider:
      - Security vulnerabilities
      - Reader confusion points
      - API design inconsistencies
      - Performance implications
    expected_behavior: "Balanced findings across multiple domains"

  # CONTRADICTORY/UNUSUAL
  - id: TC10
    name: "Contradictory items"
    description: "Test resilience with odd inputs"
    also_consider:
      - Things that are too detailed
      - Things that are not detailed enough
    expected_behavior: "Reviewer handles gracefully, finds issues in both directions"
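
Because these cases are run by hand, a small structural check might catch malformed entries before a run. The Python sketch below is illustrative only and not part of this commit: `validate_test_cases` is a hypothetical helper, the rule that `also_consider` is either null or a non-empty list of strings is inferred from the entries above, and the function works on already-parsed data (what a YAML loader would return) rather than reading the file itself.

```python
from typing import Any, Dict, List

# Keys that every entry in the file above carries.
REQUIRED_KEYS = {"id", "name", "description", "also_consider", "expected_behavior"}


def validate_test_cases(cases: List[Dict[str, Any]]) -> List[str]:
    """Return a list of problems; an empty list means the cases look well-formed."""
    problems: List[str] = []
    seen_ids = set()
    for case in cases:
        case_id = case.get("id", "<missing id>")
        missing = REQUIRED_KEYS - set(case)
        if missing:
            problems.append(f"{case_id}: missing keys {sorted(missing)}")
        if case_id in seen_ids:
            problems.append(f"{case_id}: duplicate id")
        seen_ids.add(case_id)
        items = case.get("also_consider")
        # null is valid (baseline); otherwise expect a non-empty list of strings
        if items is not None and not (
            isinstance(items, list) and items and all(isinstance(i, str) for i in items)
        ):
            problems.append(f"{case_id}: also_consider must be null or a list of strings")
    return problems


# Two entries copied from the file above, written as parsed data.
cases = [
    {
        "id": "TC01",
        "name": "Baseline - no also_consider",
        "description": "Control test with no also_consider input",
        "also_consider": None,
        "expected_behavior": "Generic adversarial findings across all aspects",
    },
    {
        "id": "TC06",
        "name": "Single item - error handling",
        "description": "Test with just one also_consider item",
        "also_consider": ["Error handling completeness"],
        "expected_behavior": "Some emphasis on error handling while still covering other areas",
    },
]
problems = validate_test_cases(cases)
print(problems or "all test cases look well-formed")
```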