Mirror of https://github.com/bmad-code-org/BMAD-METHOD.git (synced 2026-01-30 04:32:02 +00:00)
feat: add optional also_consider input to adversarial review task (#1371)
Add an optional also_consider parameter that allows callers to pass domain-specific areas to keep in mind during review. This gently nudges the reviewer toward specific concerns without overriding normal analysis.

Testing showed:
- Specific items steer strongly (questions get directly answered)
- Domain-focused items shift the lens (e.g., security focus = deeper security findings)
- Vague items have minimal effect (similar to baseline)
- Single items nudge without dominating
- Contradictory items handled gracefully

Includes test cases with sample content and 10 configurations to validate the parameter behavior across different use cases.

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Brian <bmadcode@gmail.com>
@@ -6,6 +6,8 @@

  <inputs>
    <input name="content" desc="Content to review - diff, spec, story, doc, or any artifact" />
    <input name="also_consider" required="false"
      desc="Optional areas to keep in mind during review alongside normal adversarial analysis" />
  </inputs>

  <llm critical="true">
test/adversarial-review-tests/README.md (new file, 56 lines)
@@ -0,0 +1,56 @@
# Adversarial Review Test Suite

Tests for the `also_consider` optional input in `review-adversarial-general.xml`.

## Purpose

Evaluate whether the `also_consider` input gently nudges the reviewer toward specific areas without overriding normal adversarial analysis.

## Test Content

All tests use `sample-content.md` - a deliberately imperfect User Authentication API doc with:

- Vague error handling section
- Missing rate limit details
- No token expiration info
- Password in plain text example
- Missing authentication headers
- No error response examples

## Running Tests

For each test case in `test-cases.yaml`, invoke the adversarial review task.

### Manual Test Invocation

```
Review this content using the adversarial review task:

<content>
[paste sample-content.md]
</content>

<also_consider>
[paste items from test case, or omit for TC01]
</also_consider>
```
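
The manual invocation above could also be assembled programmatically. The following is a minimal Python sketch and is not part of the commit: `build_review_prompt` is a hypothetical helper, and rendering the items as a dash-prefixed list is an assumption, since the template only says to paste them.

```python
from typing import Optional, Sequence


def build_review_prompt(content: str, also_consider: Optional[Sequence[str]] = None) -> str:
    """Render the manual invocation prompt for one test case.

    Hypothetical helper: the <also_consider> block is omitted entirely when
    no items are given, mirroring the "omit for TC01" note in the template.
    """
    parts = [
        "Review this content using the adversarial review task:",
        "",
        "<content>",
        content.strip(),
        "</content>",
    ]
    if also_consider:
        # Assumption: one item per line, dash-prefixed.
        items = "\n".join(f"- {item}" for item in also_consider)
        parts += ["", "<also_consider>", items, "</also_consider>"]
    return "\n".join(parts)
```

Calling `build_review_prompt(text)` would give the TC01 baseline prompt, while `build_review_prompt(text, ["Error handling completeness"])` would correspond to TC06.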

## Evaluation Criteria

For each test, note:

1. **Total findings** - Still hitting ~10 issues?
2. **Distribution** - Are findings spread across concerns or clustered?
3. **Relevance** - Do findings relate to `also_consider` items when provided?
4. **Balance** - Are `also_consider` findings elevated over others, or naturally mixed?
5. **Quality** - Are findings actionable regardless of source?
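
Criteria 1 to 3 can be tallied mechanically once each finding has been labelled by hand. A minimal sketch, assuming a hypothetical `Finding` record with a manually assigned `category` and a `matches_also_consider` flag (neither is defined by the review task itself):

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Finding:
    text: str
    category: str                # hand-assigned label, e.g. "security"
    matches_also_consider: bool  # hand-assigned: relates to an also_consider item?


def summarize(findings: List[Finding]) -> Dict[str, object]:
    """Tally total count, per-category distribution, and also_consider share."""
    total = len(findings)
    related = sum(f.matches_also_consider for f in findings)
    return {
        "total": total,
        "distribution": dict(Counter(f.category for f in findings)),
        "also_consider_share": related / total if total else 0.0,
    }
```

Balance and quality (criteria 4 and 5) depend on ordering and actionability, so they would still need a manual read of the findings.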

## Expected Outcomes

- **TC01 (baseline)**: Generic spread of findings
- **TC02-TC05 (domain-focused)**: Some findings align with domain, others still organic
- **TC06 (single item)**: Light influence, not dominant
- **TC07 (vague items)**: Minimal change from baseline
- **TC08 (specific items)**: Direct answers if gaps exist
- **TC09 (mixed)**: Balanced across domains
- **TC10 (contradictory)**: Graceful handling
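
Phrases such as "minimal change from baseline" (TC07) can be made roughly comparable across runs by measuring how far a run's category distribution moves from TC01. One possible measure, used here purely as an illustration and not something the test suite prescribes, is the total variation distance between the two distributions. The function name and the category labels are hypothetical.

```python
from typing import Dict


def distribution_shift(baseline: Dict[str, int], run: Dict[str, int]) -> float:
    """Total variation distance between two category-count distributions.

    Returns 0.0 for identical proportions and 1.0 for disjoint categories.
    """
    base_total = sum(baseline.values())
    run_total = sum(run.values())
    if base_total == 0 or run_total == 0:
        raise ValueError("both runs need at least one finding")
    categories = set(baseline) | set(run)
    return 0.5 * sum(
        abs(baseline.get(c, 0) / base_total - run.get(c, 0) / run_total)
        for c in categories
    )
```

Under this reading, a vague-items run would be expected to score near 0 against the baseline, and a domain-focused run noticeably higher. Where to draw the line is a judgment call that the sketch does not settle.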
test/adversarial-review-tests/sample-content.md (new file, 46 lines)
@@ -0,0 +1,46 @@
# User Authentication API

## Overview

This API provides endpoints for user authentication and session management.

## Endpoints

### POST /api/auth/login

Authenticates a user and returns a token.

**Request Body:**
```json
{
  "email": "user@example.com",
  "password": "password123"
}
```

**Response:**
```json
{
  "token": "eyJhbGciOiJIUzI1NiIs...",
  "user": {
    "id": 1,
    "email": "user@example.com"
  }
}
```

### POST /api/auth/logout

Logs out the current user.

### GET /api/auth/me

Returns the current user's profile.

## Error Handling

Errors return appropriate HTTP status codes.

## Rate Limiting

Rate limiting is applied to prevent abuse.
test/adversarial-review-tests/test-cases.yaml (new file, 103 lines)
@@ -0,0 +1,103 @@
# Test Cases for review-adversarial-general.xml with also_consider input
#
# Purpose: Evaluate how the optional also_consider input influences review findings
# Content: All tests use sample-content.md (User Authentication API docs)
#
# To run: Manually invoke the task with each configuration and compare outputs

test_cases:
  # BASELINE - No also_consider
  - id: TC01
    name: "Baseline - no also_consider"
    description: "Control test with no also_consider input"
    also_consider: null
    expected_behavior: "Generic adversarial findings across all aspects"

  # DOCUMENTATION-FOCUSED
  - id: TC02
    name: "Documentation - reader confusion"
    description: "Nudge toward documentation UX issues"
    also_consider:
      - What would confuse a first-time reader?
      - What questions are left unanswered?
      - What could be interpreted multiple ways?
      - What jargon is unexplained?
    expected_behavior: "More findings about clarity, completeness, reader experience"

  - id: TC03
    name: "Documentation - examples and usage"
    description: "Nudge toward practical usage gaps"
    also_consider:
      - Missing code examples
      - Unclear usage patterns
      - Edge cases not documented
    expected_behavior: "More findings about practical application gaps"

  # SECURITY-FOCUSED
  - id: TC04
    name: "Security review"
    description: "Nudge toward security concerns"
    also_consider:
      - Authentication vulnerabilities
      - Token handling issues
      - Input validation gaps
      - Information disclosure risks
    expected_behavior: "More security-related findings"

  # API DESIGN-FOCUSED
  - id: TC05
    name: "API design"
    description: "Nudge toward API design best practices"
    also_consider:
      - REST conventions not followed
      - Inconsistent response formats
      - Missing pagination or filtering
      - Versioning concerns
    expected_behavior: "More API design pattern findings"

  # SINGLE ITEM
  - id: TC06
    name: "Single item - error handling"
    description: "Test with just one also_consider item"
    also_consider:
      - Error handling completeness
    expected_behavior: "Some emphasis on error handling while still covering other areas"

  # BROAD/VAGUE
  - id: TC07
    name: "Broad items"
    description: "Test with vague also_consider items"
    also_consider:
      - Quality issues
      - Things that seem off
    expected_behavior: "Minimal change from baseline - items too vague to steer"

  # VERY SPECIFIC
  - id: TC08
    name: "Very specific items"
    description: "Test with highly specific also_consider items"
    also_consider:
      - Is the JWT token expiration documented?
      - Are refresh token mechanics explained?
      - What happens on concurrent sessions?
    expected_behavior: "Specific findings addressing these exact questions if gaps exist"

  # MIXED DOMAINS
  - id: TC09
    name: "Mixed domain concerns"
    description: "Test with items from different domains"
    also_consider:
      - Security vulnerabilities
      - Reader confusion points
      - API design inconsistencies
      - Performance implications
    expected_behavior: "Balanced findings across multiple domains"

  # CONTRADICTORY/UNUSUAL
  - id: TC10
    name: "Contradictory items"
    description: "Test resilience with odd inputs"
    also_consider:
      - Things that are too detailed
      - Things that are not detailed enough
    expected_behavior: "Reviewer handles gracefully, finds issues in both directions"
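
Because the cases are run by hand, a quick structural check of the file can catch copy-paste slips before a review session. The sketch below is an illustration under stated assumptions and is not part of the commit: it expects the YAML to have already been parsed into plain Python dicts (for example with PyYAML's `yaml.safe_load`), and `validate_test_cases` is a hypothetical name.

```python
from typing import Any, Dict, List

REQUIRED_KEYS = ("id", "name", "description", "also_consider", "expected_behavior")


def validate_test_cases(doc: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a parsed test-cases document."""
    problems: List[str] = []
    seen_ids = set()
    for index, case in enumerate(doc.get("test_cases") or []):
        label = case.get("id", f"case #{index}")
        for key in REQUIRED_KEYS:
            if key not in case:
                problems.append(f"{label}: missing '{key}'")
        if label in seen_ids:
            problems.append(f"{label}: duplicate id")
        seen_ids.add(label)
        items = case.get("also_consider")
        # null is allowed (baseline); otherwise expect a non-empty list of strings.
        if items is not None and not (
            isinstance(items, list) and items and all(isinstance(i, str) for i in items)
        ):
            problems.append(
                f"{label}: also_consider must be null or a non-empty list of strings"
            )
    return problems
```

An empty result would mean every case carries the five keys used in the file above and that no ID is repeated.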