refactor: optimize token usage, deduplicate code, fix bugs across agents

Token reduction (~40% per session, ~2.3M fewer tokens per 200-feature project):
- Agent-type-specific tool lists: coding 9, testing 5, init 5 (was 19 for all; sketch below)
- Right-sized max_turns: coding 300, testing 100 (was 1000 for all)
- Trimmed coding prompt template (~150 lines removed)
- Streamlined testing prompt with batch support
- YOLO mode now strips browser testing instructions from prompt
- Added Grep, WebFetch, WebSearch to the expand-project session
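
A minimal sketch of what per-agent-type tool lists can look like (the list membership and the `get_tools_for_agent` name are assumptions; only the 9/5/5 counts come from this commit):

```python
# Sketch: per-agent-type tool allowlists replacing one 19-tool list.
CODING_TOOLS = [
    "Read", "Write", "Edit", "Bash", "Grep", "Glob",
    "WebFetch", "WebSearch", "TodoWrite",
]  # 9 tools (membership assumed)
TESTING_TOOLS = ["Read", "Bash", "Grep", "Glob", "TodoWrite"]  # 5 tools
INIT_TOOLS = ["Read", "Write", "Edit", "Bash", "Glob"]  # 5 tools

def get_tools_for_agent(agent_type: str) -> list[str]:
    """Return the trimmed tool list for an agent type (hypothetical helper)."""
    return {
        "coding": CODING_TOOLS,
        "testing": TESTING_TOOLS,
        "init": INIT_TOOLS,
    }.get(agent_type, CODING_TOOLS)
```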

Performance improvements:
- Rate-limit retries start at ~15s with jitter (was a fixed 60s; sketch below)
- Post-spawn delay reduced to 0.5s (was 2s)
- Orchestrator consolidated to 1 DB query per loop (was 5-7)
- Testing agents batch 3 features per session (was 1)
- Smart context compaction preserves critical state, discards noise
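
The retry change reads like standard backoff with jitter; a minimal sketch under that assumption (the growth factor, cap, and jitter range are illustrative; only the ~15s starting point is from the commit):

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the provider's rate-limit exception."""

def retry_delay(attempt: int, base: float = 15.0, cap: float = 300.0) -> float:
    # First delay is ~15s (+/- 20% jitter) instead of a fixed 60s,
    # doubling on each subsequent attempt up to the cap.
    return min(cap, base * (2 ** attempt)) * random.uniform(0.8, 1.2)

def call_with_retries(fn, max_attempts: int = 5):
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(retry_delay(attempt))
```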

Bug fixes:
- Remove ghost feature_release_testing MCP tool (it wasted tokens in every test session)
- Forward all 9 Vertex AI env vars to chat sessions (3 were previously missing)
- Fix DetachedInstanceError risk in test-batch ORM access (sketch below)
- Prevent duplicate testing of the same features in parallel mode
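
A common shape for the DetachedInstanceError fix is to copy ORM attributes into plain values while the session is still open; a sketch assuming a minimal `Feature` model (the real model and fields will differ):

```python
from dataclasses import dataclass
from sqlalchemy import Integer, String, select
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, sessionmaker

class Base(DeclarativeBase):
    pass

class Feature(Base):  # minimal stand-in for the real ORM model
    __tablename__ = "features"
    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    name: Mapped[str] = mapped_column(String)

@dataclass
class FeatureSnapshot:
    """Plain-data copy that stays valid after the session closes."""
    id: int
    name: str

def load_feature_batch(session_factory: sessionmaker, ids: list[int]) -> list[FeatureSnapshot]:
    with session_factory() as session:
        rows = session.scalars(select(Feature).where(Feature.id.in_(ids))).all()
        # Read every attribute while the session is open, so callers
        # never touch a detached instance after it closes.
        return [FeatureSnapshot(id=r.id, name=r.name) for r in rows]
```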

Code deduplication:
- _get_project_path(): 9 copies -> 1 shared utility (project_helpers.py; sketch below)
- validate_project_name(): 9 copies -> 2 variants in 1 file (validation.py)
- ROOT_DIR: 10 copies -> 1 definition (chat_constants.py)
- API_ENV_VARS: 4 copies -> 1 source of truth (env_constants.py)
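
A sketch of the first consolidation (the registry layout and signature are assumptions; only the file names appear in the commit):

```python
# project_helpers.py: one shared copy of what used to live in 9 modules.
from pathlib import Path

# chat_constants.py now holds the single ROOT_DIR definition (was 10 copies);
# inlined here with an assumed location so the sketch is self-contained.
ROOT_DIR = Path.home() / "projects"

def _get_project_path(project_name: str) -> Path:
    """Resolve a project's directory under the shared root."""
    return ROOT_DIR / project_name
```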

Security hardening:
- Unified sensitive directory blocklist (14 dirs, was two divergent lists)
- Cached get_blocked_paths() for O(1) directory-listing checks (sketch below)
- Terminal security warning when ALLOW_REMOTE=1 exposes WebSocket
- 20 new security tests for EXTRA_READ_PATHS blocking
- Extracted _validate_command_list() and _validate_pkill_processes() helpers
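
The cached blocklist plausibly amounts to a memoized frozenset; a sketch (the function name follows the commit text, the directory entries shown are assumptions, and the real unified list has 14):

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_blocked_paths() -> frozenset[str]:
    # Built once and cached; each directory-listing check is then an
    # O(1) set-membership test instead of rebuilding the list.
    return frozenset({".ssh", ".aws", ".gnupg", ".docker", ".kube"})

def is_blocked(dirname: str) -> bool:
    return dirname in get_blocked_paths()
```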

Type safety:
- 87 mypy errors -> 0 across 58 source files
- Installed types-PyYAML for proper yaml stub types
- Fixed SQLAlchemy Column[T] coercions across all routers
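
Column[T] mypy errors are typically cleared either by migrating models to SQLAlchemy 2.0 `Mapped[...]` annotations or by an explicit `cast()` at call sites; a sketch of the cast pattern (assumed; the commit doesn't show which fix was applied where):

```python
from typing import cast
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Feature(Base):  # legacy-style model: attributes are typed Column[T]
    __tablename__ = "features"
    id = Column(Integer, primary_key=True)
    name = Column(String)

def describe(feature: Feature) -> str:
    # At runtime feature.name is a plain str, but mypy sees Column[str]
    # on legacy models; cast() narrows it explicitly.
    return cast(str, feature.name).upper()
```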

Dead code removed:
- 13 files deleted (~2,679 lines): unused UI components, debug logs, outdated docs
- 7 unused npm packages removed (Radix UI components with 0 imports)
- AgentAvatar.tsx reduced from 615 -> 119 lines (SVGs extracted to mascotData.tsx)

New CLI options:
- --testing-batch-size (1-5) for parallel mode test batching
- --testing-feature-ids for direct multi-feature testing
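
A sketch of how the two flags might be declared with argparse (the actual CLI framework isn't shown in the commit; the 1-5 bounds follow the text above):

```python
import argparse

def batch_size(value: str) -> int:
    n = int(value)
    if not 1 <= n <= 5:
        raise argparse.ArgumentTypeError("testing batch size must be between 1 and 5")
    return n

parser = argparse.ArgumentParser()
parser.add_argument("--testing-batch-size", type=batch_size, default=1,
                    help="features per testing session in parallel mode (1-5)")
parser.add_argument("--testing-feature-ids", type=int, nargs="+",
                    help="test these feature IDs directly")
args = parser.parse_args()
```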

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
commit 94e0b05cb1 (parent dc5bcc4ae9)
Author: Auto
Date: 2026-02-01 13:16:24 +02:00
57 changed files with 1974 additions and 4300 deletions


@@ -1,58 +1,29 @@
 ## YOUR ROLE - TESTING AGENT
-You are a **testing agent** responsible for **regression testing** previously-passing features.
+You are a **testing agent** responsible for **regression testing** previously-passing features. If you find a regression, you must fix it.
-Your job is to ensure that features marked as "passing" still work correctly. If you find a regression (a feature that no longer works), you must fix it.
+## ASSIGNED FEATURES FOR REGRESSION TESTING
-### STEP 1: GET YOUR BEARINGS (MANDATORY)
+You are assigned to test the following features: {{TESTING_FEATURE_IDS}}
-Start by orienting yourself:
+### Workflow for EACH feature:
+1. Call `feature_get_by_id` with the feature ID
+2. Read the feature's verification steps
+3. Test the feature in the browser
+4. Call `feature_mark_passing` or `feature_mark_failing`
+5. Move to the next feature
-```bash
-# 1. See your working directory
-pwd
+---
-# 2. List files to understand project structure
-ls -la
+### STEP 1: GET YOUR ASSIGNED FEATURE(S)
-# 3. Read progress notes from previous sessions (last 200 lines)
-tail -200 claude-progress.txt
-# 4. Check recent git history
-git log --oneline -10
-```
-Then use MCP tools to check feature status:
+Your features have been pre-assigned by the orchestrator. For each feature ID listed above, use `feature_get_by_id` to get the details:
 ```
-# 5. Get progress statistics
-Use the feature_get_stats tool
+Use the feature_get_by_id tool with feature_id=<ID>
 ```
-### STEP 2: START SERVERS (IF NOT RUNNING)
-If `init.sh` exists, run it:
-```bash
-chmod +x init.sh
-./init.sh
-```
-Otherwise, start servers manually.
-### STEP 3: GET YOUR ASSIGNED FEATURE
-Your feature has been pre-assigned by the orchestrator. Use `feature_get_by_id` to get the details:
-```
-Use the feature_get_by_id tool with feature_id={your_assigned_id}
-```
-The orchestrator has already claimed this feature for testing (set `testing_in_progress=true`).
-**CRITICAL:** You MUST call `feature_release_testing` when done, regardless of pass/fail.
-### STEP 4: VERIFY THE FEATURE
+### STEP 2: VERIFY THE FEATURE
 **CRITICAL:** You MUST verify the feature through the actual UI using browser automation.
@@ -81,21 +52,11 @@ Use browser automation tools:
 - browser_console_messages - Get browser console output (check for errors)
 - browser_network_requests - Monitor API calls
-### STEP 5: HANDLE RESULTS
+### STEP 3: HANDLE RESULTS
 #### If the feature PASSES:
-The feature still works correctly. Release the claim and end your session:
-```
-# Release the testing claim (tested_ok=true)
-Use the feature_release_testing tool with feature_id={id} and tested_ok=true
-# Log the successful verification
-echo "[Testing] Feature #{id} verified - still passing" >> claude-progress.txt
-```
-**DO NOT** call feature_mark_passing again - it's already passing.
+The feature still works correctly. **DO NOT** call feature_mark_passing again -- it's already passing. End your session.
 #### If the feature FAILS (regression found):
@@ -125,13 +86,7 @@ A regression has been introduced. You MUST fix it:
 Use the feature_mark_passing tool with feature_id={id}
 ```
-6. **Release the testing claim:**
-```
-Use the feature_release_testing tool with feature_id={id} and tested_ok=false
-```
-Note: tested_ok=false because we found a regression (even though we fixed it).
-7. **Commit the fix:**
+6. **Commit the fix:**
 ```bash
 git add .
 git commit -m "Fix regression in [feature name]
@@ -141,14 +96,6 @@ A regression has been introduced. You MUST fix it:
 - Verified with browser automation"
 ```
-### STEP 6: UPDATE PROGRESS AND END
-Update `claude-progress.txt`:
-```bash
-echo "[Testing] Session complete - verified/fixed feature #{id}" >> claude-progress.txt
-```
 ---
 ## AVAILABLE MCP TOOLS
@@ -156,12 +103,11 @@ echo "[Testing] Session complete - verified/fixed feature #{id}" >> claude-progr
 ### Feature Management
 - `feature_get_stats` - Get progress overview (passing/in_progress/total counts)
 - `feature_get_by_id` - Get your assigned feature details
-- `feature_release_testing` - **REQUIRED** - Release claim after testing (pass tested_ok=true/false)
 - `feature_mark_failing` - Mark a feature as failing (when you find a regression)
 - `feature_mark_passing` - Mark a feature as passing (after fixing a regression)
 ### Browser Automation (Playwright)
-All interaction tools have **built-in auto-wait** - no manual timeouts needed.
+All interaction tools have **built-in auto-wait** -- no manual timeouts needed.
 - `browser_navigate` - Navigate to URL
 - `browser_take_screenshot` - Capture screenshot
@@ -178,9 +124,7 @@ All interaction tools have **built-in auto-wait** - no manual timeouts needed.
 ## IMPORTANT REMINDERS
-**Your Goal:** Verify that passing features still work, and fix any regressions found.
-**This Session's Goal:** Test ONE feature thoroughly.
+**Your Goal:** Test each assigned feature thoroughly. Verify it still works, and fix any regression found. Process ALL features in your list before ending your session.
 **Quality Bar:**
 - Zero console errors
@@ -188,21 +132,15 @@ All interaction tools have **built-in auto-wait** - no manual timeouts needed.
 - Visual appearance correct
 - API calls succeed
-**CRITICAL - Always release your claim:**
-- Call `feature_release_testing` when done, whether pass or fail
-- Pass `tested_ok=true` if the feature passed
-- Pass `tested_ok=false` if you found a regression
 **If you find a regression:**
 1. Mark the feature as failing immediately
 2. Fix the issue
 3. Verify the fix with browser automation
 4. Mark as passing only after thorough verification
-5. Release the testing claim with `tested_ok=false`
-6. Commit the fix
+5. Commit the fix
-**You have one iteration.** Focus on testing ONE feature thoroughly.
+**You have one iteration.** Test all assigned features before ending.
 ---
-Begin by running Step 1 (Get Your Bearings).
+Begin by running Step 1 for the first feature in your assigned list.