refactor: optimize token usage, deduplicate code, fix bugs across agents

Token reduction (~40% per session, ~2.3M fewer tokens per 200-feature project):
- Agent-type-specific tool lists: coding 9, testing 5, init 5 (was 19 for all)
- Right-sized max_turns: coding 300, testing 100 (was 1000 for all); see the sketch after this list
- Trimmed coding prompt template (~150 lines removed)
- Streamlined testing prompt with batch support
- YOLO mode now strips browser testing instructions from prompt
- Added Grep, WebFetch, WebSearch to expand project session
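
A minimal sketch of how the per-agent-type limits could be wired up. The exact tool lists and the `AGENT_TOOLS` / `AGENT_MAX_TURNS` / `session_options` names are illustrative assumptions, not the actual identifiers in the codebase:

```python
# Illustrative sketch: per-agent-type tool allowlists and turn budgets,
# replacing a single 19-tool / 1000-turn default for every session.
AGENT_TOOLS: dict[str, list[str]] = {
    "coding": [  # 9 tools
        "Read", "Write", "Edit", "Bash", "Grep", "Glob",
        "feature_get_by_id", "feature_mark_in_progress", "feature_mark_passing",
    ],
    "testing": [  # 5 tools
        "feature_get_by_id", "feature_mark_failing", "feature_mark_passing",
        "browser_navigate", "browser_take_screenshot",
    ],
    "init": ["Read", "Write", "Edit", "Bash", "Glob"],  # 5 tools
}
AGENT_MAX_TURNS: dict[str, int] = {"coding": 300, "testing": 100}

def session_options(agent_type: str) -> dict:
    """Build right-sized session options for one agent type."""
    return {
        "allowed_tools": AGENT_TOOLS[agent_type],
        "max_turns": AGENT_MAX_TURNS.get(agent_type, 100),
    }
```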

Performance improvements:
- Rate limit retries start at ~15s with jitter (was a fixed 60s); see the sketch after this list
- Post-spawn delay reduced to 0.5s (was 2s)
- Orchestrator consolidated to 1 DB query per loop (was 5-7)
- Testing agents batch 3 features per session (was 1)
- Smart context compaction preserves critical state, discards noise
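
The commit only states that retries now start around 15s with jitter; the exponential growth, the 120s cap, and the helper names below are assumptions for illustration:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def retry_delay(attempt: int, base: float = 15.0, cap: float = 120.0) -> float:
    """Backoff starting near ~15s with jitter, replacing the old fixed 60s wait."""
    delay = min(cap, base * (2 ** attempt))
    return delay * random.uniform(0.5, 1.0)  # jitter: 50-100% of the backoff

def with_rate_limit_retries(call: Callable[[], T],
                            is_rate_limit: Callable[[Exception], bool],
                            max_attempts: int = 5) -> T:
    """Retry `call` on rate-limit errors, sleeping a jittered delay between tries."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            if attempt == max_attempts - 1 or not is_rate_limit(exc):
                raise
            time.sleep(retry_delay(attempt))
    raise AssertionError("unreachable")
```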

Bug fixes:
- Removed ghost feature_release_testing MCP tool (it wasted tokens in every test session)
- Forward all 9 Vertex AI env vars to chat sessions (3 were previously missing)
- Fix DetachedInstanceError risk in test batch ORM access; see the sketch after this list
- Prevent duplicate testing of same features in parallel mode
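
One common way to avoid DetachedInstanceError when batching ORM rows is to copy the needed columns into plain values while the session is still open. This is a sketch of that pattern only; the model, fields, and helper name are illustrative, not the project's actual schema:

```python
from dataclasses import dataclass
from sqlalchemy import Integer, String
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column

class Base(DeclarativeBase):
    pass

class Feature(Base):  # minimal stand-in for the real model
    __tablename__ = "features"
    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    name: Mapped[str] = mapped_column(String, default="")

@dataclass
class FeatureSnapshot:
    """Plain-value copy of a row, safe to pass around after the session closes."""
    id: int
    name: str

def load_test_batch(session: Session, feature_ids: list[int]) -> list[FeatureSnapshot]:
    rows = session.query(Feature).filter(Feature.id.in_(feature_ids)).all()
    # Copy attributes while the session is still open; touching ORM attributes
    # after the session is gone is what raises DetachedInstanceError.
    return [FeatureSnapshot(id=r.id, name=r.name) for r in rows]
```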

Code deduplication:
- _get_project_path(): 9 copies -> 1 shared utility (project_helpers.py); see the sketch after this list
- validate_project_name(): 9 copies -> 2 variants in 1 file (validation.py)
- ROOT_DIR: 10 copies -> 1 definition (chat_constants.py)
- API_ENV_VARS: 4 copies -> 1 source of truth (env_constants.py)
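
The commit names the consolidated helpers but not their signatures; a plausible shape, with the path layout, regex, and the second validation variant as illustrative assumptions:

```python
# project_helpers.py (sketch - the real signature may differ)
from pathlib import Path

def _get_project_path(root_dir: Path, project_name: str) -> Path:
    """Single shared resolver replacing nine per-module copies."""
    return (root_dir / "projects" / project_name).resolve()

# validation.py (sketch of the two variants)
import re

_NAME_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9_-]{0,63}$")

def validate_project_name(name: str) -> str:
    """Strict variant: raise on invalid names."""
    if not _NAME_RE.fullmatch(name):
        raise ValueError(f"invalid project name: {name!r}")
    return name

def is_valid_project_name(name: str) -> bool:
    """Lenient variant: boolean check for callers that handle errors themselves."""
    return _NAME_RE.fullmatch(name) is not None
```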

Security hardening:
- Unified sensitive directory blocklist (14 dirs, replacing two divergent lists)
- Cached get_blocked_paths() for O(1) directory listing checks; see the sketch after this list
- Terminal security warning when ALLOW_REMOTE=1 exposes WebSocket
- 20 new security tests for EXTRA_READ_PATHS blocking
- Extracted _validate_command_list() and _validate_pkill_processes() helpers
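
`get_blocked_paths()` is named in the commit; the directory entries, caching strategy, and the `is_blocked_dir` call site below are illustrative only:

```python
from functools import lru_cache
from pathlib import Path

# Illustrative subset of the unified blocklist (the real list has 14 entries).
_SENSITIVE_DIRS = (".ssh", ".aws", ".gnupg", ".kube", ".config/gcloud")

@lru_cache(maxsize=1)
def get_blocked_paths() -> frozenset[str]:
    """Resolve the blocklist once and cache it, so later checks are set lookups
    instead of repeated path resolution."""
    home = Path.home()
    return frozenset(str((home / d).resolve()) for d in _SENSITIVE_DIRS)

def is_blocked_dir(path: str) -> bool:
    """O(1) membership check used when listing directories."""
    return str(Path(path).resolve()) in get_blocked_paths()
```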

Type safety:
- 87 mypy errors -> 0 across 58 source files
- Installed types-PyYAML for proper yaml type stubs
- Fixed SQLAlchemy Column[T] coercions across all routers

Dead code removed:
- 13 files deleted (~2,679 lines): unused UI components, debug logs, outdated docs
- 7 unused npm packages removed (Radix UI components with 0 imports)
- AgentAvatar.tsx reduced from 615 -> 119 lines (SVGs extracted to mascotData.tsx)

New CLI options:
- --testing-batch-size (1-5) for parallel mode test batching
- --testing-feature-ids for direct multi-feature testing (both flags sketched below)
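
The flag names and the 1-5 range come from the commit; the parser, defaults, and value handling below are assumptions sketched with argparse:

```python
import argparse

def add_testing_options(parser: argparse.ArgumentParser) -> None:
    """Sketch of the new flags; the actual CLI may use a different parser or defaults."""
    parser.add_argument(
        "--testing-batch-size",
        type=int, choices=range(1, 6), default=1, metavar="{1-5}",
        help="Features each testing agent verifies per session (parallel mode).",
    )
    parser.add_argument(
        "--testing-feature-ids",
        type=lambda s: [int(x) for x in s.split(",")],
        default=None, metavar="ID[,ID...]",
        help="Test these specific feature IDs directly.",
    )
```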

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Auto
2026-02-01 13:16:24 +02:00
parent dc5bcc4ae9
commit 94e0b05cb1
57 changed files with 1974 additions and 4300 deletions


@@ -49,51 +49,21 @@ Otherwise, start servers manually and document the process.
#### TEST-DRIVEN DEVELOPMENT MINDSET (CRITICAL)
Features are **test cases** that drive development. This is test-driven development:
Features are **test cases** that drive development. If functionality doesn't exist, **BUILD IT** -- you are responsible for implementing ALL required functionality. Missing pages, endpoints, database tables, or components are NOT blockers; they are your job to create.
- **If you can't test a feature because functionality doesn't exist → BUILD IT**
- You are responsible for implementing ALL required functionality
- Never assume another process will build it later
- "Missing functionality" is NOT a blocker - it's your job to create it
**Example:** Feature says "User can filter flashcards by difficulty level"
- WRONG: "Flashcard page doesn't exist yet" → skip feature
- RIGHT: "Flashcard page doesn't exist yet" → build flashcard page → implement filter → test feature
**Note:** Your feature has been pre-assigned by the orchestrator. Use `feature_get_by_id` with your assigned feature ID to get the details.
Once you've retrieved the feature, **mark it as in-progress** (if not already):
**Note:** Your feature has been pre-assigned by the orchestrator. Use `feature_get_by_id` with your assigned feature ID to get the details. Then mark it as in-progress:
```
# Mark feature as in-progress
Use the feature_mark_in_progress tool with feature_id={your_assigned_id}
```
If you get "already in-progress" error, that's OK - continue with implementation.
Focus on completing one feature perfectly and completing its testing steps in this session before moving on to other features.
It's ok if you only complete one feature in this session, as there will be more sessions later that continue to make progress.
Focus on completing one feature perfectly in this session. It's ok if you only complete one feature, as more sessions will follow.
#### When to Skip a Feature (EXTREMELY RARE)
**Skipping should almost NEVER happen.** Only skip for truly external blockers you cannot control:
- **External API not configured**: Third-party service credentials missing (e.g., Stripe keys, OAuth secrets)
- **External service unavailable**: Dependency on service that's down or inaccessible
- **Environment limitation**: Hardware or system requirement you cannot fulfill
**NEVER skip because:**
| Situation | Wrong Action | Correct Action |
|-----------|--------------|----------------|
| "Page doesn't exist" | Skip | Create the page |
| "API endpoint missing" | Skip | Implement the endpoint |
| "Database table not ready" | Skip | Create the migration |
| "Component not built" | Skip | Build the component |
| "No data to test with" | Skip | Create test data or build data entry flow |
| "Feature X needs to be done first" | Skip | Build feature X as part of this feature |
If a feature requires building other functionality first, **build that functionality**. You are the coding agent - your job is to make the feature work, not to defer it.
Only skip for truly external blockers: missing third-party credentials (Stripe keys, OAuth secrets), unavailable external services, or unfulfillable environment requirements. **NEVER** skip because a page, endpoint, component, or data doesn't exist yet -- build it. If a feature requires other functionality first, build that functionality as part of this feature.
If you must skip (truly external blocker only):
@@ -139,130 +109,22 @@ Use browser automation tools:
### STEP 5.5: MANDATORY VERIFICATION CHECKLIST (BEFORE MARKING ANY TEST PASSING)
**You MUST complete ALL of these checks before marking any feature as "passes": true**
**Complete ALL applicable checks before marking any feature as passing:**
#### Security Verification (for protected features)
- [ ] Feature respects user role permissions
- [ ] Unauthenticated access is blocked (redirects to login)
- [ ] API endpoint checks authorization (returns 401/403 appropriately)
- [ ] Cannot access other users' data by manipulating URLs
#### Real Data Verification (CRITICAL - NO MOCK DATA)
- [ ] Created unique test data via UI (e.g., "TEST_12345_VERIFY_ME")
- [ ] Verified the EXACT data I created appears in UI
- [ ] Refreshed page - data persists (proves database storage)
- [ ] Deleted the test data - verified it's gone everywhere
- [ ] NO unexplained data appeared (would indicate mock data)
- [ ] Dashboard/counts reflect real numbers after my changes
- [ ] **Ran extended mock data grep (STEP 5.6) - no hits in src/ (excluding tests)**
- [ ] **Verified no globalThis, devStore, or dev-store patterns**
- [ ] **Server restart test passed (STEP 5.7) - data persists across restart**
#### Navigation Verification
- [ ] All buttons on this page link to existing routes
- [ ] No 404 errors when clicking any interactive element
- [ ] Back button returns to correct previous page
- [ ] Related links (edit, view, delete) have correct IDs in URLs
#### Integration Verification
- [ ] Console shows ZERO JavaScript errors
- [ ] Network tab shows successful API calls (no 500s)
- [ ] Data returned from API matches what UI displays
- [ ] Loading states appeared during API calls
- [ ] Error states handle failures gracefully
- **Security:** Feature respects role permissions; unauthenticated access blocked; API checks auth (401/403); no cross-user data leaks via URL manipulation
- **Real Data:** Create unique test data via UI, verify it appears, refresh to confirm persistence, delete and verify removal. No unexplained data (indicates mocks). Dashboard counts reflect real numbers
- **Mock Data Grep:** Run STEP 5.6 grep checks - no hits in src/ (excluding tests). No globalThis, devStore, or dev-store patterns
- **Server Restart:** For data features, run STEP 5.7 - data persists across server restart
- **Navigation:** All buttons link to existing routes, no 404s, back button works, edit/view/delete links have correct IDs
- **Integration:** Zero JS console errors, no 500s in network tab, API data matches UI, loading/error states work
### STEP 5.6: MOCK DATA DETECTION (Before marking passing)
**Run ALL these grep checks. Any hits in src/ (excluding test files) require investigation:**
```bash
# Common exclusions for test files
EXCLUDE="--exclude=*.test.* --exclude=*.spec.* --exclude=*__test__* --exclude=*__mocks__*"
# 1. In-memory storage patterns (CRITICAL - catches dev-store)
grep -r "globalThis\." --include="*.ts" --include="*.tsx" --include="*.js" $EXCLUDE src/
grep -r "dev-store\|devStore\|DevStore\|mock-db\|mockDb" --include="*.ts" --include="*.tsx" --include="*.js" $EXCLUDE src/
# 2. Mock data variables
grep -r "mockData\|fakeData\|sampleData\|dummyData\|testData" --include="*.ts" --include="*.tsx" --include="*.js" $EXCLUDE src/
# 3. TODO/incomplete markers
grep -r "TODO.*real\|TODO.*database\|TODO.*API\|STUB\|MOCK" --include="*.ts" --include="*.tsx" --include="*.js" $EXCLUDE src/
# 4. Development-only conditionals
grep -r "isDevelopment\|isDev\|process\.env\.NODE_ENV.*development" --include="*.ts" --include="*.tsx" --include="*.js" $EXCLUDE src/
# 5. In-memory collections as data stores
grep -r "new Map\(\)\|new Set\(\)" --include="*.ts" --include="*.tsx" --include="*.js" $EXCLUDE src/ 2>/dev/null
```
**Rule:** If ANY grep returns results in production code → investigate → FIX before marking passing.
**Runtime verification:**
1. Create unique data (e.g., "TEST_12345") → verify in UI → delete → verify gone
2. Check database directly - all displayed data must come from real DB queries
3. If unexplained data appears, it's mock data - fix before marking passing.
Before marking a feature passing, grep for mock/placeholder data patterns in src/ (excluding test files): `globalThis`, `devStore`, `dev-store`, `mockDb`, `mockData`, `fakeData`, `sampleData`, `dummyData`, `testData`, `TODO.*real`, `TODO.*database`, `STUB`, `MOCK`, `isDevelopment`, `isDev`. Any hits in production code must be investigated and fixed. Also create unique test data (e.g., "TEST_12345"), verify it appears in UI, then delete and confirm removal - unexplained data indicates mock implementations.
### STEP 5.7: SERVER RESTART PERSISTENCE TEST (MANDATORY for data features)
**When required:** Any feature involving CRUD operations or data persistence.
**This test is NON-NEGOTIABLE. It catches in-memory storage implementations that pass all other tests.**
**Steps:**
1. Create unique test data via UI or API (e.g., item named "RESTART_TEST_12345")
2. Verify data appears in UI and API response
3. **STOP the server completely:**
```bash
# Kill by port (safer - only kills the dev server, not VS Code/Claude Code/etc.)
# Unix/macOS:
lsof -ti :${PORT:-3000} | xargs kill -TERM 2>/dev/null || true
sleep 3
lsof -ti :${PORT:-3000} | xargs kill -9 2>/dev/null || true
sleep 2
# Windows alternative (use if lsof not available):
# netstat -ano | findstr :${PORT:-3000} | findstr LISTENING
# taskkill /F /PID <pid_from_above> 2>nul
# Verify server is stopped
if lsof -ti :${PORT:-3000} > /dev/null 2>&1; then
echo "ERROR: Server still running on port ${PORT:-3000}!"
exit 1
fi
```
4. **RESTART the server:**
```bash
./init.sh &
sleep 15 # Allow server to fully start
# Verify server is responding
if ! curl -f http://localhost:${PORT:-3000}/api/health && ! curl -f http://localhost:${PORT:-3000}; then
echo "ERROR: Server failed to start after restart"
exit 1
fi
```
5. **Query for test data - it MUST still exist**
- Via UI: Navigate to data location, verify data appears
- Via API: `curl http://localhost:${PORT:-3000}/api/items` - verify data in response
6. **If data is GONE:** Implementation uses in-memory storage → CRITICAL FAIL
- Run all grep commands from STEP 5.6 to identify the mock pattern
- You MUST fix the in-memory storage implementation before proceeding
- Replace in-memory storage with real database queries
7. **Clean up test data** after successful verification
**Why this test exists:** In-memory stores like `globalThis.devStore` pass all other tests because data persists during a single server run. Only a full server restart reveals this bug. Skipping this step WILL allow dev-store implementations to slip through.
**YOLO Mode Note:** Even in YOLO mode, this verification is MANDATORY for data features. Use curl instead of browser automation.
For any feature involving CRUD or data persistence: create unique test data (e.g., "RESTART_TEST_12345"), verify it exists, then fully stop and restart the dev server. After restart, verify the test data still exists. If data is gone, the implementation uses in-memory storage -- run STEP 5.6 greps, find the mock pattern, and replace with real database queries. Clean up test data after verification. This test catches in-memory stores like `globalThis.devStore` that pass all other tests but lose data on restart.
### STEP 6: UPDATE FEATURE STATUS (CAREFULLY!)


@@ -1,58 +1,29 @@
## YOUR ROLE - TESTING AGENT
You are a **testing agent** responsible for **regression testing** previously-passing features.
You are a **testing agent** responsible for **regression testing** previously-passing features. If you find a regression, you must fix it.
Your job is to ensure that features marked as "passing" still work correctly. If you find a regression (a feature that no longer works), you must fix it.
## ASSIGNED FEATURES FOR REGRESSION TESTING
### STEP 1: GET YOUR BEARINGS (MANDATORY)
You are assigned to test the following features: {{TESTING_FEATURE_IDS}}
Start by orienting yourself:
### Workflow for EACH feature:
1. Call `feature_get_by_id` with the feature ID
2. Read the feature's verification steps
3. Test the feature in the browser
4. Call `feature_mark_passing` or `feature_mark_failing`
5. Move to the next feature
```bash
# 1. See your working directory
pwd
---
# 2. List files to understand project structure
ls -la
### STEP 1: GET YOUR ASSIGNED FEATURE(S)
# 3. Read progress notes from previous sessions (last 200 lines)
tail -200 claude-progress.txt
# 4. Check recent git history
git log --oneline -10
```
Then use MCP tools to check feature status:
Your features have been pre-assigned by the orchestrator. For each feature ID listed above, use `feature_get_by_id` to get the details:
```
# 5. Get progress statistics
Use the feature_get_stats tool
Use the feature_get_by_id tool with feature_id=<ID>
```
### STEP 2: START SERVERS (IF NOT RUNNING)
If `init.sh` exists, run it:
```bash
chmod +x init.sh
./init.sh
```
Otherwise, start servers manually.
### STEP 3: GET YOUR ASSIGNED FEATURE
Your feature has been pre-assigned by the orchestrator. Use `feature_get_by_id` to get the details:
```
Use the feature_get_by_id tool with feature_id={your_assigned_id}
```
The orchestrator has already claimed this feature for testing (set `testing_in_progress=true`).
**CRITICAL:** You MUST call `feature_release_testing` when done, regardless of pass/fail.
### STEP 4: VERIFY THE FEATURE
### STEP 2: VERIFY THE FEATURE
**CRITICAL:** You MUST verify the feature through the actual UI using browser automation.
@@ -81,21 +52,11 @@ Use browser automation tools:
- browser_console_messages - Get browser console output (check for errors)
- browser_network_requests - Monitor API calls
### STEP 5: HANDLE RESULTS
### STEP 3: HANDLE RESULTS
#### If the feature PASSES:
The feature still works correctly. Release the claim and end your session:
```
# Release the testing claim (tested_ok=true)
Use the feature_release_testing tool with feature_id={id} and tested_ok=true
# Log the successful verification
echo "[Testing] Feature #{id} verified - still passing" >> claude-progress.txt
```
**DO NOT** call feature_mark_passing again - it's already passing.
The feature still works correctly. **DO NOT** call feature_mark_passing again -- it's already passing. End your session.
#### If the feature FAILS (regression found):
@@ -125,13 +86,7 @@ A regression has been introduced. You MUST fix it:
Use the feature_mark_passing tool with feature_id={id}
```
6. **Release the testing claim:**
```
Use the feature_release_testing tool with feature_id={id} and tested_ok=false
```
Note: tested_ok=false because we found a regression (even though we fixed it).
7. **Commit the fix:**
6. **Commit the fix:**
```bash
git add .
git commit -m "Fix regression in [feature name]
@@ -141,14 +96,6 @@ A regression has been introduced. You MUST fix it:
- Verified with browser automation"
```
### STEP 6: UPDATE PROGRESS AND END
Update `claude-progress.txt`:
```bash
echo "[Testing] Session complete - verified/fixed feature #{id}" >> claude-progress.txt
```
---
## AVAILABLE MCP TOOLS
@@ -156,12 +103,11 @@ echo "[Testing] Session complete - verified/fixed feature #{id}" >> claude-progr
### Feature Management
- `feature_get_stats` - Get progress overview (passing/in_progress/total counts)
- `feature_get_by_id` - Get your assigned feature details
- `feature_release_testing` - **REQUIRED** - Release claim after testing (pass tested_ok=true/false)
- `feature_mark_failing` - Mark a feature as failing (when you find a regression)
- `feature_mark_passing` - Mark a feature as passing (after fixing a regression)
### Browser Automation (Playwright)
All interaction tools have **built-in auto-wait** - no manual timeouts needed.
All interaction tools have **built-in auto-wait** -- no manual timeouts needed.
- `browser_navigate` - Navigate to URL
- `browser_take_screenshot` - Capture screenshot
@@ -178,9 +124,7 @@ All interaction tools have **built-in auto-wait** - no manual timeouts needed.
## IMPORTANT REMINDERS
**Your Goal:** Verify that passing features still work, and fix any regressions found.
**This Session's Goal:** Test ONE feature thoroughly.
**Your Goal:** Test each assigned feature thoroughly. Verify it still works, and fix any regression found. Process ALL features in your list before ending your session.
**Quality Bar:**
- Zero console errors
@@ -188,21 +132,15 @@ All interaction tools have **built-in auto-wait** - no manual timeouts needed.
- Visual appearance correct
- API calls succeed
**CRITICAL - Always release your claim:**
- Call `feature_release_testing` when done, whether pass or fail
- Pass `tested_ok=true` if the feature passed
- Pass `tested_ok=false` if you found a regression
**If you find a regression:**
1. Mark the feature as failing immediately
2. Fix the issue
3. Verify the fix with browser automation
4. Mark as passing only after thorough verification
5. Release the testing claim with `tested_ok=false`
6. Commit the fix
5. Commit the fix
**You have one iteration.** Focus on testing ONE feature thoroughly.
**You have one iteration.** Test all assigned features before ending.
---
Begin by running Step 1 (Get Your Bearings).
Begin by running Step 1 for the first feature in your assigned list.