initial commit

Cody Seibert
2025-12-07 16:43:26 -05:00
commit 3c8e786f29
70 changed files with 21487 additions and 0 deletions

reference/.gitignore (vendored, normal file, 4 lines added)

@@ -0,0 +1,4 @@
# Agent-generated output directories
# Log files
logs/

reference/README.md (normal file, 163 lines added)

@@ -0,0 +1,163 @@
# Autonomous Coding Agent Demo
A minimal harness demonstrating long-running autonomous coding with the Claude Agent SDK. This demo implements a two-agent pattern (initializer + coding agent) that can build complete applications over multiple sessions.
## Prerequisites
**Required:** Install the latest versions of both Claude Code and the Claude Agent SDK:
```bash
# Install Claude Code CLI (latest version required)
npm install -g @anthropic-ai/claude-code
# Install Python dependencies
pip install -r requirements.txt
```
Verify your installations:
```bash
claude --version # Should be latest version
pip show claude-code-sdk # Check SDK is installed
```
**API Key:** Set your Anthropic API key:
```bash
export ANTHROPIC_API_KEY='your-api-key-here'
```
## Quick Start
```bash
python autonomous_agent_demo.py --project-dir ./my_project
```
For testing with limited iterations:
```bash
python autonomous_agent_demo.py --project-dir ./my_project --max-iterations 3
```
## Important Timing Expectations
> **Warning: This demo takes a long time to run!**
- **First session (initialization):** The agent generates a `feature_list.json` with 200 test cases. This can take 10-20+ minutes (the figure `agent.py` prints at startup) and may appear to hang; this is normal, because the agent is writing out every feature.
- **Subsequent sessions:** Each coding iteration can take **5-15 minutes** depending on complexity.
- **Full app:** Building all 200 features typically requires **many hours** of total runtime across multiple sessions.
**Tip:** The 200-feature count in the prompts is designed for comprehensive coverage. For a faster demo, edit `prompts/initializer_prompt.md` to reduce the feature count (e.g., 20-50 features).
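As a rough back-of-the-envelope check on these figures, the sketch below multiplies the feature count by the 5-15 minute range per iteration. It assumes one feature is completed per coding iteration and ignores the initializer session; `estimate_runtime_hours` is a hypothetical helper, not part of the demo.

```python
def estimate_runtime_hours(
    feature_count: int,
    min_minutes_per_iteration: float = 5.0,
    max_minutes_per_iteration: float = 15.0,
) -> tuple[float, float]:
    """Return a (low, high) estimate of total coding time in hours.

    Assumes one feature per iteration, which is a simplification:
    real sessions may finish more or fewer features.
    """
    low = feature_count * min_minutes_per_iteration / 60
    high = feature_count * max_minutes_per_iteration / 60
    return low, high


if __name__ == "__main__":
    for count in (20, 50, 200):
        low, high = estimate_runtime_hours(count)
        print(f"{count} features: roughly {low:.1f} to {high:.1f} hours")
```

Under these assumptions, reducing the feature count shortens the total runtime proportionally.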
## How It Works
### Two-Agent Pattern
1. **Initializer Agent (Session 1):** Reads `app_spec.txt`, creates `feature_list.json` with 200 test cases, sets up project structure, and initializes git.
2. **Coding Agent (Sessions 2+):** Picks up where the previous session left off, implements features one by one, and marks them as passing in `feature_list.json`.
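The hand-off between the two agents can be sketched as a check for `feature_list.json`, which is the same file `agent.py` looks for to detect a fresh start. `choose_prompt_name` is a hypothetical helper that returns a prompt file stem; it illustrates the idea and is not the demo's exact code.

```python
import tempfile
from pathlib import Path


def choose_prompt_name(project_dir: Path) -> str:
    """Pick the prompt for the next session based on persisted state."""
    feature_list = project_dir / "feature_list.json"
    # No feature list yet means the initializer has not run (or did not finish).
    if not feature_list.exists():
        return "initializer_prompt"
    return "coding_prompt"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp)
        print(choose_prompt_name(project))
        (project / "feature_list.json").write_text("[]")
        print(choose_prompt_name(project))
```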
### Session Management
- Each session runs with a fresh context window
- Progress is persisted via `feature_list.json` and git commits
- The agent auto-continues between sessions (3 second delay)
- Press `Ctrl+C` to pause; run the same command to resume
## Security Model
This demo uses a defense-in-depth security approach (see `security.py` and `client.py`):
1. **OS-level Sandbox:** Bash commands run in an isolated environment
2. **Filesystem Restrictions:** File operations restricted to the project directory only
3. **Bash Allowlist:** Only specific commands are permitted:
   - File inspection: `ls`, `cat`, `head`, `tail`, `wc`, `grep`
   - Node.js: `npm`, `node`
   - Version control: `git`
   - Process management: `ps`, `lsof`, `sleep`, `pkill` (dev processes only)
Commands not in the allowlist are blocked by the security hook.
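To make the allowlist idea concrete, here is a minimal sketch of how such a check might work. It is not the code in `security.py`: the `ALLOWED_COMMANDS` set below is copied from the bullets above as an assumption, and `extract_commands` and `is_command_allowed` are hypothetical names. The sketch fails closed on anything it cannot parse.

```python
import shlex

# Assumed allowlist, copied from the README bullets; the real
# ALLOWED_COMMANDS in security.py may differ.
ALLOWED_COMMANDS = {
    "ls", "cat", "head", "tail", "wc", "grep",
    "npm", "node",
    "git",
    "ps", "lsof", "sleep", "pkill",
}

# Characters that make up shell operators separating commands (;, |, &&, ||).
SEPARATOR_CHARS = set(";|&")


def extract_commands(command_line: str) -> list[str]:
    """Return the program name of each command in a simple shell line."""
    lexer = shlex.shlex(command_line, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    names = []
    expect_command = True
    for token in lexer:
        if token and set(token) <= SEPARATOR_CHARS:
            # The next token starts a new command.
            expect_command = True
        elif expect_command:
            names.append(token)
            expect_command = False
    return names


def is_command_allowed(command_line: str) -> bool:
    """Allow a line only if every command in it is on the allowlist."""
    # Fail closed on command substitution, which this sketch cannot parse.
    if "$(" in command_line or "`" in command_line:
        return False
    try:
        names = extract_commands(command_line)
    except ValueError:
        # Unbalanced quotes and similar parse errors are rejected.
        return False
    return bool(names) and all(name in ALLOWED_COMMANDS for name in names)


if __name__ == "__main__":
    for line in ("ls -la", "git status && npm test", "ls && rm -rf /"):
        print(f"{line!r}: {'allowed' if is_command_allowed(line) else 'blocked'}")
```

A production hook would also need per-command rules (for example, restricting what `pkill` may target), which this sketch leaves out.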
## Project Structure
```
autonomous-coding/
├── autonomous_agent_demo.py # Main entry point
├── agent.py # Agent session logic
├── client.py # Claude SDK client configuration
├── security.py # Bash command allowlist and validation
├── progress.py # Progress tracking utilities
├── prompts.py # Prompt loading utilities
├── prompts/
│ ├── app_spec.txt # Application specification
│ ├── initializer_prompt.md # First session prompt
│ └── coding_prompt.md # Continuation session prompt
└── requirements.txt # Python dependencies
```
## Generated Project Structure
After running, your project directory will contain:
```
my_project/
├── feature_list.json # Test cases (source of truth)
├── app_spec.txt # Copied specification
├── init.sh # Environment setup script
├── claude-progress.txt # Session progress notes
├── .claude_settings.json # Security settings
└── [application files] # Generated application code
```
## Running the Generated Application
After the agent completes (or pauses), you can run the generated application:
```bash
cd generations/my_project
# Run the setup script created by the agent
./init.sh
# Or manually (typical for Node.js apps):
npm install
npm run dev
```
The application will typically be available at `http://localhost:3000` or similar (check the agent's output or `init.sh` for the exact URL).
## Command Line Options
| Option | Description | Default |
|--------|-------------|---------|
| `--project-dir` | Directory for the project (relative paths are placed under `generations/`) | `./autonomous_demo_project` |
| `--max-iterations` | Max agent iterations | Unlimited |
| `--model` | Claude model to use | `claude-opus-4-5-20251101` |
## Customization
### Changing the Application
Edit `prompts/app_spec.txt` to specify a different application to build.
### Adjusting Feature Count
Edit `prompts/initializer_prompt.md` and change the "200 features" requirement to a smaller number for faster demos.
### Modifying Allowed Commands
Edit `security.py` to add or remove commands from `ALLOWED_COMMANDS`.
## Troubleshooting
**"Appears to hang on first run"**
This is normal. The initializer agent is generating 200 detailed test cases, which takes significant time. Watch for `[Tool: ...]` output to confirm the agent is working.
**"Command blocked by security hook"**
The agent tried to run a command not in the allowlist. This is the security system working as intended. If needed, add the command to `ALLOWED_COMMANDS` in `security.py`.
**"API key not set"**
Ensure `ANTHROPIC_API_KEY` (or `CLAUDE_CODE_OAUTH_TOKEN`, which the entry point also accepts) is exported in your shell environment.
## License
Internal Anthropic use.

reference/SETUP.md (normal file, 99 lines added)

@@ -0,0 +1,99 @@
# Autonomous Coding Agent Setup
This autonomous coding agent now uses the **Claude Code CLI directly** instead of the Python SDK.
## Prerequisites
1. **Claude Code** must be installed on your system
2. You must authenticate Claude Code for **headless mode** (the `--print` flag)
## Authentication Setup
The `--print` (headless) mode requires a long-lived authentication token. To set this up:
### Option 1: Setup Token (Recommended)
Run this command in your own terminal (requires Claude subscription):
```bash
claude setup-token
```
This will open your browser and authenticate Claude Code for headless usage.
### Option 2: Use API Key
If you have an Anthropic API key instead:
```bash
export ANTHROPIC_API_KEY='your-api-key-here'
```
Or for OAuth tokens:
```bash
export CLAUDE_CODE_OAUTH_TOKEN='your-oauth-token-here'
```
## Usage
Once authenticated, run:
```bash
python3 autonomous_agent_demo.py --project-dir ./my_project --max-iterations 3
```
### Options:
- `--project-dir`: Directory for your project (default: `./autonomous_demo_project`)
- `--max-iterations`: Maximum number of agent iterations (default: unlimited)
- `--model`: Claude model to use (default: `claude-opus-4-5-20251101`, i.e. Opus 4.5; aliases such as `opus` or `sonnet` can also be passed)
### Examples:
```bash
# Start a new project with Opus 4.5
python3 autonomous_agent_demo.py --project-dir ./my_app
# Limit iterations for testing
python3 autonomous_agent_demo.py --project-dir ./my_app --max-iterations 5
# Use a different model
python3 autonomous_agent_demo.py --project-dir ./my_app --model sonnet
```
## How It Works
The agent:
1. Creates configuration files (`.claude_settings.json`, `.mcp_config.json`)
2. Calls `claude --print` with your prompt
3. Captures the output and continues the autonomous loop
4. Uses your existing Claude Code authentication
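Step 2 can be pictured as a subprocess call that pipes the prompt to `claude --print`, mirroring the `echo "Hello" | claude --print --model opus` check in the troubleshooting section. This is a minimal sketch under that assumption; `build_headless_command` and `run_headless` are hypothetical names, and the real harness may pass additional flags.

```python
import subprocess
from typing import Optional


def build_headless_command(model: str) -> list[str]:
    """Build the argv for a non-interactive `claude --print` call."""
    return ["claude", "--print", "--model", model]


def run_headless(prompt: str, model: str = "opus", timeout: Optional[float] = None) -> str:
    """Send the prompt on stdin and return whatever Claude Code prints."""
    result = subprocess.run(
        build_headless_command(model),
        input=prompt,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Passing the prompt on stdin rather than as an argument avoids shell quoting issues with long, multi-line prompts.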
## Troubleshooting
### "Invalid API key" Error
This means Claude Code isn't authenticated for headless mode. Run:
```bash
claude setup-token
```
### Check Authentication Status
Test if headless mode works:
```bash
echo "Hello" | claude --print --model opus
```
If this works, the autonomous agent will work too.
### Still Having Issues?
1. Make sure Claude Code is installed: `claude --version`
2. Check that you can run Claude normally: `claude`
3. Verify `claude` is in your PATH: `which claude`
4. Try re-authenticating: `claude setup-token`
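The checks above can also be scripted. The sketch below is illustrative only: `find_auth_source` and `preflight_problems` are hypothetical helpers, and it looks only at the two environment variables the entry point checks, so credentials that are not exported will be reported as missing.

```python
import os
import shutil
from typing import Mapping, Optional


def find_auth_source(env: Mapping[str, str]) -> Optional[str]:
    """Return the name of the first auth variable that is set, if any."""
    for name in ("ANTHROPIC_API_KEY", "CLAUDE_CODE_OAUTH_TOKEN"):
        if env.get(name):
            return name
    return None


def preflight_problems(env: Mapping[str, str], claude_path: Optional[str]) -> list[str]:
    """Collect human-readable problems; an empty list means ready to run."""
    problems = []
    if claude_path is None:
        problems.append("`claude` is not on PATH")
    if find_auth_source(env) is None:
        problems.append("neither ANTHROPIC_API_KEY nor CLAUDE_CODE_OAUTH_TOKEN is set")
    return problems


if __name__ == "__main__":
    # shutil.which is the Python counterpart of `which claude`.
    found = preflight_problems(os.environ, shutil.which("claude"))
    for problem in found:
        print(f"Problem: {problem}")
    if not found:
        print("Preflight checks passed")
```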

reference/agent.py (normal file, 206 lines added)

@@ -0,0 +1,206 @@
"""
Agent Session Logic
===================
Core agent interaction functions for running autonomous coding sessions.
"""
import asyncio
from pathlib import Path
from typing import Optional
from claude_code_sdk import ClaudeSDKClient
from client import create_client
from progress import print_session_header, print_progress_summary
from prompts import get_initializer_prompt, get_coding_prompt, copy_spec_to_project
# Configuration
AUTO_CONTINUE_DELAY_SECONDS = 3
async def run_agent_session(
    client: ClaudeSDKClient,
    message: str,
    project_dir: Path,
) -> tuple[str, str]:
    """
    Run a single agent session using Claude Agent SDK.

    Args:
        client: Claude SDK client
        message: The prompt to send
        project_dir: Project directory path

    Returns:
        (status, response_text) where status is:
        - "continue" if agent should continue working
        - "error" if an error occurred
    """
    print("Sending prompt to Claude Agent SDK...\n")

    try:
        # Send the query
        await client.query(message)

        # Collect response text and show tool use
        response_text = ""
        async for msg in client.receive_response():
            msg_type = type(msg).__name__

            # Handle AssistantMessage (text and tool use)
            if msg_type == "AssistantMessage" and hasattr(msg, "content"):
                for block in msg.content:
                    block_type = type(block).__name__

                    if block_type == "TextBlock" and hasattr(block, "text"):
                        response_text += block.text
                        print(block.text, end="", flush=True)
                    elif block_type == "ToolUseBlock" and hasattr(block, "name"):
                        print(f"\n[Tool: {block.name}]", flush=True)
                        if hasattr(block, "input"):
                            input_str = str(block.input)
                            if len(input_str) > 200:
                                print(f" Input: {input_str[:200]}...", flush=True)
                            else:
                                print(f" Input: {input_str}", flush=True)

            # Handle UserMessage (tool results)
            elif msg_type == "UserMessage" and hasattr(msg, "content"):
                for block in msg.content:
                    block_type = type(block).__name__

                    if block_type == "ToolResultBlock":
                        result_content = getattr(block, "content", "")
                        is_error = getattr(block, "is_error", False)

                        # Check if command was blocked by security hook
                        if "blocked" in str(result_content).lower():
                            print(f" [BLOCKED] {result_content}", flush=True)
                        elif is_error:
                            # Show errors (truncated)
                            error_str = str(result_content)[:500]
                            print(f" [Error] {error_str}", flush=True)
                        else:
                            # Tool succeeded - just show brief confirmation
                            print(" [Done]", flush=True)

        print("\n" + "-" * 70 + "\n")
        return "continue", response_text

    except Exception as e:
        print(f"Error during agent session: {e}")
        return "error", str(e)

async def run_autonomous_agent(
    project_dir: Path,
    model: str,
    max_iterations: Optional[int] = None,
) -> None:
    """
    Run the autonomous agent loop.

    Args:
        project_dir: Directory for the project
        model: Claude model to use
        max_iterations: Maximum number of iterations (None for unlimited)
    """
    print("\n" + "=" * 70)
    print(" AUTONOMOUS CODING AGENT DEMO")
    print("=" * 70)
    print(f"\nProject directory: {project_dir}")
    print(f"Model: {model}")
    if max_iterations:
        print(f"Max iterations: {max_iterations}")
    else:
        print("Max iterations: Unlimited (will run until completion)")
    print()

    # Create project directory
    project_dir.mkdir(parents=True, exist_ok=True)

    # Check if this is a fresh start or continuation
    tests_file = project_dir / "feature_list.json"
    is_first_run = not tests_file.exists()

    if is_first_run:
        print("Fresh start - will use initializer agent")
        print()
        print("=" * 70)
        print(" NOTE: First session takes 10-20+ minutes!")
        print(" The agent is generating 200 detailed test cases.")
        print(" This may appear to hang - it's working. Watch for [Tool: ...] output.")
        print("=" * 70)
        print()
        # Copy the app spec into the project directory for the agent to read
        copy_spec_to_project(project_dir)
    else:
        print("Continuing existing project")
        print_progress_summary(project_dir)

    # Main loop
    iteration = 0

    while True:
        iteration += 1

        # Check max iterations
        if max_iterations and iteration > max_iterations:
            print(f"\nReached max iterations ({max_iterations})")
            print("To continue, run the script again without --max-iterations")
            break

        # Print session header
        print_session_header(iteration, is_first_run)

        # Create client (fresh context)
        client = create_client(project_dir, model)

        # Choose prompt based on session type
        if is_first_run:
            prompt = get_initializer_prompt()
            is_first_run = False  # Only use initializer once
        else:
            prompt = get_coding_prompt()

        # Run session with async context manager
        async with client:
            status, response = await run_agent_session(client, prompt, project_dir)

        # Handle status
        if status == "continue":
            print(f"\nAgent will auto-continue in {AUTO_CONTINUE_DELAY_SECONDS}s...")
            print_progress_summary(project_dir)
            await asyncio.sleep(AUTO_CONTINUE_DELAY_SECONDS)
        elif status == "error":
            print("\nSession encountered an error")
            print("Will retry with a fresh session...")
            await asyncio.sleep(AUTO_CONTINUE_DELAY_SECONDS)

        # Small delay between sessions
        if max_iterations is None or iteration < max_iterations:
            print("\nPreparing next session...\n")
            await asyncio.sleep(1)

    # Final summary
    print("\n" + "=" * 70)
    print(" SESSION COMPLETE")
    print("=" * 70)
    print(f"\nProject directory: {project_dir}")
    print_progress_summary(project_dir)

    # Print instructions for running the generated application
    print("\n" + "-" * 70)
    print(" TO RUN THE GENERATED APPLICATION:")
    print("-" * 70)
    print(f"\n cd {project_dir.resolve()}")
    print(" ./init.sh # Run the setup script")
    print(" # Or manually:")
    print(" npm install && npm run dev")
    print("\n Then open http://localhost:3000 (or check init.sh for the URL)")
    print("-" * 70)

    print("\nDone!")


@@ -0,0 +1,123 @@
#!/usr/bin/env python3
"""
Autonomous Coding Agent Demo
============================
A minimal harness demonstrating long-running autonomous coding with Claude.
This script implements the two-agent pattern (initializer + coding agent) and
incorporates all the strategies from the long-running agents guide.
Example Usage:
python autonomous_agent_demo.py --project-dir ./claude_clone_demo
python autonomous_agent_demo.py --project-dir ./claude_clone_demo --max-iterations 5
"""
import argparse
import asyncio
import os
from pathlib import Path
from agent import run_autonomous_agent
# Configuration
# DEFAULT_MODEL = "claude-haiku-4-5-20251001"
# DEFAULT_MODEL = "claude-sonnet-4-5-20250929"
DEFAULT_MODEL = "claude-opus-4-5-20251101"
def parse_args() -> argparse.Namespace:
    """Parse command line arguments."""
    parser = argparse.ArgumentParser(
        description="Autonomous Coding Agent Demo - Long-running agent harness",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Start fresh project
  python autonomous_agent_demo.py --project-dir ./claude_clone

  # Use a specific model
  python autonomous_agent_demo.py --project-dir ./claude_clone --model claude-sonnet-4-5-20250929

  # Limit iterations for testing
  python autonomous_agent_demo.py --project-dir ./claude_clone --max-iterations 5

  # Continue existing project
  python autonomous_agent_demo.py --project-dir ./claude_clone

Environment Variables (one of the two is required):
  ANTHROPIC_API_KEY         Your Anthropic API key
  CLAUDE_CODE_OAUTH_TOKEN   Claude Code auth token (from `claude setup-token`)
        """,
    )

    parser.add_argument(
        "--project-dir",
        type=Path,
        default=Path("./autonomous_demo_project"),
        help="Directory for the project (default: generations/autonomous_demo_project). Relative paths automatically placed in generations/ directory.",
    )

    parser.add_argument(
        "--max-iterations",
        type=int,
        default=None,
        help="Maximum number of agent iterations (default: unlimited)",
    )

    parser.add_argument(
        "--model",
        type=str,
        default=DEFAULT_MODEL,
        help=f"Claude model to use (default: {DEFAULT_MODEL})",
    )

    return parser.parse_args()

def main() -> None:
    """Main entry point."""
    args = parse_args()

    # Check for auth: allow either API key or Claude Code auth token
    has_api_key = bool(os.environ.get("ANTHROPIC_API_KEY"))
    has_oauth_token = bool(os.environ.get("CLAUDE_CODE_OAUTH_TOKEN"))
    if not (has_api_key or has_oauth_token):
        print("Error: No Claude auth configured.")
        print("\nSet ONE of the following:")
        print(" # Standard API key from console.anthropic.com")
        print(" export ANTHROPIC_API_KEY='your-api-key-here'")
        print("\n # Or, your Claude Code auth token (from `claude setup-token`)")
        print(" export CLAUDE_CODE_OAUTH_TOKEN='your-claude-code-auth-token'")
        return

    # Place relative paths under generations/ unless they already start there;
    # absolute paths are used as-is. Comparing path components (rather than a
    # "generations/" string prefix) also works with Windows path separators.
    project_dir = args.project_dir
    if not project_dir.is_absolute() and project_dir.parts[:1] != ("generations",):
        project_dir = Path("generations") / project_dir

    # Run the agent
    try:
        asyncio.run(
            run_autonomous_agent(
                project_dir=project_dir,
                model=args.model,
                max_iterations=args.max_iterations,
            )
        )
    except KeyboardInterrupt:
        print("\n\nInterrupted by user")
        print("To resume, run the same command again")
    except Exception as e:
        print(f"\nFatal error: {e}")
        raise


if __name__ == "__main__":
    main()

reference/client.py (normal file, 130 lines added)

@@ -0,0 +1,130 @@
"""
Claude SDK Client Configuration
===============================
Functions for creating and configuring the Claude Agent SDK client.
"""
import json
import os
from pathlib import Path
from claude_code_sdk import ClaudeCodeOptions, ClaudeSDKClient
from claude_code_sdk.types import HookMatcher
from security import bash_security_hook
# Puppeteer MCP tools for browser automation
PUPPETEER_TOOLS = [
    "mcp__puppeteer__puppeteer_navigate",
    "mcp__puppeteer__puppeteer_screenshot",
    "mcp__puppeteer__puppeteer_click",
    "mcp__puppeteer__puppeteer_fill",
    "mcp__puppeteer__puppeteer_select",
    "mcp__puppeteer__puppeteer_hover",
    "mcp__puppeteer__puppeteer_evaluate",
]

# Built-in tools
BUILTIN_TOOLS = [
    "Read",
    "Write",
    "Edit",
    "Glob",
    "Grep",
    "Bash",
]

def create_client(project_dir: Path, model: str) -> ClaudeSDKClient:
    """Create a Claude Agent SDK client with multi-layered security.

    Auth options
    ------------
    This demo supports two ways of authenticating:
    1. API key via ``ANTHROPIC_API_KEY`` (standard Claude API key)
    2. Claude Code auth token via ``CLAUDE_CODE_OAUTH_TOKEN``

    If neither is set, client creation will fail with a clear error.

    Args:
        project_dir: Directory for the project
        model: Claude model to use

    Returns:
        Configured ClaudeSDKClient

    Security layers (defense in depth):
    1. Sandbox - OS-level bash command isolation prevents filesystem escape
    2. Permissions - File operations restricted to project_dir only
    3. Security hooks - Bash commands validated against an allowlist
       (see security.py for ALLOWED_COMMANDS)
    """
    api_key = os.environ.get("ANTHROPIC_API_KEY")
    oauth_token = os.environ.get("CLAUDE_CODE_OAUTH_TOKEN")
    if not api_key and not oauth_token:
        raise ValueError(
            "No Claude auth configured. Set either ANTHROPIC_API_KEY (Claude API key) "
            "or CLAUDE_CODE_OAUTH_TOKEN (Claude Code auth token from `claude setup-token`)."
        )

    # Create comprehensive security settings
    # Note: Using relative paths ("./**") restricts access to project directory
    # since cwd is set to project_dir
    security_settings = {
        "sandbox": {"enabled": True, "autoAllowBashIfSandboxed": True},
        "permissions": {
            "defaultMode": "acceptEdits",  # Auto-approve edits within allowed directories
            "allow": [
                # Allow all file operations within the project directory
                "Read(./**)",
                "Write(./**)",
                "Edit(./**)",
                "Glob(./**)",
                "Grep(./**)",
                # Bash permission granted here, but actual commands are validated
                # by the bash_security_hook (see security.py for allowed commands)
                "Bash(*)",
                # Allow Puppeteer MCP tools for browser automation
                *PUPPETEER_TOOLS,
            ],
        },
    }

    # Ensure project directory exists before creating settings file
    project_dir.mkdir(parents=True, exist_ok=True)

    # Write settings to a file in the project directory
    settings_file = project_dir / ".claude_settings.json"
    with open(settings_file, "w") as f:
        json.dump(security_settings, f, indent=2)

    print(f"Created security settings at {settings_file}")
    print(" - Sandbox enabled (OS-level bash isolation)")
    print(f" - Filesystem restricted to: {project_dir.resolve()}")
    print(" - Bash commands restricted to allowlist (see security.py)")
    print(" - MCP servers: puppeteer (browser automation)")
    print()

    return ClaudeSDKClient(
        options=ClaudeCodeOptions(
            model=model,
            system_prompt="You are an expert full-stack developer building a production-quality web application.",
            allowed_tools=[
                *BUILTIN_TOOLS,
                *PUPPETEER_TOOLS,
            ],
            mcp_servers={
                "puppeteer": {"command": "npx", "args": ["puppeteer-mcp-server"]}
            },
            hooks={
                "PreToolUse": [
                    HookMatcher(matcher="Bash", hooks=[bash_security_hook]),
                ],
            },
            max_turns=1000,
            cwd=str(project_dir.resolve()),
            settings=str(settings_file.resolve()),  # Use absolute path
        )
    )

reference/progress.py (normal file, 57 lines added)

@@ -0,0 +1,57 @@
"""
Progress Tracking Utilities
===========================
Functions for tracking and displaying progress of the autonomous coding agent.
"""
import json
from pathlib import Path
def count_passing_tests(project_dir: Path) -> tuple[int, int]:
    """
    Count passing and total tests in feature_list.json.

    Args:
        project_dir: Directory containing feature_list.json

    Returns:
        (passing_count, total_count)
    """
    tests_file = project_dir / "feature_list.json"

    if not tests_file.exists():
        return 0, 0

    try:
        with open(tests_file, "r") as f:
            tests = json.load(f)

        total = len(tests)
        passing = sum(1 for test in tests if test.get("passes", False))

        return passing, total
    except (json.JSONDecodeError, IOError):
        return 0, 0


def print_session_header(session_num: int, is_initializer: bool) -> None:
    """Print a formatted header for the session."""
    session_type = "INITIALIZER" if is_initializer else "CODING AGENT"

    print("\n" + "=" * 70)
    print(f" SESSION {session_num}: {session_type}")
    print("=" * 70)
    print()


def print_progress_summary(project_dir: Path) -> None:
    """Print a summary of current progress."""
    passing, total = count_passing_tests(project_dir)

    if total > 0:
        percentage = (passing / total) * 100
        print(f"\nProgress: {passing}/{total} tests passing ({percentage:.1f}%)")
    else:
        print("\nProgress: feature_list.json not yet created")

reference/prompts.py (normal file, 37 lines added)

@@ -0,0 +1,37 @@
"""
Prompt Loading Utilities
========================
Functions for loading prompt templates from the prompts directory.
"""
import shutil
from pathlib import Path
PROMPTS_DIR = Path(__file__).parent / "prompts"
def load_prompt(name: str) -> str:
    """Load a prompt template from the prompts directory."""
    prompt_path = PROMPTS_DIR / f"{name}.md"
    return prompt_path.read_text()


def get_initializer_prompt() -> str:
    """Load the initializer prompt."""
    return load_prompt("initializer_prompt")


def get_coding_prompt() -> str:
    """Load the coding agent prompt."""
    return load_prompt("coding_prompt")


def copy_spec_to_project(project_dir: Path) -> None:
    """Copy the app spec file into the project directory for the agent to read."""
    spec_source = PROMPTS_DIR / "app_spec.txt"
    spec_dest = project_dir / "app_spec.txt"
    if not spec_dest.exists():
        shutil.copy(spec_source, spec_dest)
        print("Copied app_spec.txt to project directory")

File diff suppressed because it is too large


@@ -0,0 +1,291 @@
## YOUR ROLE - CODING AGENT
You are continuing work on a long-running autonomous development task.
This is a FRESH context window - you have no memory of previous sessions.
### STEP 1: GET YOUR BEARINGS (MANDATORY)
Start by orienting yourself:
```bash
# 1. See your working directory
pwd
# 2. List files to understand project structure
ls -la
# 3. Read the project specification to understand what you're building
cat app_spec.txt
# 4. Read the feature list to see all work
cat feature_list.json | head -50
# 5. Read progress notes from previous sessions
cat claude-progress.txt
# 6. Check recent git history
git log --oneline -20
# 7. Count remaining tests
cat feature_list.json | grep '"passes": false' | wc -l
```
Understanding the `app_spec.txt` is critical - it contains the full requirements
for the application you're building.
### STEP 2: START SERVERS (IF NOT RUNNING)
If `init.sh` exists, run it:
```bash
chmod +x init.sh
./init.sh
```
Otherwise, start servers manually and document the process.
### STEP 3: VERIFICATION TEST (CRITICAL!)
**MANDATORY BEFORE NEW WORK:**
The previous session may have introduced bugs. Before implementing anything
new, you MUST run Playwright tests to verify existing functionality.
```bash
# Run all existing Playwright tests
npx playwright test
# Or run tests for a specific feature
npx playwright test tests/[feature-name].spec.ts
```
If Playwright tests don't exist yet, create them in a `tests/` directory before proceeding.
**If any tests fail:**
- Mark that feature as "passes": false immediately in feature_list.json
- Fix all failing tests BEFORE moving to new features
- This includes UI bugs like:
  - White-on-white text or poor contrast
  - Random characters displayed
  - Incorrect timestamps
  - Layout issues or overflow
  - Buttons too close together
  - Missing hover states
  - Console errors
### STEP 4: CHOOSE ONE FEATURE TO IMPLEMENT
Look at feature_list.json and find the highest-priority feature with "passes": false.
Focus on completing one feature perfectly and completing its testing steps in this session before moving on to other features.
It's ok if you only complete one feature in this session, as there will be more sessions later that continue to make progress.
### STEP 5: IMPLEMENT THE FEATURE
Implement the chosen feature thoroughly:
1. Write the code (frontend and/or backend as needed)
2. Write a Playwright happy path test for the feature (see Step 6)
3. Run the test and fix any issues discovered
4. Verify all tests pass before moving on
### STEP 6: VERIFY WITH PLAYWRIGHT TESTS
**CRITICAL:** You MUST verify features by writing and running Playwright tests.
**Write Happy Path Tests:**
For each feature, write a Playwright test that covers the happy path - the main user flow that should work correctly. These tests are fast to run and provide quick feedback.
```bash
# Example: Create test file
# tests/[feature-name].spec.ts
# Run the specific test
npx playwright test tests/[feature-name].spec.ts
# Run with headed mode to see the browser (useful for debugging)
npx playwright test tests/[feature-name].spec.ts --headed
```
**Test Structure (example):**
```typescript
import { test, expect } from "@playwright/test";

test("user can send a message and receive response", async ({ page }) => {
  await page.goto("http://localhost:3000");

  // Happy path: main user flow
  await page.fill('[data-testid="message-input"]', "Hello world");
  await page.click('[data-testid="send-button"]');

  // Verify the expected outcome
  await expect(page.locator('[data-testid="message-list"]')).toContainText(
    "Hello world"
  );
});
```
**DO:**
- Write tests that cover the primary user workflow (happy path)
- Use `data-testid` attributes for reliable selectors
- Run tests frequently during development
- Keep tests fast and focused
**DON'T:**
- Only test with curl commands (backend testing alone is insufficient)
- Write overly complex tests with many edge cases initially
- Skip running tests before marking features as passing
- Mark tests passing without all Playwright tests green
- Increase any Playwright timeouts past 10s
### STEP 7: UPDATE feature_list.json (CAREFULLY!)
**YOU CAN ONLY MODIFY ONE FIELD: "passes"**
After thorough verification, change:
```json
"passes": false
```
to:
```json
"passes": true
```
**NEVER:**
- Remove tests
- Edit test descriptions
- Modify test steps
- Combine or consolidate tests
- Reorder tests
**ONLY CHANGE "passes" FIELD AFTER ALL PLAYWRIGHT TESTS PASS.**
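For anyone maintaining the harness, this rule can also be checked from outside the agent. The following is a minimal sketch, assuming both versions of `feature_list.json` have already been parsed into lists of dicts; `only_passes_changed` is a hypothetical helper, not part of this repository.

```python
def only_passes_changed(before: list[dict], after: list[dict]) -> bool:
    """Return True if the two feature lists differ at most in "passes" fields."""
    if len(before) != len(after):
        return False  # a test was added or removed
    for old, new in zip(before, after):
        old_rest = {key: value for key, value in old.items() if key != "passes"}
        new_rest = {key: value for key, value in new.items() if key != "passes"}
        if old_rest != new_rest:
            return False  # a test was edited, reordered, or merged
    return True


if __name__ == "__main__":
    before = [{"description": "login works", "steps": ["open page"], "passes": False}]
    after = [{"description": "login works", "steps": ["open page"], "passes": True}]
    print(only_passes_changed(before, after))
```

One way to obtain the earlier version is `git show HEAD:feature_list.json`, parsed with `json.loads`, compared against the working copy before committing.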
### STEP 8: COMMIT YOUR PROGRESS
Make a descriptive git commit:
```bash
git add .
git commit -m "Implement [feature name] - verified with Playwright tests
- Added [specific changes]
- Added/updated Playwright tests in tests/
- All tests passing
- Updated feature_list.json: marked test #X as passing
"
git push origin main
```
### STEP 9: UPDATE PROGRESS NOTES
Update `claude-progress.txt` with:
- What you accomplished this session
- Which test(s) you completed
- Any issues discovered or fixed
- What should be worked on next
- Current completion status (e.g., "45/200 tests passing")
### STEP 10: END SESSION CLEANLY
Before context fills up:
1. Commit all working code
2. Update claude-progress.txt
3. Update feature_list.json if tests verified
4. Ensure no uncommitted changes
5. Leave app in working state (no broken features)
---
## TESTING REQUIREMENTS
**ALL testing must use Playwright tests.**
**Setup (if not already done):**
```bash
# Install Playwright
npm install -D @playwright/test
# Install browsers
npx playwright install
```
**Writing Tests:**
Create tests in the `tests/` directory with `.spec.ts` extension.
```typescript
// tests/example.spec.ts
import { test, expect } from "@playwright/test";

test.describe("Feature Name", () => {
  test("happy path: user completes main workflow", async ({ page }) => {
    await page.goto("http://localhost:3000");

    // Interact with UI elements
    await page.click('button[data-testid="action"]');
    await page.fill('input[data-testid="input"]', "test value");

    // Assert expected outcomes
    await expect(page.locator('[data-testid="result"]')).toBeVisible();
  });
});
```
**Running Tests:**
```bash
# Run all tests (fast, headless)
npx playwright test
# Run specific test file
npx playwright test tests/feature.spec.ts
# Run with browser visible (for debugging)
npx playwright test --headed
# Run with UI mode (interactive debugging)
npx playwright test --ui
```
**Best Practices:**
- Add `data-testid` attributes to elements for reliable selectors
- Focus on happy path tests first - they're fast and catch most regressions
- Keep tests independent and isolated
- Write tests as you implement features, not after
---
## IMPORTANT REMINDERS
**Your Goal:** Production-quality application with all 200+ tests passing
**This Session's Goal:** Complete at least one feature perfectly
**Priority:** Fix broken tests before implementing new features
**Quality Bar:**
- Zero console errors
- Polished UI matching the design specified in app_spec.txt (treat the landing page and the generate page as the reference for how polished the design should look)
- All features work end-to-end through the UI
- Fast, responsive, professional
**You have unlimited time.** Take as long as needed to get it right. The most important thing is that you
leave the code base in a clean state before terminating the session (Step 10).
---
Begin by running Step 1 (Get Your Bearings).


@@ -0,0 +1,106 @@
## YOUR ROLE - INITIALIZER AGENT (Session 1 of Many)
You are the FIRST agent in a long-running autonomous development process.
Your job is to set up the foundation for all future coding agents.
### FIRST: Read the Project Specification
Start by reading `app_spec.txt` in your working directory. This file contains
the complete specification for what you need to build. Read it carefully
before proceeding.
### CRITICAL FIRST TASK: Create feature_list.json
Based on `app_spec.txt`, create a file called `feature_list.json` with 200 detailed
end-to-end test cases. This file is the single source of truth for what
needs to be built.
**Format:**
```json
[
  {
    "category": "functional",
    "description": "Brief description of the feature and what this test verifies",
    "steps": [
      "Step 1: Navigate to relevant page",
      "Step 2: Perform action",
      "Step 3: Verify expected result"
    ],
    "passes": false
  },
  {
    "category": "style",
    "description": "Brief description of UI/UX requirement",
    "steps": [
      "Step 1: Navigate to page",
      "Step 2: Take screenshot",
      "Step 3: Verify visual requirements"
    ],
    "passes": false
  }
]
```
**Requirements for feature_list.json:**
- Minimum 200 features total with testing steps for each
- Both "functional" and "style" categories
- Mix of narrow tests (2-5 steps) and comprehensive tests (10+ steps)
- At least 25 tests MUST have 10+ steps each
- Order features by priority: fundamental features first
- ALL tests start with "passes": false
- Cover every feature in the spec exhaustively
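
The countable requirements above can be checked mechanically. The following is a minimal sketch, not part of this commit: the function name `check_feature_list` and the constants are hypothetical, and it only covers the rules that can be counted (total size, categories, long tests, initial `"passes": false`), not whether the spec is covered exhaustively.

```python
from collections import Counter

# Thresholds taken from the requirements list above.
MIN_FEATURES = 200
MIN_LONG_TESTS = 25
LONG_TEST_STEPS = 10
REQUIRED_CATEGORIES = {"functional", "style"}


def check_feature_list(features: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []

    if len(features) < MIN_FEATURES:
        problems.append(f"expected at least {MIN_FEATURES} features, found {len(features)}")

    categories = Counter(f.get("category") for f in features)
    missing = REQUIRED_CATEGORIES - set(categories)
    if missing:
        problems.append(f"missing categories: {sorted(missing)}")

    long_tests = sum(1 for f in features if len(f.get("steps", [])) >= LONG_TEST_STEPS)
    if long_tests < MIN_LONG_TESTS:
        problems.append(
            f"expected at least {MIN_LONG_TESTS} tests with {LONG_TEST_STEPS}+ steps, found {long_tests}"
        )

    for index, feature in enumerate(features):
        if not feature.get("description") or not feature.get("steps"):
            problems.append(f"feature {index} lacks a description or steps")
        if feature.get("passes") is not False:
            problems.append(f"feature {index} must start with \"passes\": false")

    return problems


def make_feature(category: str, n_steps: int) -> dict:
    """Build a placeholder feature entry (illustration only)."""
    steps = [f"Step {i + 1}: placeholder" for i in range(n_steps)]
    return {"category": category, "description": "placeholder", "steps": steps, "passes": False}


# Synthetic data: 25 long functional tests plus 175 short style tests.
sample = [make_feature("functional", 10) for _ in range(25)]
sample += [make_feature("style", 3) for _ in range(175)]

print(check_feature_list(sample))       # the full list
print(check_feature_list(sample[:30]))  # a truncated list, which is too short
```

In practice the list would come from `json.load` on `feature_list.json` rather than from synthetic data.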
**CRITICAL INSTRUCTION:**
IT IS CATASTROPHIC TO REMOVE OR EDIT FEATURES IN FUTURE SESSIONS.
Features can ONLY be marked as passing (change "passes": false to "passes": true).
Never remove features, never edit descriptions, never modify testing steps.
This ensures no functionality is missed.
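
One way to enforce this rule between sessions is to compare the previous and current versions of the list and flag any difference outside the `"passes"` field. The sketch below is an assumption-laden illustration, not part of this commit: `illegal_changes` is a hypothetical helper, and it deliberately does not judge the direction of a `"passes"` change.

```python
def illegal_changes(old: list[dict], new: list[dict]) -> list[str]:
    """List every difference between two feature lists other than the 'passes' flag."""
    problems = []

    if len(new) != len(old):
        problems.append(f"feature count changed from {len(old)} to {len(new)}")

    # Compare entries position by position; order is part of the contract.
    for index, (before, after) in enumerate(zip(old, new)):
        for key in sorted(set(before) | set(after)):
            if key == "passes":
                continue  # the only field that may change
            if before.get(key) != after.get(key):
                problems.append(f"feature {index}: field '{key}' was modified")

    return problems


old_list = [
    {"category": "functional", "description": "Login works", "steps": ["Step 1: Open page"], "passes": False},
    {"category": "style", "description": "Header is aligned", "steps": ["Step 1: Take screenshot"], "passes": False},
]

# Allowed: only the 'passes' flag differs.
marked_passing = [dict(f) for f in old_list]
marked_passing[0]["passes"] = True
print(illegal_changes(old_list, marked_passing))

# Not allowed: a description was edited and a feature was removed.
edited = [dict(old_list[0], description="Login sometimes works")]
print(illegal_changes(old_list, edited))
```

The old version could be read from the previous git commit and the new one from the working tree before each commit is made.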
### SECOND TASK: Create init.sh
Create a script called `init.sh` that future agents can use to quickly
set up and run the development environment. The script should:
1. Install any required dependencies
2. Start any necessary servers or services
3. Print helpful information about how to access the running application
Base the script on the technology stack specified in `app_spec.txt`.
### THIRD TASK: Initialize Git
Create a git repository and make your first commit with:
- feature_list.json (complete with all 200+ features)
- init.sh (environment setup script)
- README.md (project overview and setup instructions)
Commit message: "Initial setup: feature_list.json, init.sh, and project structure"
### FOURTH TASK: Create Project Structure
Set up the basic project structure based on what's specified in `app_spec.txt`.
This typically includes directories for frontend, backend, and any other
components mentioned in the spec.
### OPTIONAL: Start Implementation
If you have time remaining in this session, you may begin implementing
the highest-priority features from feature_list.json. Remember:
- Work on ONE feature at a time
- Test thoroughly before marking "passes": true
- Commit your progress before session ends
### ENDING THIS SESSION
Before your context fills up:
1. Commit all work with descriptive messages
2. Create `claude-progress.txt` with a summary of what you accomplished
3. Ensure feature_list.json is complete and saved
4. Leave the environment in a clean, working state
The next agent will continue from here with a fresh context window.
---
**Remember:** You have unlimited time across many sessions. Focus on
quality over speed. Production-ready is the goal.


@@ -0,0 +1 @@
claude-code-sdk>=0.0.25

370
reference/security.py Normal file

@@ -0,0 +1,370 @@
"""
Security Hooks for Autonomous Coding Agent
==========================================
Pre-tool-use hooks that validate bash commands for security.
Uses an allowlist approach - only explicitly permitted commands can run.
"""
import os
import shlex
# Allowed commands for development tasks
# Minimal set needed for the autonomous coding demo
ALLOWED_COMMANDS = {
    # File inspection
    "ls",
    "cat",
    "head",
    "tail",
    "wc",
    "grep",
    # File operations (agent uses SDK tools for most file ops, but cp/mkdir needed occasionally)
    "cp",
    "mkdir",
    "chmod",  # For making scripts executable; validated separately
    # Directory
    "pwd",
    # Node.js development
    "npm",
    "node",
    # Version control
    "git",
    # Process management
    "ps",
    "lsof",
    "sleep",
    "pkill",  # For killing dev servers; validated separately
    # Script execution
    "init.sh",  # Init scripts; validated separately
    # JSON manipulation
    "jq",
    # Networking
    "curl",
    # Utility
    "xargs",
    "echo",
    "mv",
    "rm",
    "npx",
}

# Commands that need additional validation even when in the allowlist
COMMANDS_NEEDING_EXTRA_VALIDATION = {"pkill", "chmod", "init.sh"}


def split_command_segments(command_string: str) -> list[str]:
    """
    Split a compound command into individual command segments.

    Handles command chaining (&&, ||, ;) but not pipes (those are single commands).

    Args:
        command_string: The full shell command

    Returns:
        List of individual command segments
    """
    import re

    # Split on && and || so each segment can be validated on its own.
    # Note: this regex is not quote-aware, so an operator inside a quoted
    # string is split as well.
    segments = re.split(r"\s*(?:&&|\|\|)\s*", command_string)

    # Further split on semicolons (simple heuristic: only semicolons directly
    # adjacent to a quote character are left alone)
    result = []
    for segment in segments:
        sub_segments = re.split(r'(?<!["\'])\s*;\s*(?!["\'])', segment)
        for sub in sub_segments:
            sub = sub.strip()
            if sub:
                result.append(sub)

    return result


def extract_commands(command_string: str) -> list[str]:
    """
    Extract command names from a shell command string.

    Handles pipes, command chaining (&&, ||, ;), and subshells.
    Returns the base command names (without paths).

    Args:
        command_string: The full shell command

    Returns:
        List of command names found in the string
    """
    commands = []

    # shlex doesn't treat ; as a separator, so we need to pre-process
    import re

    # Split on semicolons that aren't inside quotes (simple heuristic)
    # This handles common cases like "echo hello; ls"
    segments = re.split(r'(?<!["\'])\s*;\s*(?!["\'])', command_string)

    for segment in segments:
        segment = segment.strip()
        if not segment:
            continue

        try:
            tokens = shlex.split(segment)
        except ValueError:
            # Malformed command (unclosed quotes, etc.)
            # Return empty to trigger block (fail-safe)
            return []

        if not tokens:
            continue

        # Track when we expect a command vs arguments
        expect_command = True

        for token in tokens:
            # Shell operators indicate a new command follows
            if token in ("|", "||", "&&", "&"):
                expect_command = True
                continue

            # Skip shell keywords that precede commands
            if token in (
                "if",
                "then",
                "else",
                "elif",
                "fi",
                "for",
                "while",
                "until",
                "do",
                "done",
                "case",
                "esac",
                "in",
                "!",
                "{",
                "}",
            ):
                continue

            # Skip flags/options
            if token.startswith("-"):
                continue

            # Skip variable assignments (VAR=value)
            if "=" in token and not token.startswith("="):
                continue

            if expect_command:
                # Extract the base command name (handle paths like /usr/bin/python)
                cmd = os.path.basename(token)
                commands.append(cmd)
                expect_command = False

    return commands


def validate_pkill_command(command_string: str) -> tuple[bool, str]:
    """
    Validate pkill commands - only allow killing dev-related processes.

    Uses shlex to parse the command, avoiding regex bypass vulnerabilities.

    Returns:
        Tuple of (is_allowed, reason_if_blocked)
    """
    # Allowed process names for pkill
    allowed_process_names = {
        "node",
        "npm",
        "npx",
        "vite",
        "next",
    }

    try:
        tokens = shlex.split(command_string)
    except ValueError:
        return False, "Could not parse pkill command"

    if not tokens:
        return False, "Empty pkill command"

    # Separate flags from arguments
    args = []
    for token in tokens[1:]:
        if not token.startswith("-"):
            args.append(token)

    if not args:
        return False, "pkill requires a process name"

    # The target is typically the last non-flag argument
    target = args[-1]

    # For -f flag (full command line match), extract the first word as process name
    # e.g., "pkill -f 'node server.js'" -> target is "node server.js", process is "node"
    if " " in target:
        target = target.split()[0]

    if target in allowed_process_names:
        return True, ""
    return False, f"pkill only allowed for dev processes: {allowed_process_names}"


def validate_chmod_command(command_string: str) -> tuple[bool, str]:
    """
    Validate chmod commands - only allow making files executable with +x.

    Returns:
        Tuple of (is_allowed, reason_if_blocked)
    """
    try:
        tokens = shlex.split(command_string)
    except ValueError:
        return False, "Could not parse chmod command"

    if not tokens or tokens[0] != "chmod":
        return False, "Not a chmod command"

    # Look for the mode argument
    # Valid modes: +x, u+x, a+x, etc. (anything ending with +x for execute permission)
    mode = None
    files = []

    for token in tokens[1:]:
        if token.startswith("-"):
            # Reject anything that looks like a flag (e.g. -R); this also
            # rejects modes that remove permissions, such as -x
            return False, "chmod flags are not allowed"
        elif mode is None:
            mode = token
        else:
            files.append(token)

    if mode is None:
        return False, "chmod requires a mode"

    if not files:
        return False, "chmod requires at least one file"

    # Only allow +x variants (making files executable)
    # This matches: +x, u+x, g+x, o+x, a+x, ug+x, etc.
    import re

    if not re.match(r"^[ugoa]*\+x$", mode):
        return False, f"chmod only allowed with +x mode, got: {mode}"

    return True, ""


def validate_init_script(command_string: str) -> tuple[bool, str]:
    """
    Validate init.sh script execution - only allow ./init.sh or a path ending in /init.sh.

    Returns:
        Tuple of (is_allowed, reason_if_blocked)
    """
    try:
        tokens = shlex.split(command_string)
    except ValueError:
        return False, "Could not parse init script command"

    if not tokens:
        return False, "Empty command"

    # The script must be the first token (arguments may follow), so
    # invocations such as "bash init.sh" are rejected
    script = tokens[0]

    # Allow ./init.sh or paths ending in /init.sh
    if script == "./init.sh" or script.endswith("/init.sh"):
        return True, ""

    return False, f"Only ./init.sh is allowed, got: {script}"


def get_command_for_validation(cmd: str, segments: list[str]) -> str:
    """
    Find the specific command segment that contains the given command.

    Args:
        cmd: The command name to find
        segments: List of command segments

    Returns:
        The segment containing the command, or empty string if not found
    """
    for segment in segments:
        segment_commands = extract_commands(segment)
        if cmd in segment_commands:
            return segment
    return ""


async def bash_security_hook(input_data, tool_use_id=None, context=None):
    """
    Pre-tool-use hook that validates bash commands using an allowlist.

    Only commands in ALLOWED_COMMANDS are permitted.

    Args:
        input_data: Dict containing tool_name and tool_input
        tool_use_id: Optional tool use ID
        context: Optional context

    Returns:
        Empty dict to allow, or {"decision": "block", "reason": "..."} to block
    """
    if input_data.get("tool_name") != "Bash":
        return {}

    command = input_data.get("tool_input", {}).get("command", "")
    if not command:
        return {}

    # Extract all commands from the command string
    commands = extract_commands(command)

    if not commands:
        # Could not parse - fail safe by blocking
        return {
            "decision": "block",
            "reason": f"Could not parse command for security validation: {command}",
        }

    # Split into segments for per-command validation
    segments = split_command_segments(command)

    # Check each command against the allowlist
    for cmd in commands:
        if cmd not in ALLOWED_COMMANDS:
            return {
                "decision": "block",
                "reason": f"Command '{cmd}' is not in the allowed commands list",
            }

        # Additional validation for sensitive commands
        if cmd in COMMANDS_NEEDING_EXTRA_VALIDATION:
            # Find the specific segment containing this command
            cmd_segment = get_command_for_validation(cmd, segments)
            if not cmd_segment:
                cmd_segment = command  # Fallback to full command

            if cmd == "pkill":
                allowed, reason = validate_pkill_command(cmd_segment)
                if not allowed:
                    return {"decision": "block", "reason": reason}
            elif cmd == "chmod":
                allowed, reason = validate_chmod_command(cmd_segment)
                if not allowed:
                    return {"decision": "block", "reason": reason}
            elif cmd == "init.sh":
                allowed, reason = validate_init_script(cmd_segment)
                if not allowed:
                    return {"decision": "block", "reason": reason}

    return {}
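
Review note on tokenization: `extract_commands` relies on `shlex.split`, which only yields `&&`, `||`, and `|` as separate tokens when they are surrounded by whitespace. In a command such as `npm install&&node app.js`, the operator stays glued to its neighbours, so the second command is never extracted. A possible hardening, sketched below under the assumption of Python 3.8 or later, is the standard library's `shlex.shlex` with `punctuation_chars=True`. The helper name `tokenize_shell` is hypothetical and is not used anywhere in this commit.

```python
import shlex


def tokenize_shell(command: str) -> list[str]:
    """Tokenize a shell command so that operators become their own tokens."""
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True  # keep paths, flags and URLs as single tokens
    lexer.commenters = ""          # match shlex.split, which does not strip comments
    return list(lexer)


command = "npm install&&node app.js"

# shlex.split keeps the operator glued to the surrounding words.
print(shlex.split(command))

# With punctuation_chars=True the operator is a separate token.
print(tokenize_shell(command))

# Quoted operators are still treated as ordinary text.
print(tokenize_shell('git commit -m "a; b"'))
```

With this tokenizer, `;` also arrives as its own token, so the regex pre-split on semicolons would no longer be needed. The operator check in `extract_commands` would have to include `;` for that to work.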

290
reference/test_security.py Normal file

@@ -0,0 +1,290 @@
#!/usr/bin/env python3
"""
Security Hook Tests
===================

Tests for the bash command security validation logic.
Run with: python test_security.py
"""

import asyncio
import sys

from security import (
    bash_security_hook,
    extract_commands,
    validate_chmod_command,
    validate_init_script,
)


def test_hook(command: str, should_block: bool) -> bool:
    """Test a single command against the security hook."""
    input_data = {"tool_name": "Bash", "tool_input": {"command": command}}
    result = asyncio.run(bash_security_hook(input_data))
    was_blocked = result.get("decision") == "block"

    if was_blocked == should_block:
        status = "PASS"
    else:
        status = "FAIL"
        expected = "blocked" if should_block else "allowed"
        actual = "blocked" if was_blocked else "allowed"
        reason = result.get("reason", "")
        print(f" {status}: {command!r}")
        print(f" Expected: {expected}, Got: {actual}")
        if reason:
            print(f" Reason: {reason}")
        return False

    print(f" {status}: {command!r}")
    return True


def test_extract_commands():
    """Test the command extraction logic."""
    print("\nTesting command extraction:\n")
    passed = 0
    failed = 0

    test_cases = [
        ("ls -la", ["ls"]),
        ("npm install && npm run build", ["npm", "npm"]),
        ("cat file.txt | grep pattern", ["cat", "grep"]),
        ("/usr/bin/node script.js", ["node"]),
        ("VAR=value ls", ["ls"]),
        ("git status || git init", ["git", "git"]),
    ]

    for cmd, expected in test_cases:
        result = extract_commands(cmd)
        if result == expected:
            print(f" PASS: {cmd!r} -> {result}")
            passed += 1
        else:
            print(f" FAIL: {cmd!r}")
            print(f" Expected: {expected}, Got: {result}")
            failed += 1

    return passed, failed


def test_validate_chmod():
    """Test chmod command validation."""
    print("\nTesting chmod validation:\n")
    passed = 0
    failed = 0

    # Test cases: (command, should_be_allowed, description)
    test_cases = [
        # Allowed cases
        ("chmod +x init.sh", True, "basic +x"),
        ("chmod +x script.sh", True, "+x on any script"),
        ("chmod u+x init.sh", True, "user +x"),
        ("chmod a+x init.sh", True, "all +x"),
        ("chmod ug+x init.sh", True, "user+group +x"),
        ("chmod +x file1.sh file2.sh", True, "multiple files"),
        # Blocked cases
        ("chmod 777 init.sh", False, "numeric mode"),
        ("chmod 755 init.sh", False, "numeric mode 755"),
        ("chmod +w init.sh", False, "write permission"),
        ("chmod +r init.sh", False, "read permission"),
        ("chmod -x init.sh", False, "remove execute"),
        ("chmod -R +x dir/", False, "recursive flag"),
        ("chmod --recursive +x dir/", False, "long recursive flag"),
        ("chmod +x", False, "missing file"),
    ]

    for cmd, should_allow, description in test_cases:
        allowed, reason = validate_chmod_command(cmd)
        if allowed == should_allow:
            print(f" PASS: {cmd!r} ({description})")
            passed += 1
        else:
            expected = "allowed" if should_allow else "blocked"
            actual = "allowed" if allowed else "blocked"
            print(f" FAIL: {cmd!r} ({description})")
            print(f" Expected: {expected}, Got: {actual}")
            if reason:
                print(f" Reason: {reason}")
            failed += 1

    return passed, failed


def test_validate_init_script():
    """Test init.sh script execution validation."""
    print("\nTesting init.sh validation:\n")
    passed = 0
    failed = 0

    # Test cases: (command, should_be_allowed, description)
    test_cases = [
        # Allowed cases
        ("./init.sh", True, "basic ./init.sh"),
        ("./init.sh arg1 arg2", True, "with arguments"),
        ("/path/to/init.sh", True, "absolute path"),
        ("../dir/init.sh", True, "relative path with init.sh"),
        # Blocked cases
        ("./setup.sh", False, "different script name"),
        ("./init.py", False, "python script"),
        ("bash init.sh", False, "bash invocation"),
        ("sh init.sh", False, "sh invocation"),
        ("./malicious.sh", False, "malicious script"),
        ("./init.sh; rm -rf /", False, "command injection attempt"),
    ]

    for cmd, should_allow, description in test_cases:
        allowed, reason = validate_init_script(cmd)
        if allowed == should_allow:
            print(f" PASS: {cmd!r} ({description})")
            passed += 1
        else:
            expected = "allowed" if should_allow else "blocked"
            actual = "allowed" if allowed else "blocked"
            print(f" FAIL: {cmd!r} ({description})")
            print(f" Expected: {expected}, Got: {actual}")
            if reason:
                print(f" Reason: {reason}")
            failed += 1

    return passed, failed


def main():
    print("=" * 70)
    print(" SECURITY HOOK TESTS")
    print("=" * 70)

    passed = 0
    failed = 0

    # Test command extraction
    ext_passed, ext_failed = test_extract_commands()
    passed += ext_passed
    failed += ext_failed

    # Test chmod validation
    chmod_passed, chmod_failed = test_validate_chmod()
    passed += chmod_passed
    failed += chmod_failed

    # Test init.sh validation
    init_passed, init_failed = test_validate_init_script()
    passed += init_passed
    failed += init_failed

    # Commands that SHOULD be blocked
    # Note: rm, curl, and echo are listed in ALLOWED_COMMANDS in security.py,
    # so the hook allows them and they cannot appear in this list.
    print("\nCommands that should be BLOCKED:\n")
    dangerous = [
        # Not in allowlist - dangerous system commands
        "shutdown now",
        "reboot",
        "dd if=/dev/zero of=/dev/sda",
        # Not in allowlist - common commands excluded from minimal set
        "wget https://example.com",
        "python app.py",
        "touch file.txt",
        "kill 12345",
        "killall node",
        # pkill with non-dev processes
        "pkill bash",
        "pkill chrome",
        "pkill python",
        # Shell injection attempts
        "$(echo pkill) node",
        'eval "pkill node"',
        'bash -c "pkill node"',
        # chmod with disallowed modes
        "chmod 777 file.sh",
        "chmod 755 file.sh",
        "chmod +w file.sh",
        "chmod -R +x dir/",
        # Non-init.sh scripts
        "./setup.sh",
        "./malicious.sh",
        "bash script.sh",
    ]

    for cmd in dangerous:
        if test_hook(cmd, should_block=True):
            passed += 1
        else:
            failed += 1

    # Commands that SHOULD be allowed
    print("\nCommands that should be ALLOWED:\n")
    safe = [
        # File inspection
        "ls -la",
        "cat README.md",
        "head -100 file.txt",
        "tail -20 log.txt",
        "wc -l file.txt",
        "grep -r pattern src/",
        # File operations
        "cp file1.txt file2.txt",
        "mkdir newdir",
        "mkdir -p path/to/dir",
        # Directory
        "pwd",
        # Node.js development
        "npm install",
        "npm run build",
        "node server.js",
        # Version control
        "git status",
        "git commit -m 'test'",
        "git add . && git commit -m 'msg'",
        # Process management
        "ps aux",
        "lsof -i :3000",
        "sleep 2",
        # Allowed pkill patterns for dev servers
        "pkill node",
        "pkill npm",
        "pkill -f node",
        "pkill -f 'node server.js'",
        "pkill vite",
        # Chained commands
        "npm install && npm run build",
        "ls | grep test",
        # Full paths
        "/usr/local/bin/node app.js",
        # chmod +x (allowed)
        "chmod +x init.sh",
        "chmod +x script.sh",
        "chmod u+x init.sh",
        "chmod a+x init.sh",
        # init.sh execution (allowed)
        "./init.sh",
        "./init.sh --production",
        "/path/to/init.sh",
        # Combined chmod and init.sh
        "chmod +x init.sh && ./init.sh",
    ]

    for cmd in safe:
        if test_hook(cmd, should_block=False):
            passed += 1
        else:
            failed += 1

    # Summary
    print("\n" + "-" * 70)
    print(f" Results: {passed} passed, {failed} failed")
    print("-" * 70)

    if failed == 0:
        print("\n ALL TESTS PASSED")
        return 0
    else:
        print(f"\n {failed} TEST(S) FAILED")
        return 1


if __name__ == "__main__":
    sys.exit(main())