initial commit

2026-03-18 10:23:07 +00:00 · 2025-12-07 16:43:26 -05:00
commit 3c8e786f29
70 changed files with 21487 additions and 0 deletions
--- a/reference/.gitignore
+++ b/reference/.gitignore
@@ -0,0 +1,4 @@
+# Agent-generated output directories
+
+# Log files
+logs/
--- a/reference/README.md
+++ b/reference/README.md
@@ -0,0 +1,163 @@
+# Autonomous Coding Agent Demo
+
+A minimal harness demonstrating long-running autonomous coding with the Claude Agent SDK. This demo implements a two-agent pattern (initializer + coding agent) that can build complete applications over multiple sessions.
+
+## Prerequisites
+
+**Required:** Install the latest versions of both Claude Code and the Claude Agent SDK:
+
+```bash
+# Install Claude Code CLI (latest version required)
+npm install -g @anthropic-ai/claude-code
+
+# Install Python dependencies
+pip install -r requirements.txt
+```
+
+Verify your installations:
+```bash
+claude --version  # Should be latest version
+pip show claude-code-sdk  # Check SDK is installed
+```
+
+**API Key:** Set your Anthropic API key:
+```bash
+export ANTHROPIC_API_KEY='your-api-key-here'
+```
+
+## Quick Start
+
+```bash
+python autonomous_agent_demo.py --project-dir ./my_project
+```
+
+For testing with limited iterations:
+```bash
+python autonomous_agent_demo.py --project-dir ./my_project --max-iterations 3
+```
+
+## Important Timing Expectations
+
+> **Warning: This demo takes a long time to run!**
+
+- **First session (initialization):** The agent generates a `feature_list.json` with 200 test cases. This can take 10-20 minutes or more and may appear to hang; this is normal while the agent writes out all the features.
+
+- **Subsequent sessions:** Each coding iteration can take **5-15 minutes** depending on complexity.
+
+- **Full app:** Building all 200 features typically requires **many hours** of total runtime across multiple sessions.
+
+**Tip:** The 200-feature target in the prompts is designed for comprehensive coverage. For faster demos, edit `prompts/initializer_prompt.md` to reduce the feature count (e.g., 20-50 features).
+
+## How It Works
+
+### Two-Agent Pattern
+
+1. **Initializer Agent (Session 1):** Reads `app_spec.txt`, creates `feature_list.json` with 200 test cases, sets up project structure, and initializes git.
+
+2. **Coding Agent (Sessions 2+):** Picks up where the previous session left off, implements features one by one, and marks them as passing in `feature_list.json`.
+
+### Session Management
+
+- Each session runs with a fresh context window
+- Progress is persisted via `feature_list.json` and git commits
+- The agent auto-continues between sessions (3-second delay)
+- Press `Ctrl+C` to pause; run the same command to resume
+
+## Security Model
+
+This demo uses a defense-in-depth security approach (see `security.py` and `client.py`):
+
+1. **OS-level Sandbox:** Bash commands run in an isolated environment
+2. **Filesystem Restrictions:** File operations restricted to the project directory only
+3. **Bash Allowlist:** Only specific commands are permitted:
+   - File inspection: `ls`, `cat`, `head`, `tail`, `wc`, `grep`
+   - Node.js: `npm`, `node`
+   - Version control: `git`
+   - Process management: `ps`, `lsof`, `sleep`, `pkill` (dev processes only)
+
+Commands not in the allowlist are blocked by the security hook.
+
+## Project Structure
+
+```
+autonomous-coding/
+├── autonomous_agent_demo.py  # Main entry point
+├── agent.py                  # Agent session logic
+├── client.py                 # Claude SDK client configuration
+├── security.py               # Bash command allowlist and validation
+├── progress.py               # Progress tracking utilities
+├── prompts.py                # Prompt loading utilities
+├── prompts/
+│   ├── app_spec.txt          # Application specification
+│   ├── initializer_prompt.md # First session prompt
+│   └── coding_prompt.md      # Continuation session prompt
+└── requirements.txt          # Python dependencies
+```
+
+## Generated Project Structure
+
+After running, your project directory will contain:
+
+```
+my_project/
+├── feature_list.json         # Test cases (source of truth)
+├── app_spec.txt              # Copied specification
+├── init.sh                   # Environment setup script
+├── claude-progress.txt       # Session progress notes
+├── .claude_settings.json     # Security settings
+└── [application files]       # Generated application code
+```
+
+## Running the Generated Application
+
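+After the agent completes (or pauses), you can run the generated application. Relative `--project-dir` paths are placed under `generations/`, so `./my_project` ends up in `generations/my_project`: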
+
+```bash
+cd generations/my_project
+
+# Run the setup script created by the agent
+./init.sh
+
+# Or manually (typical for Node.js apps):
+npm install
+npm run dev
+```
+
+The application will typically be available at `http://localhost:3000` or similar (check the agent's output or `init.sh` for the exact URL).
+
+## Command Line Options
+
+| Option | Description | Default |
+|--------|-------------|---------|
+| `--project-dir` | Directory for the project (relative paths are placed under `generations/`) | `./autonomous_demo_project` |
+| `--max-iterations` | Max agent iterations | Unlimited |
+| `--model` | Claude model to use | `claude-opus-4-5-20251101` |
+
+## Customization
+
+### Changing the Application
+
+Edit `prompts/app_spec.txt` to specify a different application to build.
+
+### Adjusting Feature Count
+
+Edit `prompts/initializer_prompt.md` and change the "200 features" requirement to a smaller number for faster demos.
+
+### Modifying Allowed Commands
+
+Edit `security.py` to add or remove commands from `ALLOWED_COMMANDS`.
+
+## Troubleshooting
+
+**"Appears to hang on first run"**
+This is normal. The initializer agent is generating 200 detailed test cases, which takes significant time. Watch for `[Tool: ...]` output to confirm the agent is working.
+
+**"Command blocked by security hook"**
+The agent tried to run a command not in the allowlist. This is the security system working as intended. If needed, add the command to `ALLOWED_COMMANDS` in `security.py`.
+
+**"API key not set"**
+Ensure `ANTHROPIC_API_KEY` is exported in your shell environment.
+
+## License
+
+Internal Anthropic use.
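
The Bash allowlist described in the README's Security Model section can be pictured with a small validator. This is only a sketch under stated assumptions: `security.py` is not shown in this part of the commit, so `ALLOWED_COMMANDS` below is copied from the README list, and `extract_command_names` and `is_command_allowed` are hypothetical helpers rather than the project's actual implementation. It does not attempt to parse full shell syntax and errs on the side of blocking.

```python
import shlex

# Assumption: copied from the README's allowlist; the real set lives in security.py.
ALLOWED_COMMANDS = {
    "ls", "cat", "head", "tail", "wc", "grep",  # file inspection
    "npm", "node",                              # Node.js
    "git",                                      # version control
    "ps", "lsof", "sleep", "pkill",             # process management
}

# Characters that, on their own, form tokens such as |, ||, &&, ; or (
# after which the next word starts a new command.
_SEPARATOR_CHARS = set("|&;(")


def extract_command_names(command: str) -> list[str]:
    """Return the first word of every command in a simple shell line."""
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    names = []
    expect_command = True
    for token in lexer:
        if token and set(token) <= _SEPARATOR_CHARS:
            expect_command = True
        elif expect_command:
            names.append(token)
            expect_command = False
    return names


def is_command_allowed(command: str) -> bool:
    """Allow the line only if every command name is in the allowlist."""
    try:
        names = extract_command_names(command)
    except ValueError:  # shlex raises this for e.g. unbalanced quotes
        return False
    return bool(names) and all(name in ALLOWED_COMMANDS for name in names)
```

Checking every command in a pipeline or chain, rather than only the first word, is what keeps something like `ls && rm -rf /` from slipping through on the strength of `ls`.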
--- a/reference/SETUP.md
+++ b/reference/SETUP.md
@@ -0,0 +1,99 @@
+# Autonomous Coding Agent Setup
+
+This autonomous coding agent now uses the **Claude Code CLI directly** instead of the Python SDK.
+
+## Prerequisites
+
+1. **Claude Code** must be installed on your system
+2. You must authenticate Claude Code for **headless mode** (the `--print` flag)
+
+## Authentication Setup
+
+The `--print` (headless) mode requires a long-lived authentication token. To set this up:
+
+### Option 1: Setup Token (Recommended)
+
+Run this command in your own terminal (requires Claude subscription):
+
+```bash
+claude setup-token
+```
+
+This will open your browser and authenticate Claude Code for headless usage.
+
+### Option 2: Use API Key
+
+If you have an Anthropic API key instead:
+
+```bash
+export ANTHROPIC_API_KEY='your-api-key-here'
+```
+
+Or for OAuth tokens:
+
+```bash
+export CLAUDE_CODE_OAUTH_TOKEN='your-oauth-token-here'
+```
+
+## Usage
+
+Once authenticated, run:
+
+```bash
+python3 autonomous_agent_demo.py --project-dir ./my_project --max-iterations 3
+```
+
+### Options:
+
+- `--project-dir`: Directory for your project (default: `./autonomous_demo_project`)
+- `--max-iterations`: Maximum number of agent iterations (default: unlimited)
+- `--model`: Claude model to use (default: `claude-opus-4-5-20251101`, Opus 4.5)
+
+### Examples:
+
+```bash
+# Start a new project with Opus 4.5
+python3 autonomous_agent_demo.py --project-dir ./my_app
+
+# Limit iterations for testing
+python3 autonomous_agent_demo.py --project-dir ./my_app --max-iterations 5
+
+# Use a different model
+python3 autonomous_agent_demo.py --project-dir ./my_app --model sonnet
+```
+
+## How It Works
+
+The agent:
+
+1. Creates configuration files (`.claude_settings.json`, `.mcp_config.json`)
+2. Calls `claude --print` with your prompt
+3. Captures the output and continues the autonomous loop
+4. Uses your existing Claude Code authentication
+
+## Troubleshooting
+
+### "Invalid API key" Error
+
+This means Claude Code isn't authenticated for headless mode. Run:
+
+```bash
+claude setup-token
+```
+
+### Check Authentication Status
+
+Test if headless mode works:
+
+```bash
+echo "Hello" | claude --print --model opus
+```
+
+If this works, the autonomous agent will work too.
+
+### Still Having Issues?
+
+1. Make sure Claude Code is installed: `claude --version`
+2. Check that you can run Claude normally: `claude`
+3. Verify `claude` is in your PATH: `which claude`
+4. Try re-authenticating: `claude setup-token`
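
The troubleshooting checklist in SETUP.md can be approximated by a small preflight check run before starting the agent. This is a hedged sketch: `preflight_problems` is a hypothetical helper that is not part of this commit, and it only looks at the two environment variables that `autonomous_agent_demo.py` checks, so it cannot see credentials stored by Claude Code itself.

```python
import os
import shutil
from typing import Callable, Mapping, Optional

# The two variables autonomous_agent_demo.py accepts as proof of auth.
AUTH_ENV_VARS = ("ANTHROPIC_API_KEY", "CLAUDE_CODE_OAUTH_TOKEN")


def preflight_problems(
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    if which("claude") is None:
        problems.append("`claude` was not found on PATH")
    if not any(env.get(name) for name in AUTH_ENV_VARS):
        problems.append(
            "no auth variable set: export ANTHROPIC_API_KEY or CLAUDE_CODE_OAUTH_TOKEN"
        )
    return problems


if __name__ == "__main__":
    for problem in preflight_problems():
        print(f"Preflight: {problem}")
```

Passing `env` and `which` as parameters keeps the check easy to exercise without touching the real environment.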
--- a/reference/agent.py
+++ b/reference/agent.py
@@ -0,0 +1,206 @@
+"""
+Agent Session Logic
+===================
+
+Core agent interaction functions for running autonomous coding sessions.
+"""
+
+import asyncio
+from pathlib import Path
+from typing import Optional
+
+from claude_code_sdk import ClaudeSDKClient
+
+from client import create_client
+from progress import print_session_header, print_progress_summary
+from prompts import get_initializer_prompt, get_coding_prompt, copy_spec_to_project
+
+
+# Configuration
+AUTO_CONTINUE_DELAY_SECONDS = 3
+
+
+async def run_agent_session(
+    client: ClaudeSDKClient,
+    message: str,
+    project_dir: Path,
+) -> tuple[str, str]:
+    """
+    Run a single agent session using Claude Agent SDK.
+
+    Args:
+        client: Claude SDK client
+        message: The prompt to send
+        project_dir: Project directory path
+
+    Returns:
+        (status, response_text) where status is:
+        - "continue" if agent should continue working
+        - "error" if an error occurred
+    """
+    print("Sending prompt to Claude Agent SDK...\n")
+
+    try:
+        # Send the query
+        await client.query(message)
+
+        # Collect response text and show tool use
+        response_text = ""
+        async for msg in client.receive_response():
+            msg_type = type(msg).__name__
+
+            # Handle AssistantMessage (text and tool use)
+            if msg_type == "AssistantMessage" and hasattr(msg, "content"):
+                for block in msg.content:
+                    block_type = type(block).__name__
+
+                    if block_type == "TextBlock" and hasattr(block, "text"):
+                        response_text += block.text
+                        print(block.text, end="", flush=True)
+                    elif block_type == "ToolUseBlock" and hasattr(block, "name"):
+                        print(f"\n[Tool: {block.name}]", flush=True)
+                        if hasattr(block, "input"):
+                            input_str = str(block.input)
+                            if len(input_str) > 200:
+                                print(f"   Input: {input_str[:200]}...", flush=True)
+                            else:
+                                print(f"   Input: {input_str}", flush=True)
+
+            # Handle UserMessage (tool results)
+            elif msg_type == "UserMessage" and hasattr(msg, "content"):
+                for block in msg.content:
+                    block_type = type(block).__name__
+
+                    if block_type == "ToolResultBlock":
+                        result_content = getattr(block, "content", "")
+                        is_error = getattr(block, "is_error", False)
+
+                        # Check if command was blocked by security hook
+                        if "blocked" in str(result_content).lower():
+                            print(f"   [BLOCKED] {result_content}", flush=True)
+                        elif is_error:
+                            # Show errors (truncated)
+                            error_str = str(result_content)[:500]
+                            print(f"   [Error] {error_str}", flush=True)
+                        else:
+                            # Tool succeeded - just show brief confirmation
+                            print("   [Done]", flush=True)
+
+        print("\n" + "-" * 70 + "\n")
+        return "continue", response_text
+
+    except Exception as e:
+        print(f"Error during agent session: {e}")
+        return "error", str(e)
+
+
+async def run_autonomous_agent(
+    project_dir: Path,
+    model: str,
+    max_iterations: Optional[int] = None,
+) -> None:
+    """
+    Run the autonomous agent loop.
+
+    Args:
+        project_dir: Directory for the project
+        model: Claude model to use
+        max_iterations: Maximum number of iterations (None for unlimited)
+    """
+    print("\n" + "=" * 70)
+    print("  AUTONOMOUS CODING AGENT DEMO")
+    print("=" * 70)
+    print(f"\nProject directory: {project_dir}")
+    print(f"Model: {model}")
+    if max_iterations:
+        print(f"Max iterations: {max_iterations}")
+    else:
+        print("Max iterations: Unlimited (will run until completion)")
+    print()
+
+    # Create project directory
+    project_dir.mkdir(parents=True, exist_ok=True)
+
+    # Check if this is a fresh start or continuation
+    tests_file = project_dir / "feature_list.json"
+    is_first_run = not tests_file.exists()
+
+    if is_first_run:
+        print("Fresh start - will use initializer agent")
+        print()
+        print("=" * 70)
+        print("  NOTE: First session takes 10-20+ minutes!")
+        print("  The agent is generating 200 detailed test cases.")
+        print("  This may appear to hang - it's working. Watch for [Tool: ...] output.")
+        print("=" * 70)
+        print()
+        # Copy the app spec into the project directory for the agent to read
+        copy_spec_to_project(project_dir)
+    else:
+        print("Continuing existing project")
+        print_progress_summary(project_dir)
+
+    # Main loop
+    iteration = 0
+
+    while True:
+        iteration += 1
+
+        # Check max iterations
+        if max_iterations and iteration > max_iterations:
+            print(f"\nReached max iterations ({max_iterations})")
+            print("To continue, run the script again without --max-iterations")
+            break
+
+        # Print session header
+        print_session_header(iteration, is_first_run)
+
+        # Create client (fresh context)
+        client = create_client(project_dir, model)
+
+        # Choose prompt based on session type
+        if is_first_run:
+            prompt = get_initializer_prompt()
+            is_first_run = False  # Only use initializer once
+        else:
+            prompt = get_coding_prompt()
+
+        # Run session with async context manager
+        async with client:
+            status, response = await run_agent_session(client, prompt, project_dir)
+
+        # Handle status
+        if status == "continue":
+            print(f"\nAgent will auto-continue in {AUTO_CONTINUE_DELAY_SECONDS}s...")
+            print_progress_summary(project_dir)
+            await asyncio.sleep(AUTO_CONTINUE_DELAY_SECONDS)
+
+        elif status == "error":
+            print("\nSession encountered an error")
+            print("Will retry with a fresh session...")
+            await asyncio.sleep(AUTO_CONTINUE_DELAY_SECONDS)
+
+        # Small delay between sessions
+        if max_iterations is None or iteration < max_iterations:
+            print("\nPreparing next session...\n")
+            await asyncio.sleep(1)
+
+    # Final summary
+    print("\n" + "=" * 70)
+    print("  SESSION COMPLETE")
+    print("=" * 70)
+    print(f"\nProject directory: {project_dir}")
+    print_progress_summary(project_dir)
+
+    # Print instructions for running the generated application
+    print("\n" + "-" * 70)
+    print("  TO RUN THE GENERATED APPLICATION:")
+    print("-" * 70)
+    print(f"\n  cd {project_dir.resolve()}")
+    print("  ./init.sh           # Run the setup script")
+    print("  # Or manually:")
+    print("  npm install && npm run dev")
+    print("\n  Then open http://localhost:3000 (or check init.sh for the URL)")
+    print("-" * 70)
+
+    print("\nDone!")
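
In `run_autonomous_agent`, the initializer prompt is chosen once, based on whether `feature_list.json` exists when the script starts. One possible variation, sketched below, re-derives the session type from the file on every iteration, so an initializer session that fails before writing the file would be retried rather than followed by a coding session. `select_session_type` is a hypothetical helper, not part of this commit.

```python
import tempfile
from pathlib import Path


def select_session_type(project_dir: Path) -> str:
    """Return "initializer" until feature_list.json exists, then "coding"."""
    feature_file = project_dir / "feature_list.json"
    return "coding" if feature_file.exists() else "initializer"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp)
        print(select_session_type(project))  # no feature list yet
        (project / "feature_list.json").write_text("[]")
        print(select_session_type(project))  # feature list now present
```

Because the decision depends only on the file system, it stays correct across restarts and `Ctrl+C` interruptions without any extra state.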
--- a/reference/autonomous_agent_demo.py
+++ b/reference/autonomous_agent_demo.py
@@ -0,0 +1,123 @@
+#!/usr/bin/env python3
+"""
+Autonomous Coding Agent Demo
+============================
+
+A minimal harness demonstrating long-running autonomous coding with Claude.
+This script implements the two-agent pattern (initializer + coding agent) and
+incorporates all the strategies from the long-running agents guide.
+
+Example Usage:
+    python autonomous_agent_demo.py --project-dir ./claude_clone_demo
+    python autonomous_agent_demo.py --project-dir ./claude_clone_demo --max-iterations 5
+"""
+
+import argparse
+import asyncio
+import os
+from pathlib import Path
+
+from agent import run_autonomous_agent
+
+
+# Configuration
+# DEFAULT_MODEL = "claude-haiku-4-5-20251001"
+# DEFAULT_MODEL = "claude-sonnet-4-5-20250929"
+DEFAULT_MODEL = "claude-opus-4-5-20251101"
+
+
+def parse_args() -> argparse.Namespace:
+    """Parse command line arguments."""
+    parser = argparse.ArgumentParser(
+        description="Autonomous Coding Agent Demo - Long-running agent harness",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Examples:
+  # Start fresh project
+  python autonomous_agent_demo.py --project-dir ./claude_clone
+
+  # Use a specific model
+  python autonomous_agent_demo.py --project-dir ./claude_clone --model claude-sonnet-4-5-20250929
+
+  # Limit iterations for testing
+  python autonomous_agent_demo.py --project-dir ./claude_clone --max-iterations 5
+
+  # Continue existing project
+  python autonomous_agent_demo.py --project-dir ./claude_clone
+
+Environment Variables:
+  ANTHROPIC_API_KEY    Your Anthropic API key (or set CLAUDE_CODE_OAUTH_TOKEN instead)
+        """,
+    )
+
+    parser.add_argument(
+        "--project-dir",
+        type=Path,
+        default=Path("./autonomous_demo_project"),
+        help="Directory for the project (default: generations/autonomous_demo_project). Relative paths are automatically placed under the generations/ directory.",
+    )
+
+    parser.add_argument(
+        "--max-iterations",
+        type=int,
+        default=None,
+        help="Maximum number of agent iterations (default: unlimited)",
+    )
+
+    parser.add_argument(
+        "--model",
+        type=str,
+        default=DEFAULT_MODEL,
+        help=f"Claude model to use (default: {DEFAULT_MODEL})",
+    )
+
+    return parser.parse_args()
+
+
+def main() -> None:
+    """Main entry point."""
+    args = parse_args()
+
+    # Check for auth: allow either API key or Claude Code auth token
+    has_api_key = bool(os.environ.get("ANTHROPIC_API_KEY"))
+    has_oauth_token = bool(os.environ.get("CLAUDE_CODE_OAUTH_TOKEN"))
+
+    if not (has_api_key or has_oauth_token):
+        print("Error: No Claude auth configured.")
+        print("\nSet ONE of the following:")
+        print("  # Standard API key from console.anthropic.com")
+        print("  export ANTHROPIC_API_KEY='your-api-key-here'")
+        print("\n  # Or, your Claude Code auth token (from `claude setup-token`)")
+        print("  export CLAUDE_CODE_OAUTH_TOKEN='your-claude-code-auth-token'")
+        return
+
+    # Automatically place projects in generations/ directory unless already specified
+    project_dir = args.project_dir
+    if not str(project_dir).startswith("generations/"):
+        # Convert relative paths to be under generations/
+        if project_dir.is_absolute():
+            # If absolute path, use as-is
+            pass
+        else:
+            # Prepend generations/ to relative paths
+            project_dir = Path("generations") / project_dir
+
+    # Run the agent
+    try:
+        asyncio.run(
+            run_autonomous_agent(
+                project_dir=project_dir,
+                model=args.model,
+                max_iterations=args.max_iterations,
+            )
+        )
+    except KeyboardInterrupt:
+        print("\n\nInterrupted by user")
+        print("To resume, run the same command again")
+    except Exception as e:
+        print(f"\nFatal error: {e}")
+        raise
+
+
+if __name__ == "__main__":
+    main()
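
The path handling in `main()` (absolute paths kept as-is, relative paths moved under `generations/`) can be expressed as a small pure function. This is a sketch, not the code in this commit: `resolve_project_dir` and `GENERATIONS_DIR` are hypothetical names, and comparing path components instead of string prefixes is a swapped-in technique that also leaves a bare `generations` argument untouched.

```python
from pathlib import Path

# Hypothetical constant naming the directory that main() prepends.
GENERATIONS_DIR = Path("generations")


def resolve_project_dir(project_dir: Path) -> Path:
    """Place relative project paths under generations/ unless already there."""
    if project_dir.is_absolute():
        return project_dir  # absolute paths are used as given
    if project_dir.parts[:1] == GENERATIONS_DIR.parts:
        return project_dir  # already under generations/
    return GENERATIONS_DIR / project_dir
```

Working on `Path.parts` avoids depending on the platform's path separator, which a `str.startswith("generations/")` check does.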
--- a/reference/client.py
+++ b/reference/client.py
@@ -0,0 +1,130 @@
+"""
+Claude SDK Client Configuration
+===============================
+
+Functions for creating and configuring the Claude Agent SDK client.
+"""
+
+import json
+import os
+from pathlib import Path
+
+from claude_code_sdk import ClaudeCodeOptions, ClaudeSDKClient
+from claude_code_sdk.types import HookMatcher
+
+from security import bash_security_hook
+
+
+# Puppeteer MCP tools for browser automation
+PUPPETEER_TOOLS = [
+    "mcp__puppeteer__puppeteer_navigate",
+    "mcp__puppeteer__puppeteer_screenshot",
+    "mcp__puppeteer__puppeteer_click",
+    "mcp__puppeteer__puppeteer_fill",
+    "mcp__puppeteer__puppeteer_select",
+    "mcp__puppeteer__puppeteer_hover",
+    "mcp__puppeteer__puppeteer_evaluate",
+]
+
+# Built-in tools
+BUILTIN_TOOLS = [
+    "Read",
+    "Write",
+    "Edit",
+    "Glob",
+    "Grep",
+    "Bash",
+]
+
+
+def create_client(project_dir: Path, model: str) -> ClaudeSDKClient:
+    """Create a Claude Agent SDK client with multi-layered security.
+
+    Auth options
+    ------------
+    This demo supports two ways of authenticating:
+    1. API key via ``ANTHROPIC_API_KEY`` (standard Claude API key)
+    2. Claude Code auth token via ``CLAUDE_CODE_OAUTH_TOKEN``
+
+    If neither is set, client creation will fail with a clear error.
+
+    Args:
+        project_dir: Directory for the project
+        model: Claude model to use
+
+    Returns:
+        Configured ClaudeSDKClient
+
+    Security layers (defense in depth):
+    1. Sandbox - OS-level bash command isolation prevents filesystem escape
+    2. Permissions - File operations restricted to project_dir only
+    3. Security hooks - Bash commands validated against an allowlist
+       (see security.py for ALLOWED_COMMANDS)
+    """
+    api_key = os.environ.get("ANTHROPIC_API_KEY")
+    oauth_token = os.environ.get("CLAUDE_CODE_OAUTH_TOKEN")
+    if not api_key and not oauth_token:
+        raise ValueError(
+            "No Claude auth configured. Set either ANTHROPIC_API_KEY (Claude API key) "
+            "or CLAUDE_CODE_OAUTH_TOKEN (Claude Code auth token from `claude setup-token`)."
+        )
+
+    # Create comprehensive security settings
+    # Note: Using relative paths ("./**") restricts access to project directory
+    # since cwd is set to project_dir
+    security_settings = {
+        "sandbox": {"enabled": True, "autoAllowBashIfSandboxed": True},
+        "permissions": {
+            "defaultMode": "acceptEdits",  # Auto-approve edits within allowed directories
+            "allow": [
+                # Allow all file operations within the project directory
+                "Read(./**)",
+                "Write(./**)",
+                "Edit(./**)",
+                "Glob(./**)",
+                "Grep(./**)",
+                # Bash permission granted here, but actual commands are validated
+                # by the bash_security_hook (see security.py for allowed commands)
+                "Bash(*)",
+                # Allow Puppeteer MCP tools for browser automation
+                *PUPPETEER_TOOLS,
+            ],
+        },
+    }
+
+    # Ensure project directory exists before creating settings file
+    project_dir.mkdir(parents=True, exist_ok=True)
+
+    # Write settings to a file in the project directory
+    settings_file = project_dir / ".claude_settings.json"
+    with open(settings_file, "w") as f:
+        json.dump(security_settings, f, indent=2)
+
+    print(f"Created security settings at {settings_file}")
+    print("   - Sandbox enabled (OS-level bash isolation)")
+    print(f"   - Filesystem restricted to: {project_dir.resolve()}")
+    print("   - Bash commands restricted to allowlist (see security.py)")
+    print("   - MCP servers: puppeteer (browser automation)")
+    print()
+
+    return ClaudeSDKClient(
+        options=ClaudeCodeOptions(
+            model=model,
+            system_prompt="You are an expert full-stack developer building a production-quality web application.",
+            allowed_tools=[
+                *BUILTIN_TOOLS,
+                *PUPPETEER_TOOLS,
+            ],
+            mcp_servers={
+                "puppeteer": {"command": "npx", "args": ["puppeteer-mcp-server"]}
+            },
+            hooks={
+                "PreToolUse": [
+                    HookMatcher(matcher="Bash", hooks=[bash_security_hook]),
+                ],
+            },
+            max_turns=1000,
+            cwd=str(project_dir.resolve()),
+            settings=str(settings_file.resolve()),  # Use absolute path
+        )
+    )
--- a/reference/progress.py
+++ b/reference/progress.py
@@ -0,0 +1,57 @@
+"""
+Progress Tracking Utilities
+===========================
+
+Functions for tracking and displaying progress of the autonomous coding agent.
+"""
+
+import json
+from pathlib import Path
+
+
+def count_passing_tests(project_dir: Path) -> tuple[int, int]:
+    """
+    Count passing and total tests in feature_list.json.
+
+    Args:
+        project_dir: Directory containing feature_list.json
+
+    Returns:
+        (passing_count, total_count)
+    """
+    tests_file = project_dir / "feature_list.json"
+
+    if not tests_file.exists():
+        return 0, 0
+
+    try:
+        with open(tests_file, "r") as f:
+            tests = json.load(f)
+
+        total = len(tests)
+        passing = sum(1 for test in tests if test.get("passes", False))
+
+        return passing, total
+    except (json.JSONDecodeError, IOError):
+        return 0, 0
+
+
+def print_session_header(session_num: int, is_initializer: bool) -> None:
+    """Print a formatted header for the session."""
+    session_type = "INITIALIZER" if is_initializer else "CODING AGENT"
+
+    print("\n" + "=" * 70)
+    print(f"  SESSION {session_num}: {session_type}")
+    print("=" * 70)
+    print()
+
+
+def print_progress_summary(project_dir: Path) -> None:
+    """Print a summary of current progress."""
+    passing, total = count_passing_tests(project_dir)
+
+    if total > 0:
+        percentage = (passing / total) * 100
+        print(f"\nProgress: {passing}/{total} tests passing ({percentage:.1f}%)")
+    else:
+        print("\nProgress: feature_list.json not yet created")
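
To make the progress calculation concrete, the sketch below counts passing entries in a tiny sample file. Only the `passes` field is confirmed by `progress.py`; the `description` and `steps` fields in the sample are assumptions about what the initializer writes, and `summarize_features` is a hypothetical, slightly more defensive variant of `count_passing_tests` that ignores entries that are not JSON objects.

```python
import json
import tempfile
from pathlib import Path


def summarize_features(feature_file: Path) -> tuple[int, int]:
    """Return (passing, total) for a feature list, or (0, 0) if unreadable."""
    try:
        data = json.loads(feature_file.read_text())
    except (OSError, json.JSONDecodeError):
        return 0, 0
    if not isinstance(data, list):
        return 0, 0
    entries = [entry for entry in data if isinstance(entry, dict)]
    passing = sum(1 for entry in entries if entry.get("passes") is True)
    return passing, len(entries)


# Sample data; every field except "passes" is an assumption for illustration.
SAMPLE_FEATURES = [
    {"description": "User can send a message", "steps": ["type", "send"], "passes": True},
    {"description": "User can rename a chat", "steps": ["open menu"], "passes": False},
    {"description": "User can delete a chat", "steps": ["open menu"]},
]

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "feature_list.json"
        path.write_text(json.dumps(SAMPLE_FEATURES))
        passing, total = summarize_features(path)
        print(f"Progress: {passing}/{total} tests passing")
```

An entry with no `passes` key counts as not passing, matching the `test.get("passes", False)` default in `count_passing_tests`.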
--- a/reference/prompts.py
+++ b/reference/prompts.py
@@ -0,0 +1,37 @@
+"""
+Prompt Loading Utilities
+========================
+
+Functions for loading prompt templates from the prompts directory.
+"""
+
+import shutil
+from pathlib import Path
+
+
+PROMPTS_DIR = Path(__file__).parent / "prompts"
+
+
+def load_prompt(name: str) -> str:
+    """Load a prompt template from the prompts directory."""
+    prompt_path = PROMPTS_DIR / f"{name}.md"
+    return prompt_path.read_text()
+
+
+def get_initializer_prompt() -> str:
+    """Load the initializer prompt."""
+    return load_prompt("initializer_prompt")
+
+
+def get_coding_prompt() -> str:
+    """Load the coding agent prompt."""
+    return load_prompt("coding_prompt")
+
+
+def copy_spec_to_project(project_dir: Path) -> None:
+    """Copy the app spec file into the project directory for the agent to read."""
+    spec_source = PROMPTS_DIR / "app_spec.txt"
+    spec_dest = project_dir / "app_spec.txt"
+    if not spec_dest.exists():
+        shutil.copy(spec_source, spec_dest)
+        print("Copied app_spec.txt to project directory")
--- a/reference/prompts/app_spec.txt
+++ b/reference/prompts/app_spec.txt
--- a/reference/prompts/coding_prompt.md
+++ b/reference/prompts/coding_prompt.md
@@ -0,0 +1,291 @@
+## YOUR ROLE - CODING AGENT
+
+You are continuing work on a long-running autonomous development task.
+This is a FRESH context window - you have no memory of previous sessions.
+
+### STEP 1: GET YOUR BEARINGS (MANDATORY)
+
+Start by orienting yourself:
+
+```bash
+# 1. See your working directory
+pwd
+
+# 2. List files to understand project structure
+ls -la
+
+# 3. Read the project specification to understand what you're building
+cat app_spec.txt
+
+# 4. Read the feature list to see all work
+cat feature_list.json | head -50
+
+# 5. Read progress notes from previous sessions
+cat claude-progress.txt
+
+# 6. Check recent git history
+git log --oneline -20
+
+# 7. Count remaining tests
+cat feature_list.json | grep '"passes": false' | wc -l
+```
+
+Understanding `app_spec.txt` is critical - it contains the full requirements
+for the application you're building.
+
+### STEP 2: START SERVERS (IF NOT RUNNING)
+
+If `init.sh` exists, run it:
+
+```bash
+chmod +x init.sh
+./init.sh
+```
+
+Otherwise, start servers manually and document the process.
+
+### STEP 3: VERIFICATION TEST (CRITICAL!)
+
+**MANDATORY BEFORE NEW WORK:**
+
+The previous session may have introduced bugs. Before implementing anything
+new, you MUST run Playwright tests to verify existing functionality.
+
+```bash
+# Run all existing Playwright tests
+npx playwright test
+
+# Or run tests for a specific feature
+npx playwright test tests/[feature-name].spec.ts
+```
+
+If Playwright tests don't exist yet, create them in a `tests/` directory before proceeding.
+
+**If any tests fail:**
+
+- Mark that feature as "passes": false immediately in feature_list.json
+- Fix all failing tests BEFORE moving to new features
+- This includes UI bugs like:
+  - White-on-white text or poor contrast
+  - Random characters displayed
+  - Incorrect timestamps
+  - Layout issues or overflow
+  - Buttons too close together
+  - Missing hover states
+  - Console errors
+
+### STEP 4: CHOOSE ONE FEATURE TO IMPLEMENT
+
+Look at feature_list.json and find the highest-priority feature with "passes": false.
+
+Focus on completing one feature perfectly and completing its testing steps in this session before moving on to other features.
+It's ok if you only complete one feature in this session, as there will be more sessions later that continue to make progress.
+
+### STEP 5: IMPLEMENT THE FEATURE
+
+Implement the chosen feature thoroughly:
+
+1. Write the code (frontend and/or backend as needed)
+2. Write a Playwright happy path test for the feature (see Step 6)
+3. Run the test and fix any issues discovered
+4. Verify all tests pass before moving on
+
+### STEP 6: VERIFY WITH PLAYWRIGHT TESTS
+
+**CRITICAL:** You MUST verify features by writing and running Playwright tests.
+
+**Write Happy Path Tests:**
+
+For each feature, write a Playwright test that covers the happy path - the main user flow that should work correctly. These tests are fast to run and provide quick feedback.
+
+```bash
+# Example: Create test file
+# tests/[feature-name].spec.ts
+
+# Run the specific test
+npx playwright test tests/[feature-name].spec.ts
+
+# Run with headed mode to see the browser (useful for debugging)
+npx playwright test tests/[feature-name].spec.ts --headed
+```
+
+**Test Structure (example):**
+
+```typescript
+import { test, expect } from "@playwright/test";
+
+test("user can send a message and receive response", async ({ page }) => {
+  await page.goto("http://localhost:3000");
+
+  // Happy path: main user flow
+  await page.fill('[data-testid="message-input"]', "Hello world");
+  await page.click('[data-testid="send-button"]');
+
+  // Verify the expected outcome
+  await expect(page.locator('[data-testid="message-list"]')).toContainText(
+    "Hello world"
+  );
+});
+```
+
+**DO:**
+
+- Write tests that cover the primary user workflow (happy path)
+- Use `data-testid` attributes for reliable selectors
+- Run tests frequently during development
+- Keep tests fast and focused
+
+**DON'T:**
+
+- Only test with curl commands (backend testing alone is insufficient)
+- Write overly complex tests with many edge cases initially
+- Skip running tests before marking features as passing
+- Mark tests passing without all Playwright tests green
+- Increase any Playwright timeouts past 10s
+
+### STEP 7: UPDATE feature_list.json (CAREFULLY!)
+
+**YOU CAN ONLY MODIFY ONE FIELD: "passes"**
+
+After thorough verification, change:
+
+```json
+"passes": false
+```
+
+to:
+
+```json
+"passes": true
+```
+
+**NEVER:**
+
+- Remove tests
+- Edit test descriptions
+- Modify test steps
+- Combine or consolidate tests
+- Reorder tests
+
+**ONLY CHANGE "passes" FIELD AFTER ALL PLAYWRIGHT TESTS PASS.**
+
+### STEP 8: COMMIT YOUR PROGRESS
+
+Make a descriptive git commit:
+
+```bash
+git add .
+git commit -m "Implement [feature name] - verified with Playwright tests
+
+- Added [specific changes]
+- Added/updated Playwright tests in tests/
+- All tests passing
+- Updated feature_list.json: marked test #X as passing
+"
+git push origin main
+```
+
+### STEP 9: UPDATE PROGRESS NOTES
+
+Update `claude-progress.txt` with:
+
+- What you accomplished this session
+- Which test(s) you completed
+- Any issues discovered or fixed
+- What should be worked on next
+- Current completion status (e.g., "45/200 tests passing")
+
+### STEP 10: END SESSION CLEANLY
+
+Before context fills up:
+
+1. Commit all working code
+2. Update claude-progress.txt
+3. Update feature_list.json if tests verified
+4. Ensure no uncommitted changes
+5. Leave app in working state (no broken features)
+
+---
+
+## TESTING REQUIREMENTS
+
+**ALL testing must use Playwright tests.**
+
+**Setup (if not already done):**
+
+```bash
+# Install Playwright
+npm install -D @playwright/test
+
+# Install browsers
+npx playwright install
+```
+
+**Writing Tests:**
+
+Create tests in the `tests/` directory with `.spec.ts` extension.
+
+```typescript
+// tests/example.spec.ts
+import { test, expect } from "@playwright/test";
+
+test.describe("Feature Name", () => {
+  test("happy path: user completes main workflow", async ({ page }) => {
+    await page.goto("http://localhost:3000");
+
+    // Interact with UI elements
+    await page.click('button[data-testid="action"]');
+    await page.fill('input[data-testid="input"]', "test value");
+
+    // Assert expected outcomes
+    await expect(page.locator('[data-testid="result"]')).toBeVisible();
+  });
+});
+```
+
+**Running Tests:**
+
+```bash
+# Run all tests (fast, headless)
+npx playwright test
+
+# Run specific test file
+npx playwright test tests/feature.spec.ts
+
+# Run with browser visible (for debugging)
+npx playwright test --headed
+
+# Run with UI mode (interactive debugging)
+npx playwright test --ui
+```
+
+**Best Practices:**
+
+- Add `data-testid` attributes to elements for reliable selectors
+- Focus on happy path tests first - they're fast and catch most regressions
+- Keep tests independent and isolated
+- Write tests as you implement features, not after
+
+---
+
+## IMPORTANT REMINDERS
+
+**Your Goal:** Production-quality application with all 200+ tests passing
+
+**This Session's Goal:** Complete at least one feature perfectly
+
+**Priority:** Fix broken tests before implementing new features
+
+**Quality Bar:**
+
+- Zero console errors
+- Polished UI matching the design specified in app_spec.txt (treat the landing page and the generate page as the reference for how the finished design should look)
+- All features work end-to-end through the UI
+- Fast, responsive, professional
+
+**You have unlimited time.** Take as long as needed to get it right. The most important thing is that you
+leave the codebase in a clean state before terminating the session (Step 10).
+
+---
+
+Begin by running Step 1 (Get Your Bearings).
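Step 7 of the prompt above permits changes to a single field, and Step 9 asks for a completion count. The following is a minimal Python sketch of both rules, assuming `feature_list.json` has already been loaded into a list of dicts with the fields shown in the initializer prompt. The helper names `mark_feature_passing`, `only_passes_changed`, and `progress_summary` are hypothetical and are not part of this commit.

```python
import copy


def mark_feature_passing(features: list[dict], index: int) -> list[dict]:
    """Return a copy of the list with only features[index]["passes"] set to True."""
    updated = copy.deepcopy(features)
    updated[index]["passes"] = True
    return updated


def only_passes_changed(before: list[dict], after: list[dict]) -> bool:
    """Check that the two lists differ in nothing except "passes" values."""
    if len(before) != len(after):
        return False  # a test was removed or added
    for old, new in zip(before, after):
        old_rest = {k: v for k, v in old.items() if k != "passes"}
        new_rest = {k: v for k, v in new.items() if k != "passes"}
        if old_rest != new_rest:
            return False  # category, description, steps, or order changed
    return True


def progress_summary(features: list[dict]) -> str:
    """Format the completion status line used in the progress notes."""
    passing = sum(1 for feature in features if feature.get("passes"))
    return f"{passing}/{len(features)} tests passing"
```

A harness could, for example, call `only_passes_changed` with the previous commit's copy of the file before accepting a session's edits; whether that check belongs in the harness or in the prompt is a design choice this commit does not settle.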
--- a/reference/prompts/initializer_prompt.md
+++ b/reference/prompts/initializer_prompt.md
@@ -0,0 +1,106 @@
+## YOUR ROLE - INITIALIZER AGENT (Session 1 of Many)
+
+You are the FIRST agent in a long-running autonomous development process.
+Your job is to set up the foundation for all future coding agents.
+
+### FIRST: Read the Project Specification
+
+Start by reading `app_spec.txt` in your working directory. This file contains
+the complete specification for what you need to build. Read it carefully
+before proceeding.
+
+### CRITICAL FIRST TASK: Create feature_list.json
+
+Based on `app_spec.txt`, create a file called `feature_list.json` with at least 200 detailed
+end-to-end test cases. This file is the single source of truth for what
+needs to be built.
+
+**Format:**
+```json
+[
+  {
+    "category": "functional",
+    "description": "Brief description of the feature and what this test verifies",
+    "steps": [
+      "Step 1: Navigate to relevant page",
+      "Step 2: Perform action",
+      "Step 3: Verify expected result"
+    ],
+    "passes": false
+  },
+  {
+    "category": "style",
+    "description": "Brief description of UI/UX requirement",
+    "steps": [
+      "Step 1: Navigate to page",
+      "Step 2: Take screenshot",
+      "Step 3: Verify visual requirements"
+    ],
+    "passes": false
+  }
+]
+```
+
+**Requirements for feature_list.json:**
+- Minimum 200 features total with testing steps for each
+- Both "functional" and "style" categories
+- Mix of narrow tests (2-5 steps) and comprehensive tests (10+ steps)
+- At least 25 tests MUST have 10+ steps each
+- Order features by priority: fundamental features first
+- ALL tests start with "passes": false
+- Cover every feature in the spec exhaustively
+
+**CRITICAL INSTRUCTION:**
+IT IS CATASTROPHIC TO REMOVE OR EDIT FEATURES IN FUTURE SESSIONS.
+Features can ONLY be marked as passing (change "passes": false to "passes": true).
+Never remove features, never edit descriptions, never modify testing steps.
+This ensures no functionality is missed.
+
+### SECOND TASK: Create init.sh
+
+Create a script called `init.sh` that future agents can use to quickly
+set up and run the development environment. The script should:
+
+1. Install any required dependencies
+2. Start any necessary servers or services
+3. Print helpful information about how to access the running application
+
+Base the script on the technology stack specified in `app_spec.txt`.
+
+### THIRD TASK: Initialize Git
+
+Create a git repository and make your first commit with:
+- feature_list.json (complete with all 200+ features)
+- init.sh (environment setup script)
+- README.md (project overview and setup instructions)
+
+Commit message: "Initial setup: feature_list.json, init.sh, and project structure"
+
+### FOURTH TASK: Create Project Structure
+
+Set up the basic project structure based on what's specified in `app_spec.txt`.
+This typically includes directories for frontend, backend, and any other
+components mentioned in the spec.
+
+### OPTIONAL: Start Implementation
+
+If you have time remaining in this session, you may begin implementing
+the highest-priority features from feature_list.json. Remember:
+- Work on ONE feature at a time
+- Test thoroughly before marking "passes": true
+- Commit your progress before session ends
+
+### ENDING THIS SESSION
+
+Before your context fills up:
+1. Commit all work with descriptive messages
+2. Create `claude-progress.txt` with a summary of what you accomplished
+3. Ensure feature_list.json is complete and saved
+4. Leave the environment in a clean, working state
+
+The next agent will continue from here with a fresh context window.
+
+---
+
+**Remember:** You have unlimited time across many sessions. Focus on
+quality over speed. Production-ready is the goal.
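The requirements the initializer prompt places on `feature_list.json` are mechanical enough to check in code. The following is a hedged Python sketch, assuming the file has been parsed into a list of dicts in the format shown above. The function name `check_feature_list` is hypothetical, and the check covers only the countable requirements (size, categories, long tests, initial `passes` values), not priority ordering or spec coverage.

```python
def check_feature_list(features: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []

    # Minimum 200 features total
    if len(features) < 200:
        problems.append(f"expected at least 200 features, found {len(features)}")

    # Both "functional" and "style" categories must appear
    categories = {feature.get("category") for feature in features}
    for required in ("functional", "style"):
        if required not in categories:
            problems.append(f"no features in category {required!r}")

    # At least 25 tests must have 10+ steps each
    long_tests = sum(1 for feature in features if len(feature.get("steps", [])) >= 10)
    if long_tests < 25:
        problems.append(f"expected at least 25 tests with 10+ steps, found {long_tests}")

    # ALL tests start with "passes": false
    if any(feature.get("passes") is not False for feature in features):
        problems.append('every feature must start with "passes": false')

    return problems
```

Returning a list of messages instead of raising on the first failure lets a caller report every problem in one pass.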
--- a/reference/requirements.txt
+++ b/reference/requirements.txt
@@ -0,0 +1 @@
+claude-code-sdk>=0.0.25
--- a/reference/security.py
+++ b/reference/security.py
@@ -0,0 +1,370 @@
+"""
+Security Hooks for Autonomous Coding Agent
+==========================================
+
+Pre-tool-use hooks that validate bash commands for security.
+Uses an allowlist approach - only explicitly permitted commands can run.
+"""
+
+import os
+import shlex
+
+
+# Allowed commands for development tasks
+# Minimal set needed for the autonomous coding demo
+ALLOWED_COMMANDS = {
+    # File inspection
+    "ls",
+    "cat",
+    "head",
+    "tail",
+    "wc",
+    "grep",
+    # File operations (agent uses SDK tools for most file ops, but cp/mkdir needed occasionally)
+    "cp",
+    "mkdir",
+    "chmod",  # For making scripts executable; validated separately
+    # Directory
+    "pwd",
+    # Node.js development
+    "npm",
+    "node",
+    # Version control
+    "git",
+    # Process management
+    "ps",
+    "lsof",
+    "sleep",
+    "pkill",  # For killing dev servers; validated separately
+    # Script execution
+    "init.sh",  # Init scripts; validated separately
+    # JSON manipulation
+    "jq",
+    # Networking
+    "curl",
+    # Utility
+    "xargs",
+    "echo",
+    "mv",
+    "rm",
+    "npx",
+}
+
+# Commands that need additional validation even when in the allowlist
+COMMANDS_NEEDING_EXTRA_VALIDATION = {"pkill", "chmod", "init.sh"}
+
+
+def split_command_segments(command_string: str) -> list[str]:
+    """
+    Split a compound command into individual command segments.
+
+    Handles command chaining (&&, ||, ;) but not pipes (those are single commands).
+
+    Args:
+        command_string: The full shell command
+
+    Returns:
+        List of individual command segments
+    """
+    import re
+
+    # Split on && and ||. Note: this regex is not quote-aware, so an operator
+    # that appears inside a quoted argument also splits the command.
+    segments = re.split(r"\s*(?:&&|\|\|)\s*", command_string)
+
+    # Further split on semicolons
+    result = []
+    for segment in segments:
+        sub_segments = re.split(r'(?<!["\'])\s*;\s*(?!["\'])', segment)
+        for sub in sub_segments:
+            sub = sub.strip()
+            if sub:
+                result.append(sub)
+
+    return result
+
+
+def extract_commands(command_string: str) -> list[str]:
+    """
+    Extract command names from a shell command string.
+
+    Handles pipes and command chaining (&&, ||, ;). Subshells and command
+    substitution are not parsed.
+    Returns the base command names (without paths).
+
+    Args:
+        command_string: The full shell command
+
+    Returns:
+        List of command names found in the string
+    """
+    commands = []
+
+    # shlex doesn't treat ; as a separator, so we need to pre-process
+    import re
+
+    # Split on semicolons that aren't inside quotes (simple heuristic)
+    # This handles common cases like "echo hello; ls"
+    segments = re.split(r'(?<!["\'])\s*;\s*(?!["\'])', command_string)
+
+    for segment in segments:
+        segment = segment.strip()
+        if not segment:
+            continue
+
+        try:
+            tokens = shlex.split(segment)
+        except ValueError:
+            # Malformed command (unclosed quotes, etc.)
+            # Return empty to trigger block (fail-safe)
+            return []
+
+        if not tokens:
+            continue
+
+        # Track when we expect a command vs arguments
+        expect_command = True
+
+        for token in tokens:
+            # Shell operators indicate a new command follows
+            if token in ("|", "||", "&&", "&"):
+                expect_command = True
+                continue
+
+            # Skip shell keywords that precede commands
+            if token in (
+                "if",
+                "then",
+                "else",
+                "elif",
+                "fi",
+                "for",
+                "while",
+                "until",
+                "do",
+                "done",
+                "case",
+                "esac",
+                "in",
+                "!",
+                "{",
+                "}",
+            ):
+                continue
+
+            # Skip flags/options
+            if token.startswith("-"):
+                continue
+
+            # Skip variable assignments (VAR=value)
+            if "=" in token and not token.startswith("="):
+                continue
+
+            if expect_command:
+                # Extract the base command name (handle paths like /usr/bin/python)
+                cmd = os.path.basename(token)
+                commands.append(cmd)
+                expect_command = False
+
+    return commands
+
+
+def validate_pkill_command(command_string: str) -> tuple[bool, str]:
+    """
+    Validate pkill commands - only allow killing dev-related processes.
+
+    Uses shlex to parse the command, avoiding regex bypass vulnerabilities.
+
+    Returns:
+        Tuple of (is_allowed, reason_if_blocked)
+    """
+    # Allowed process names for pkill
+    allowed_process_names = {
+        "node",
+        "npm",
+        "npx",
+        "vite",
+        "next",
+    }
+
+    try:
+        tokens = shlex.split(command_string)
+    except ValueError:
+        return False, "Could not parse pkill command"
+
+    if not tokens:
+        return False, "Empty pkill command"
+
+    # Separate flags from arguments
+    args = []
+    for token in tokens[1:]:
+        if not token.startswith("-"):
+            args.append(token)
+
+    if not args:
+        return False, "pkill requires a process name"
+
+    # The target is typically the last non-flag argument
+    target = args[-1]
+
+    # For -f flag (full command line match), extract the first word as process name
+    # e.g., "pkill -f 'node server.js'" -> target is "node server.js", process is "node"
+    if " " in target:
+        target = target.split()[0]
+
+    if target in allowed_process_names:
+        return True, ""
+    return False, f"pkill only allowed for dev processes: {allowed_process_names}"
+
+
+def validate_chmod_command(command_string: str) -> tuple[bool, str]:
+    """
+    Validate chmod commands - only allow making files executable with +x.
+
+    Returns:
+        Tuple of (is_allowed, reason_if_blocked)
+    """
+    try:
+        tokens = shlex.split(command_string)
+    except ValueError:
+        return False, "Could not parse chmod command"
+
+    if not tokens or tokens[0] != "chmod":
+        return False, "Not a chmod command"
+
+    # Look for the mode argument
+    # Valid modes: +x, u+x, a+x, etc. (anything ending with +x for execute permission)
+    mode = None
+    files = []
+
+    for token in tokens[1:]:
+        if token.startswith("-"):
+            # Reject every flag, including -R (recursive chmod is not allowed)
+            return False, "chmod flags are not allowed"
+        elif mode is None:
+            mode = token
+        else:
+            files.append(token)
+
+    if mode is None:
+        return False, "chmod requires a mode"
+
+    if not files:
+        return False, "chmod requires at least one file"
+
+    # Only allow +x variants (making files executable)
+    # This matches: +x, u+x, g+x, o+x, a+x, ug+x, etc.
+    import re
+
+    if not re.match(r"^[ugoa]*\+x$", mode):
+        return False, f"chmod only allowed with +x mode, got: {mode}"
+
+    return True, ""
+
+
+def validate_init_script(command_string: str) -> tuple[bool, str]:
+    """
+    Validate init.sh script execution - only allow ./init.sh or a path ending in /init.sh.
+
+    Returns:
+        Tuple of (is_allowed, reason_if_blocked)
+    """
+    try:
+        tokens = shlex.split(command_string)
+    except ValueError:
+        return False, "Could not parse init script command"
+
+    if not tokens:
+        return False, "Empty command"
+
+    # The first token must be the init.sh script itself (arguments may follow)
+    script = tokens[0]
+
+    # Allow ./init.sh or paths ending in /init.sh
+    if script == "./init.sh" or script.endswith("/init.sh"):
+        return True, ""
+
+    return False, f"Only ./init.sh is allowed, got: {script}"
+
+
+def get_command_for_validation(cmd: str, segments: list[str]) -> str:
+    """
+    Find the specific command segment that contains the given command.
+
+    Args:
+        cmd: The command name to find
+        segments: List of command segments
+
+    Returns:
+        The segment containing the command, or empty string if not found
+    """
+    for segment in segments:
+        segment_commands = extract_commands(segment)
+        if cmd in segment_commands:
+            return segment
+    return ""
+
+
+async def bash_security_hook(input_data, tool_use_id=None, context=None):
+    """
+    Pre-tool-use hook that validates bash commands using an allowlist.
+
+    Only commands in ALLOWED_COMMANDS are permitted.
+
+    Args:
+        input_data: Dict containing tool_name and tool_input
+        tool_use_id: Optional tool use ID
+        context: Optional context
+
+    Returns:
+        Empty dict to allow, or {"decision": "block", "reason": "..."} to block
+    """
+    if input_data.get("tool_name") != "Bash":
+        return {}
+
+    command = input_data.get("tool_input", {}).get("command", "")
+    if not command:
+        return {}
+
+    # Extract all commands from the command string
+    commands = extract_commands(command)
+
+    if not commands:
+        # Could not parse - fail safe by blocking
+        return {
+            "decision": "block",
+            "reason": f"Could not parse command for security validation: {command}",
+        }
+
+    # Split into segments for per-command validation
+    segments = split_command_segments(command)
+
+    # Check each command against the allowlist
+    for cmd in commands:
+        if cmd not in ALLOWED_COMMANDS:
+            return {
+                "decision": "block",
+                "reason": f"Command '{cmd}' is not in the allowed commands list",
+            }
+
+        # Additional validation for sensitive commands
+        if cmd in COMMANDS_NEEDING_EXTRA_VALIDATION:
+            # Find the specific segment containing this command
+            cmd_segment = get_command_for_validation(cmd, segments)
+            if not cmd_segment:
+                cmd_segment = command  # Fallback to full command
+
+            if cmd == "pkill":
+                allowed, reason = validate_pkill_command(cmd_segment)
+                if not allowed:
+                    return {"decision": "block", "reason": reason}
+            elif cmd == "chmod":
+                allowed, reason = validate_chmod_command(cmd_segment)
+                if not allowed:
+                    return {"decision": "block", "reason": reason}
+            elif cmd == "init.sh":
+                allowed, reason = validate_init_script(cmd_segment)
+                if not allowed:
+                    return {"decision": "block", "reason": reason}
+
+    return {}
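`split_command_segments` in security.py splits on `&&`, `||`, and `;` with regular expressions, which cannot tell an operator from the same characters inside a quoted argument. The following sketch shows one possible quote-aware alternative built on the standard library's `shlex.shlex` with `punctuation_chars=True`, assuming Python 3.8 or later. The name `split_segments_quote_aware` and the `SEGMENT_OPERATORS` set are hypothetical and not part of this commit. One known limitation remains: a quoted argument that consists of exactly an operator, such as `'&&'`, is still read as that operator.

```python
import shlex

# Operators that end one command segment and start the next (assumed set;
# pipes are left inside a segment, as in split_command_segments).
SEGMENT_OPERATORS = {"&&", "||", ";"}


def split_segments_quote_aware(command_string: str) -> list[list[str]]:
    """Split a shell command into token lists, one list per segment.

    Raises ValueError on malformed input such as an unclosed quote.
    """
    lexer = shlex.shlex(command_string, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True  # split words on whitespace only
    lexer.commenters = ""          # treat '#' as an ordinary character

    segments: list[list[str]] = []
    current: list[str] = []
    for token in lexer:
        if token in SEGMENT_OPERATORS:
            if current:
                segments.append(current)
            current = []
        else:
            current.append(token)
    if current:
        segments.append(current)
    return segments
```

Because each segment comes back as a token list, the first token of a segment (after any pipe handling) can be compared against an allowlist without re-parsing the string.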
--- a/reference/test_security.py
+++ b/reference/test_security.py
@@ -0,0 +1,290 @@
+#!/usr/bin/env python3
+"""
+Security Hook Tests
+===================
+
+Tests for the bash command security validation logic.
+Run with: python test_security.py
+"""
+
+import asyncio
+import sys
+
+from security import (
+    bash_security_hook,
+    extract_commands,
+    validate_chmod_command,
+    validate_init_script,
+)
+
+
+def test_hook(command: str, should_block: bool) -> bool:
+    """Test a single command against the security hook."""
+    input_data = {"tool_name": "Bash", "tool_input": {"command": command}}
+    result = asyncio.run(bash_security_hook(input_data))
+    was_blocked = result.get("decision") == "block"
+
+    if was_blocked == should_block:
+        status = "PASS"
+    else:
+        status = "FAIL"
+        expected = "blocked" if should_block else "allowed"
+        actual = "blocked" if was_blocked else "allowed"
+        reason = result.get("reason", "")
+        print(f"  {status}: {command!r}")
+        print(f"         Expected: {expected}, Got: {actual}")
+        if reason:
+            print(f"         Reason: {reason}")
+        return False
+
+    print(f"  {status}: {command!r}")
+    return True
+
+
+def test_extract_commands():
+    """Test the command extraction logic."""
+    print("\nTesting command extraction:\n")
+    passed = 0
+    failed = 0
+
+    test_cases = [
+        ("ls -la", ["ls"]),
+        ("npm install && npm run build", ["npm", "npm"]),
+        ("cat file.txt | grep pattern", ["cat", "grep"]),
+        ("/usr/bin/node script.js", ["node"]),
+        ("VAR=value ls", ["ls"]),
+        ("git status || git init", ["git", "git"]),
+    ]
+
+    for cmd, expected in test_cases:
+        result = extract_commands(cmd)
+        if result == expected:
+            print(f"  PASS: {cmd!r} -> {result}")
+            passed += 1
+        else:
+            print(f"  FAIL: {cmd!r}")
+            print(f"         Expected: {expected}, Got: {result}")
+            failed += 1
+
+    return passed, failed
+
+
+def test_validate_chmod():
+    """Test chmod command validation."""
+    print("\nTesting chmod validation:\n")
+    passed = 0
+    failed = 0
+
+    # Test cases: (command, should_be_allowed, description)
+    test_cases = [
+        # Allowed cases
+        ("chmod +x init.sh", True, "basic +x"),
+        ("chmod +x script.sh", True, "+x on any script"),
+        ("chmod u+x init.sh", True, "user +x"),
+        ("chmod a+x init.sh", True, "all +x"),
+        ("chmod ug+x init.sh", True, "user+group +x"),
+        ("chmod +x file1.sh file2.sh", True, "multiple files"),
+        # Blocked cases
+        ("chmod 777 init.sh", False, "numeric mode"),
+        ("chmod 755 init.sh", False, "numeric mode 755"),
+        ("chmod +w init.sh", False, "write permission"),
+        ("chmod +r init.sh", False, "read permission"),
+        ("chmod -x init.sh", False, "remove execute"),
+        ("chmod -R +x dir/", False, "recursive flag"),
+        ("chmod --recursive +x dir/", False, "long recursive flag"),
+        ("chmod +x", False, "missing file"),
+    ]
+
+    for cmd, should_allow, description in test_cases:
+        allowed, reason = validate_chmod_command(cmd)
+        if allowed == should_allow:
+            print(f"  PASS: {cmd!r} ({description})")
+            passed += 1
+        else:
+            expected = "allowed" if should_allow else "blocked"
+            actual = "allowed" if allowed else "blocked"
+            print(f"  FAIL: {cmd!r} ({description})")
+            print(f"         Expected: {expected}, Got: {actual}")
+            if reason:
+                print(f"         Reason: {reason}")
+            failed += 1
+
+    return passed, failed
+
+
+def test_validate_init_script():
+    """Test init.sh script execution validation."""
+    print("\nTesting init.sh validation:\n")
+    passed = 0
+    failed = 0
+
+    # Test cases: (command, should_be_allowed, description)
+    test_cases = [
+        # Allowed cases
+        ("./init.sh", True, "basic ./init.sh"),
+        ("./init.sh arg1 arg2", True, "with arguments"),
+        ("/path/to/init.sh", True, "absolute path"),
+        ("../dir/init.sh", True, "relative path with init.sh"),
+        # Blocked cases
+        ("./setup.sh", False, "different script name"),
+        ("./init.py", False, "python script"),
+        ("bash init.sh", False, "bash invocation"),
+        ("sh init.sh", False, "sh invocation"),
+        ("./malicious.sh", False, "malicious script"),
+        ("./init.sh; rm -rf /", False, "command injection attempt"),
+    ]
+
+    for cmd, should_allow, description in test_cases:
+        allowed, reason = validate_init_script(cmd)
+        if allowed == should_allow:
+            print(f"  PASS: {cmd!r} ({description})")
+            passed += 1
+        else:
+            expected = "allowed" if should_allow else "blocked"
+            actual = "allowed" if allowed else "blocked"
+            print(f"  FAIL: {cmd!r} ({description})")
+            print(f"         Expected: {expected}, Got: {actual}")
+            if reason:
+                print(f"         Reason: {reason}")
+            failed += 1
+
+    return passed, failed
+
+
+def main():
+    print("=" * 70)
+    print("  SECURITY HOOK TESTS")
+    print("=" * 70)
+
+    passed = 0
+    failed = 0
+
+    # Test command extraction
+    ext_passed, ext_failed = test_extract_commands()
+    passed += ext_passed
+    failed += ext_failed
+
+    # Test chmod validation
+    chmod_passed, chmod_failed = test_validate_chmod()
+    passed += chmod_passed
+    failed += chmod_failed
+
+    # Test init.sh validation
+    init_passed, init_failed = test_validate_init_script()
+    passed += init_passed
+    failed += init_failed
+
+    # Commands that SHOULD be blocked
+    print("\nCommands that should be BLOCKED:\n")
+    dangerous = [
+        # Not in allowlist - dangerous system commands
+        "shutdown now",
+        "reboot",
+        "dd if=/dev/zero of=/dev/sda",
+        # Not in allowlist - common commands excluded from minimal set
+        # (rm, curl, and echo are in ALLOWED_COMMANDS, so the hook does not block them)
+        "wget https://example.com",
+        "python app.py",
+        "touch file.txt",
+        "kill 12345",
+        "killall node",
+        # pkill with non-dev processes
+        "pkill bash",
+        "pkill chrome",
+        "pkill python",
+        # Shell injection attempts
+        "$(echo pkill) node",
+        'eval "pkill node"',
+        'bash -c "pkill node"',
+        # chmod with disallowed modes
+        "chmod 777 file.sh",
+        "chmod 755 file.sh",
+        "chmod +w file.sh",
+        "chmod -R +x dir/",
+        # Non-init.sh scripts
+        "./setup.sh",
+        "./malicious.sh",
+        "bash script.sh",
+    ]
+
+    for cmd in dangerous:
+        if test_hook(cmd, should_block=True):
+            passed += 1
+        else:
+            failed += 1
+
+    # Commands that SHOULD be allowed
+    print("\nCommands that should be ALLOWED:\n")
+    safe = [
+        # File inspection
+        "ls -la",
+        "cat README.md",
+        "head -100 file.txt",
+        "tail -20 log.txt",
+        "wc -l file.txt",
+        "grep -r pattern src/",
+        # File operations
+        "cp file1.txt file2.txt",
+        "mkdir newdir",
+        "mkdir -p path/to/dir",
+        # Directory
+        "pwd",
+        # Node.js development
+        "npm install",
+        "npm run build",
+        "node server.js",
+        # Version control
+        "git status",
+        "git commit -m 'test'",
+        "git add . && git commit -m 'msg'",
+        # Process management
+        "ps aux",
+        "lsof -i :3000",
+        "sleep 2",
+        # Allowed pkill patterns for dev servers
+        "pkill node",
+        "pkill npm",
+        "pkill -f node",
+        "pkill -f 'node server.js'",
+        "pkill vite",
+        # Chained commands
+        "npm install && npm run build",
+        "ls | grep test",
+        # Full paths
+        "/usr/local/bin/node app.js",
+        # chmod +x (allowed)
+        "chmod +x init.sh",
+        "chmod +x script.sh",
+        "chmod u+x init.sh",
+        "chmod a+x init.sh",
+        # init.sh execution (allowed)
+        "./init.sh",
+        "./init.sh --production",
+        "/path/to/init.sh",
+        # Combined chmod and init.sh
+        "chmod +x init.sh && ./init.sh",
+    ]
+
+    for cmd in safe:
+        if test_hook(cmd, should_block=False):
+            passed += 1
+        else:
+            failed += 1
+
+    # Summary
+    print("\n" + "-" * 70)
+    print(f"  Results: {passed} passed, {failed} failed")
+    print("-" * 70)
+
+    if failed == 0:
+        print("\n  ALL TESTS PASSED")
+        return 0
+    else:
+        print(f"\n  {failed} TEST(S) FAILED")
+        return 1
+
+
+if __name__ == "__main__":
+    sys.exit(main())