mirror of
https://github.com/AutoMaker-Org/automaker.git
synced 2026-01-31 06:42:03 +00:00
initial commit
4
reference/.gitignore
vendored
Normal file
@@ -0,0 +1,4 @@
# Agent-generated output directories

# Log files
logs/
163
reference/README.md
Normal file
@@ -0,0 +1,163 @@
# Autonomous Coding Agent Demo

A minimal harness demonstrating long-running autonomous coding with the Claude Agent SDK. This demo implements a two-agent pattern (initializer + coding agent) that can build complete applications over multiple sessions.

## Prerequisites

**Required:** Install the latest versions of both Claude Code and the Claude Agent SDK:

```bash
# Install Claude Code CLI (latest version required)
npm install -g @anthropic-ai/claude-code

# Install Python dependencies
pip install -r requirements.txt
```

Verify your installations:
```bash
claude --version  # Should be latest version
pip show claude-code-sdk  # Check SDK is installed
```

**API Key:** Set your Anthropic API key (`autonomous_agent_demo.py` also accepts a `CLAUDE_CODE_OAUTH_TOKEN` instead):
```bash
export ANTHROPIC_API_KEY='your-api-key-here'
```

## Quick Start

```bash
python autonomous_agent_demo.py --project-dir ./my_project
```

For testing with limited iterations:
```bash
python autonomous_agent_demo.py --project-dir ./my_project --max-iterations 3
```

## Important Timing Expectations

> **Warning: This demo takes a long time to run!**

- **First session (initialization):** The agent generates a `feature_list.json` with 200 test cases. This can take 10-20 minutes or more and may appear to hang; this is normal, because the agent is writing out all the features.

- **Subsequent sessions:** Each coding iteration can take **5-15 minutes** depending on complexity.

- **Full app:** Building all 200 features typically requires **many hours** of total runtime across multiple sessions.

**Tip:** The 200-feature target in the prompts is designed for comprehensive coverage. If you want faster demos, you can modify `prompts/initializer_prompt.md` to reduce the feature count (e.g., 20-50 features for a quicker demo).

## How It Works

### Two-Agent Pattern

1. **Initializer Agent (Session 1):** Reads `app_spec.txt`, creates `feature_list.json` with 200 test cases, sets up project structure, and initializes git.

2. **Coding Agent (Sessions 2+):** Picks up where the previous session left off, implements features one by one, and marks them as passing in `feature_list.json`.

### Session Management

- Each session runs with a fresh context window
- Progress is persisted via `feature_list.json` and git commits
- The agent auto-continues between sessions (3 second delay)
- Press `Ctrl+C` to pause; run the same command to resume
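
Resuming works because all state lives on disk rather than in the model's context. As a minimal sketch of that idea, the function below decides which agent a new session should start with. It assumes, as the harness does, that the presence of `feature_list.json` marks a project that has already been initialized; the name `next_session_kind` is hypothetical and not part of the demo's code.

```python
from pathlib import Path


def next_session_kind(project_dir: Path) -> str:
    """Return "initializer" for a fresh project, otherwise "coding"."""
    # Assumption: feature_list.json exists only after the initializer has run.
    feature_list = project_dir / "feature_list.json"
    return "coding" if feature_list.exists() else "initializer"
```

Because the decision depends only on files in the project directory, interrupting the run and starting it again later reaches the same answer as an uninterrupted run would.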
## Security Model

This demo uses a defense-in-depth security approach (see `security.py` and `client.py`):

1. **OS-level Sandbox:** Bash commands run in an isolated environment
2. **Filesystem Restrictions:** File operations restricted to the project directory only
3. **Bash Allowlist:** Only specific commands are permitted:
   - File inspection: `ls`, `cat`, `head`, `tail`, `wc`, `grep`
   - Node.js: `npm`, `node`
   - Version control: `git`
   - Process management: `ps`, `lsof`, `sleep`, `pkill` (dev processes only)

Commands not in the allowlist are blocked by the security hook.
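
To make the allowlist idea concrete, here is a minimal sketch of how such a check could work. It is not the code in `security.py` (which is not shown here): the helper names `extract_commands` and `is_allowed` are hypothetical, the command set simply copies the list above, and the parsing is deliberately simplistic. In particular it does not validate arguments, so the "dev processes only" restriction on `pkill` is not modelled.

```python
import os
import re
import shlex

# Assumption: mirrors the commands listed in this README.
ALLOWED_COMMANDS = {
    "ls", "cat", "head", "tail", "wc", "grep",
    "npm", "node", "git", "ps", "lsof", "sleep", "pkill",
}


def extract_commands(command_line: str) -> list[str]:
    """Return the program name of each segment of a simple command line."""
    names = []
    # Split on common shell separators: ||, &&, ; and |
    for segment in re.split(r"\|\||&&|[;|]", command_line):
        tokens = shlex.split(segment)
        if tokens:
            names.append(os.path.basename(tokens[0]))
    return names


def is_allowed(command_line: str) -> bool:
    """Allow a command line only if every program in it is on the allowlist."""
    try:
        names = extract_commands(command_line)
    except ValueError:
        # Unparseable input (e.g. an unbalanced quote) is rejected outright.
        return False
    return bool(names) and all(name in ALLOWED_COMMANDS for name in names)
```

The sketch fails closed: anything it cannot parse, including a separator character inside a quoted argument, is treated as blocked rather than allowed.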
## Project Structure

```
autonomous-coding/
├── autonomous_agent_demo.py  # Main entry point
├── agent.py                  # Agent session logic
├── client.py                 # Claude SDK client configuration
├── security.py               # Bash command allowlist and validation
├── progress.py               # Progress tracking utilities
├── prompts.py                # Prompt loading utilities
├── prompts/
│   ├── app_spec.txt          # Application specification
│   ├── initializer_prompt.md # First session prompt
│   └── coding_prompt.md      # Continuation session prompt
└── requirements.txt          # Python dependencies
```

## Generated Project Structure

After running, your project directory will contain:

```
my_project/
├── feature_list.json         # Test cases (source of truth)
├── app_spec.txt              # Copied specification
├── init.sh                   # Environment setup script
├── claude-progress.txt       # Session progress notes
├── .claude_settings.json     # Security settings
└── [application files]       # Generated application code
```

## Running the Generated Application

After the agent completes (or pauses), you can run the generated application:

```bash
cd generations/my_project

# Run the setup script created by the agent
./init.sh

# Or manually (typical for Node.js apps):
npm install
npm run dev
```

The application will typically be available at `http://localhost:3000` or similar (check the agent's output or `init.sh` for the exact URL).

## Command Line Options

| Option | Description | Default |
|--------|-------------|---------|
| `--project-dir` | Directory for the project | `./autonomous_demo_project` (relative paths are placed under `generations/`) |
| `--max-iterations` | Max agent iterations | Unlimited |
| `--model` | Claude model to use | `claude-opus-4-5-20251101` |
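
The entry point places relative `--project-dir` values under `generations/`, which is why the run instructions above use `cd generations/my_project`. The sketch below shows one way to express that rule. It is a variant, not the exact logic in `autonomous_agent_demo.py`: the function name `resolve_project_dir` is hypothetical, and it compares path components instead of checking a string prefix, so it does not depend on `/` as the separator.

```python
from pathlib import Path


def resolve_project_dir(project_dir: Path, root: str = "generations") -> Path:
    """Place relative project paths under `root`; leave other paths unchanged."""
    if project_dir.is_absolute():
        return project_dir
    # Already under the root directory: nothing to prepend.
    if project_dir.parts[:1] == (root,):
        return project_dir
    return Path(root) / project_dir
```

For example, `resolve_project_dir(Path("./my_project"))` yields `generations/my_project`, while an absolute path is returned as given.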
## Customization

### Changing the Application

Edit `prompts/app_spec.txt` to specify a different application to build.

### Adjusting Feature Count

Edit `prompts/initializer_prompt.md` and change the "200 features" requirement to a smaller number for faster demos.

### Modifying Allowed Commands

Edit `security.py` to add or remove commands from `ALLOWED_COMMANDS`.

## Troubleshooting

**"Appears to hang on first run"**
This is normal. The initializer agent is generating 200 detailed test cases, which takes significant time. Watch for `[Tool: ...]` output to confirm the agent is working.

**"Command blocked by security hook"**
The agent tried to run a command not in the allowlist. This is the security system working as intended. If needed, add the command to `ALLOWED_COMMANDS` in `security.py`.

**"API key not set"**
Ensure `ANTHROPIC_API_KEY` is exported in your shell environment.

## License

Internal Anthropic use.
99
reference/SETUP.md
Normal file
@@ -0,0 +1,99 @@
# Autonomous Coding Agent Setup

This autonomous coding agent now uses the **Claude Code CLI directly** instead of the Python SDK.

## Prerequisites

1. **Claude Code** must be installed on your system
2. You must authenticate Claude Code for **headless mode** (`--print` flag)

## Authentication Setup

The `--print` (headless) mode requires a long-lived authentication token. To set this up:

### Option 1: Setup Token (Recommended)

Run this command in your own terminal (requires Claude subscription):

```bash
claude setup-token
```

This will open your browser and authenticate Claude Code for headless usage.

### Option 2: Use API Key

If you have an Anthropic API key instead:

```bash
export ANTHROPIC_API_KEY='your-api-key-here'
```

Or for OAuth tokens:

```bash
export CLAUDE_CODE_OAUTH_TOKEN='your-oauth-token-here'
```

## Usage

Once authenticated, run:

```bash
python3 autonomous_agent_demo.py --project-dir ./my_project --max-iterations 3
```

### Options:

- `--project-dir`: Directory for your project (default: `./autonomous_demo_project`)
- `--max-iterations`: Maximum number of agent iterations (default: unlimited)
- `--model`: Claude model to use (default: `opus` for Opus 4.5)

### Examples:

```bash
# Start a new project with Opus 4.5
python3 autonomous_agent_demo.py --project-dir ./my_app

# Limit iterations for testing
python3 autonomous_agent_demo.py --project-dir ./my_app --max-iterations 5

# Use a different model
python3 autonomous_agent_demo.py --project-dir ./my_app --model sonnet
```

## How It Works

The agent:

1. Creates configuration files (`.claude_settings.json`, `.mcp_config.json`)
2. Calls `claude --print` with your prompt
3. Captures the output and continues the autonomous loop
4. Uses your existing Claude Code authentication

## Troubleshooting

### "Invalid API key" Error

This means Claude Code isn't authenticated for headless mode. Run:

```bash
claude setup-token
```

### Check Authentication Status

Test if headless mode works:

```bash
echo "Hello" | claude --print --model opus
```

If this works, the autonomous agent will work too.
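
The same check can be scripted before starting a long run. The sketch below pipes a trivial prompt to `claude --print`, using only the flags shown above. The function name `headless_mode_works`, the 120-second timeout, and the success rule (exit code 0 with non-empty output) are assumptions for illustration, not documented CLI behavior. The `run` parameter exists only so the function can be exercised without the CLI installed.

```python
import subprocess
from typing import Callable


def headless_mode_works(
    model: str = "opus",
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Return True if `claude --print` answers a trivial prompt."""
    try:
        result = run(
            ["claude", "--print", "--model", model],
            input="Hello",          # same prompt as the shell check above
            capture_output=True,
            text=True,
            timeout=120,            # assumption: generous upper bound
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # `claude` is not on PATH, or the call did not finish in time.
        return False
    # Assumption: success means exit code 0 and some text on stdout.
    return result.returncode == 0 and bool(result.stdout.strip())
```

If the function returns `False`, the numbered steps in the next section are a reasonable order in which to investigate.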
### Still Having Issues?

1. Make sure Claude Code is installed: `claude --version`
2. Check that you can run Claude normally: `claude`
3. Verify `claude` is in your PATH: `which claude`
4. Try re-authenticating: `claude setup-token`
206
reference/agent.py
Normal file
@@ -0,0 +1,206 @@
"""
Agent Session Logic
===================

Core agent interaction functions for running autonomous coding sessions.
"""

import asyncio
from pathlib import Path
from typing import Optional

from claude_code_sdk import ClaudeSDKClient

from client import create_client
from progress import print_session_header, print_progress_summary
from prompts import get_initializer_prompt, get_coding_prompt, copy_spec_to_project


# Configuration
AUTO_CONTINUE_DELAY_SECONDS = 3


async def run_agent_session(
    client: ClaudeSDKClient,
    message: str,
    project_dir: Path,
) -> tuple[str, str]:
    """
    Run a single agent session using Claude Agent SDK.

    Args:
        client: Claude SDK client
        message: The prompt to send
        project_dir: Project directory path

    Returns:
        (status, response_text) where status is:
        - "continue" if agent should continue working
        - "error" if an error occurred
    """
    print("Sending prompt to Claude Agent SDK...\n")

    try:
        # Send the query
        await client.query(message)

        # Collect response text and show tool use
        response_text = ""
        async for msg in client.receive_response():
            msg_type = type(msg).__name__

            # Handle AssistantMessage (text and tool use)
            if msg_type == "AssistantMessage" and hasattr(msg, "content"):
                for block in msg.content:
                    block_type = type(block).__name__

                    if block_type == "TextBlock" and hasattr(block, "text"):
                        response_text += block.text
                        print(block.text, end="", flush=True)
                    elif block_type == "ToolUseBlock" and hasattr(block, "name"):
                        print(f"\n[Tool: {block.name}]", flush=True)
                        if hasattr(block, "input"):
                            input_str = str(block.input)
                            if len(input_str) > 200:
                                print(f" Input: {input_str[:200]}...", flush=True)
                            else:
                                print(f" Input: {input_str}", flush=True)

            # Handle UserMessage (tool results)
            elif msg_type == "UserMessage" and hasattr(msg, "content"):
                for block in msg.content:
                    block_type = type(block).__name__

                    if block_type == "ToolResultBlock":
                        result_content = getattr(block, "content", "")
                        is_error = getattr(block, "is_error", False)

                        # Check if command was blocked by security hook
                        if "blocked" in str(result_content).lower():
                            print(f" [BLOCKED] {result_content}", flush=True)
                        elif is_error:
                            # Show errors (truncated)
                            error_str = str(result_content)[:500]
                            print(f" [Error] {error_str}", flush=True)
                        else:
                            # Tool succeeded - just show brief confirmation
                            print(" [Done]", flush=True)

        print("\n" + "-" * 70 + "\n")
        return "continue", response_text

    except Exception as e:
        print(f"Error during agent session: {e}")
        return "error", str(e)


async def run_autonomous_agent(
    project_dir: Path,
    model: str,
    max_iterations: Optional[int] = None,
) -> None:
    """
    Run the autonomous agent loop.

    Args:
        project_dir: Directory for the project
        model: Claude model to use
        max_iterations: Maximum number of iterations (None for unlimited)
    """
    print("\n" + "=" * 70)
    print(" AUTONOMOUS CODING AGENT DEMO")
    print("=" * 70)
    print(f"\nProject directory: {project_dir}")
    print(f"Model: {model}")
    if max_iterations:
        print(f"Max iterations: {max_iterations}")
    else:
        print("Max iterations: Unlimited (will run until completion)")
    print()

    # Create project directory
    project_dir.mkdir(parents=True, exist_ok=True)

    # Check if this is a fresh start or continuation
    tests_file = project_dir / "feature_list.json"
    is_first_run = not tests_file.exists()

    if is_first_run:
        print("Fresh start - will use initializer agent")
        print()
        print("=" * 70)
        print(" NOTE: First session takes 10-20+ minutes!")
        print(" The agent is generating 200 detailed test cases.")
        print(" This may appear to hang - it's working. Watch for [Tool: ...] output.")
        print("=" * 70)
        print()
        # Copy the app spec into the project directory for the agent to read
        copy_spec_to_project(project_dir)
    else:
        print("Continuing existing project")
        print_progress_summary(project_dir)

    # Main loop
    iteration = 0

    while True:
        iteration += 1

        # Check max iterations
        if max_iterations and iteration > max_iterations:
            print(f"\nReached max iterations ({max_iterations})")
            print("To continue, run the script again without --max-iterations")
            break

        # Print session header
        print_session_header(iteration, is_first_run)

        # Create client (fresh context)
        client = create_client(project_dir, model)

        # Choose prompt based on session type
        if is_first_run:
            prompt = get_initializer_prompt()
            is_first_run = False  # Only use initializer once
        else:
            prompt = get_coding_prompt()

        # Run session with async context manager
        async with client:
            status, response = await run_agent_session(client, prompt, project_dir)

        # Handle status
        if status == "continue":
            print(f"\nAgent will auto-continue in {AUTO_CONTINUE_DELAY_SECONDS}s...")
            print_progress_summary(project_dir)
            await asyncio.sleep(AUTO_CONTINUE_DELAY_SECONDS)

        elif status == "error":
            print("\nSession encountered an error")
            print("Will retry with a fresh session...")
            await asyncio.sleep(AUTO_CONTINUE_DELAY_SECONDS)

        # Small delay between sessions
        if max_iterations is None or iteration < max_iterations:
            print("\nPreparing next session...\n")
            await asyncio.sleep(1)

    # Final summary
    print("\n" + "=" * 70)
    print(" SESSION COMPLETE")
    print("=" * 70)
    print(f"\nProject directory: {project_dir}")
    print_progress_summary(project_dir)

    # Print instructions for running the generated application
    print("\n" + "-" * 70)
    print(" TO RUN THE GENERATED APPLICATION:")
    print("-" * 70)
    print(f"\n cd {project_dir.resolve()}")
    print(" ./init.sh # Run the setup script")
    print(" # Or manually:")
    print(" npm install && npm run dev")
    print("\n Then open http://localhost:3000 (or check init.sh for the URL)")
    print("-" * 70)

    print("\nDone!")
123
reference/autonomous_agent_demo.py
Executable file
@@ -0,0 +1,123 @@
#!/usr/bin/env python3
"""
Autonomous Coding Agent Demo
============================

A minimal harness demonstrating long-running autonomous coding with Claude.
This script implements the two-agent pattern (initializer + coding agent) and
incorporates all the strategies from the long-running agents guide.

Example Usage:
    python autonomous_agent_demo.py --project-dir ./claude_clone_demo
    python autonomous_agent_demo.py --project-dir ./claude_clone_demo --max-iterations 5
"""

import argparse
import asyncio
import os
from pathlib import Path

from agent import run_autonomous_agent


# Configuration
# DEFAULT_MODEL = "claude-haiku-4-5-20251001"
# DEFAULT_MODEL = "claude-sonnet-4-5-20250929"
DEFAULT_MODEL = "claude-opus-4-5-20251101"


def parse_args() -> argparse.Namespace:
    """Parse command line arguments."""
    parser = argparse.ArgumentParser(
        description="Autonomous Coding Agent Demo - Long-running agent harness",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Start fresh project
  python autonomous_agent_demo.py --project-dir ./claude_clone

  # Use a specific model
  python autonomous_agent_demo.py --project-dir ./claude_clone --model claude-sonnet-4-5-20250929

  # Limit iterations for testing
  python autonomous_agent_demo.py --project-dir ./claude_clone --max-iterations 5

  # Continue existing project
  python autonomous_agent_demo.py --project-dir ./claude_clone

Environment Variables (one of the two is required, see main()):
  ANTHROPIC_API_KEY          Your Anthropic API key
  CLAUDE_CODE_OAUTH_TOKEN    Claude Code auth token (from `claude setup-token`)
        """,
    )

    parser.add_argument(
        "--project-dir",
        type=Path,
        default=Path("./autonomous_demo_project"),
        help="Directory for the project (default: generations/autonomous_demo_project). Relative paths are automatically placed in the generations/ directory.",
    )

    parser.add_argument(
        "--max-iterations",
        type=int,
        default=None,
        help="Maximum number of agent iterations (default: unlimited)",
    )

    parser.add_argument(
        "--model",
        type=str,
        default=DEFAULT_MODEL,
        help=f"Claude model to use (default: {DEFAULT_MODEL})",
    )

    return parser.parse_args()


def main() -> None:
    """Main entry point."""
    args = parse_args()

    # Check for auth: allow either API key or Claude Code auth token
    has_api_key = bool(os.environ.get("ANTHROPIC_API_KEY"))
    has_oauth_token = bool(os.environ.get("CLAUDE_CODE_OAUTH_TOKEN"))

    if not (has_api_key or has_oauth_token):
        print("Error: No Claude auth configured.")
        print("\nSet ONE of the following:")
        print(" # Standard API key from console.anthropic.com")
        print(" export ANTHROPIC_API_KEY='your-api-key-here'")
        print("\n # Or, your Claude Code auth token (from `claude setup-token`)")
        print(" export CLAUDE_CODE_OAUTH_TOKEN='your-claude-code-auth-token'")
        return

    # Automatically place projects in generations/ directory unless already specified
    project_dir = args.project_dir
    if not str(project_dir).startswith("generations/"):
        # Convert relative paths to be under generations/
        if project_dir.is_absolute():
            # If absolute path, use as-is
            pass
        else:
            # Prepend generations/ to relative paths
            project_dir = Path("generations") / project_dir

    # Run the agent
    try:
        asyncio.run(
            run_autonomous_agent(
                project_dir=project_dir,
                model=args.model,
                max_iterations=args.max_iterations,
            )
        )
    except KeyboardInterrupt:
        print("\n\nInterrupted by user")
        print("To resume, run the same command again")
    except Exception as e:
        print(f"\nFatal error: {e}")
        raise


if __name__ == "__main__":
    main()
130
reference/client.py
Normal file
@@ -0,0 +1,130 @@
"""
Claude SDK Client Configuration
===============================

Functions for creating and configuring the Claude Agent SDK client.
"""

import json
import os
from pathlib import Path

from claude_code_sdk import ClaudeCodeOptions, ClaudeSDKClient
from claude_code_sdk.types import HookMatcher

from security import bash_security_hook


# Puppeteer MCP tools for browser automation
PUPPETEER_TOOLS = [
    "mcp__puppeteer__puppeteer_navigate",
    "mcp__puppeteer__puppeteer_screenshot",
    "mcp__puppeteer__puppeteer_click",
    "mcp__puppeteer__puppeteer_fill",
    "mcp__puppeteer__puppeteer_select",
    "mcp__puppeteer__puppeteer_hover",
    "mcp__puppeteer__puppeteer_evaluate",
]

# Built-in tools
BUILTIN_TOOLS = [
    "Read",
    "Write",
    "Edit",
    "Glob",
    "Grep",
    "Bash",
]


def create_client(project_dir: Path, model: str) -> ClaudeSDKClient:
    """Create a Claude Agent SDK client with multi-layered security.

    Auth options
    ------------
    This demo supports two ways of authenticating:
    1. API key via ``ANTHROPIC_API_KEY`` (standard Claude API key)
    2. Claude Code auth token via ``CLAUDE_CODE_OAUTH_TOKEN``

    If neither is set, client creation will fail with a clear error.

    Args:
        project_dir: Directory for the project
        model: Claude model to use

    Returns:
        Configured ClaudeSDKClient

    Security layers (defense in depth):
    1. Sandbox - OS-level bash command isolation prevents filesystem escape
    2. Permissions - File operations restricted to project_dir only
    3. Security hooks - Bash commands validated against an allowlist
       (see security.py for ALLOWED_COMMANDS)
    """
    api_key = os.environ.get("ANTHROPIC_API_KEY")
    oauth_token = os.environ.get("CLAUDE_CODE_OAUTH_TOKEN")
    if not api_key and not oauth_token:
        raise ValueError(
            "No Claude auth configured. Set either ANTHROPIC_API_KEY (Claude API key) "
            "or CLAUDE_CODE_OAUTH_TOKEN (Claude Code auth token from `claude setup-token`)."
        )

    # Create comprehensive security settings
    # Note: Using relative paths ("./**") restricts access to project directory
    # since cwd is set to project_dir
    security_settings = {
        "sandbox": {"enabled": True, "autoAllowBashIfSandboxed": True},
        "permissions": {
            "defaultMode": "acceptEdits",  # Auto-approve edits within allowed directories
            "allow": [
                # Allow all file operations within the project directory
                "Read(./**)",
                "Write(./**)",
                "Edit(./**)",
                "Glob(./**)",
                "Grep(./**)",
                # Bash permission granted here, but actual commands are validated
                # by the bash_security_hook (see security.py for allowed commands)
                "Bash(*)",
                # Allow Puppeteer MCP tools for browser automation
                *PUPPETEER_TOOLS,
            ],
        },
    }

    # Ensure project directory exists before creating settings file
    project_dir.mkdir(parents=True, exist_ok=True)

    # Write settings to a file in the project directory
    settings_file = project_dir / ".claude_settings.json"
    with open(settings_file, "w") as f:
        json.dump(security_settings, f, indent=2)

    print(f"Created security settings at {settings_file}")
    print(" - Sandbox enabled (OS-level bash isolation)")
    print(f" - Filesystem restricted to: {project_dir.resolve()}")
    print(" - Bash commands restricted to allowlist (see security.py)")
    print(" - MCP servers: puppeteer (browser automation)")
    print()

    return ClaudeSDKClient(
        options=ClaudeCodeOptions(
            model=model,
            system_prompt="You are an expert full-stack developer building a production-quality web application.",
            allowed_tools=[
                *BUILTIN_TOOLS,
                *PUPPETEER_TOOLS,
            ],
            mcp_servers={
                "puppeteer": {"command": "npx", "args": ["puppeteer-mcp-server"]}
            },
            hooks={
                "PreToolUse": [
                    HookMatcher(matcher="Bash", hooks=[bash_security_hook]),
                ],
            },
            max_turns=1000,
            cwd=str(project_dir.resolve()),
            settings=str(settings_file.resolve()),  # Use absolute path
        )
    )
57
reference/progress.py
Normal file
@@ -0,0 +1,57 @@
"""
Progress Tracking Utilities
===========================

Functions for tracking and displaying progress of the autonomous coding agent.
"""

import json
from pathlib import Path


def count_passing_tests(project_dir: Path) -> tuple[int, int]:
    """
    Count passing and total tests in feature_list.json.

    Args:
        project_dir: Directory containing feature_list.json

    Returns:
        (passing_count, total_count)
    """
    tests_file = project_dir / "feature_list.json"

    if not tests_file.exists():
        return 0, 0

    try:
        with open(tests_file, "r") as f:
            tests = json.load(f)

        total = len(tests)
        passing = sum(1 for test in tests if test.get("passes", False))

        return passing, total
    except (json.JSONDecodeError, IOError):
        return 0, 0


def print_session_header(session_num: int, is_initializer: bool) -> None:
    """Print a formatted header for the session."""
    session_type = "INITIALIZER" if is_initializer else "CODING AGENT"

    print("\n" + "=" * 70)
    print(f" SESSION {session_num}: {session_type}")
    print("=" * 70)
    print()


def print_progress_summary(project_dir: Path) -> None:
    """Print a summary of current progress."""
    passing, total = count_passing_tests(project_dir)

    if total > 0:
        percentage = (passing / total) * 100
        print(f"\nProgress: {passing}/{total} tests passing ({percentage:.1f}%)")
    else:
        print("\nProgress: feature_list.json not yet created")
37
reference/prompts.py
Normal file
@@ -0,0 +1,37 @@
"""
Prompt Loading Utilities
========================

Functions for loading prompt templates from the prompts directory.
"""

import shutil
from pathlib import Path


PROMPTS_DIR = Path(__file__).parent / "prompts"


def load_prompt(name: str) -> str:
    """Load a prompt template from the prompts directory."""
    prompt_path = PROMPTS_DIR / f"{name}.md"
    return prompt_path.read_text()


def get_initializer_prompt() -> str:
    """Load the initializer prompt."""
    return load_prompt("initializer_prompt")


def get_coding_prompt() -> str:
    """Load the coding agent prompt."""
    return load_prompt("coding_prompt")


def copy_spec_to_project(project_dir: Path) -> None:
    """Copy the app spec file into the project directory for the agent to read."""
    spec_source = PROMPTS_DIR / "app_spec.txt"
    spec_dest = project_dir / "app_spec.txt"
    if not spec_dest.exists():
        shutil.copy(spec_source, spec_dest)
        print("Copied app_spec.txt to project directory")
1362
reference/prompts/app_spec.txt
Normal file
File diff suppressed because it is too large
291
reference/prompts/coding_prompt.md
Normal file
@@ -0,0 +1,291 @@
## YOUR ROLE - CODING AGENT

You are continuing work on a long-running autonomous development task.
This is a FRESH context window - you have no memory of previous sessions.

### STEP 1: GET YOUR BEARINGS (MANDATORY)

Start by orienting yourself:

```bash
# 1. See your working directory
pwd

# 2. List files to understand project structure
ls -la

# 3. Read the project specification to understand what you're building
cat app_spec.txt

# 4. Read the feature list to see all work
cat feature_list.json | head -50

# 5. Read progress notes from previous sessions
cat claude-progress.txt

# 6. Check recent git history
git log --oneline -20

# 7. Count remaining tests
cat feature_list.json | grep '"passes": false' | wc -l
```

Understanding the `app_spec.txt` is critical - it contains the full requirements
for the application you're building.

### STEP 2: START SERVERS (IF NOT RUNNING)

If `init.sh` exists, run it:

```bash
chmod +x init.sh
./init.sh
```

Otherwise, start servers manually and document the process.

### STEP 3: VERIFICATION TEST (CRITICAL!)

**MANDATORY BEFORE NEW WORK:**

The previous session may have introduced bugs. Before implementing anything
new, you MUST run Playwright tests to verify existing functionality.

```bash
# Run all existing Playwright tests
npx playwright test

# Or run tests for a specific feature
npx playwright test tests/[feature-name].spec.ts
```

If Playwright tests don't exist yet, create them in a `tests/` directory before proceeding.

**If any tests fail:**

- Mark that feature as "passes": false immediately in feature_list.json
- Fix all failing tests BEFORE moving to new features
- This includes UI bugs like:
  - White-on-white text or poor contrast
  - Random characters displayed
  - Incorrect timestamps
  - Layout issues or overflow
  - Buttons too close together
  - Missing hover states
  - Console errors
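
If you prefer to update the flag from a script rather than by hand, the following is a minimal sketch. It assumes `feature_list.json` is a JSON array of objects, each with a boolean `passes` field (which matches how `progress.py` reads the file). The function name `set_feature_status` and addressing a feature by its position in the list are hypothetical choices for illustration.

```python
import json
from pathlib import Path


def set_feature_status(feature_file: Path, index: int, passes: bool) -> None:
    """Set the "passes" flag of one feature and write the list back to disk."""
    features = json.loads(feature_file.read_text())
    # Only the flag changes; every other field of the entry is left untouched.
    features[index]["passes"] = passes
    feature_file.write_text(json.dumps(features, indent=2) + "\n")
```

For example, `set_feature_status(Path("feature_list.json"), 0, False)` marks the first feature as failing again.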
### STEP 4: CHOOSE ONE FEATURE TO IMPLEMENT

Look at feature_list.json and find the highest-priority feature with "passes": false.

Focus on completing one feature perfectly and completing its testing steps in this session before moving on to other features.
It's ok if you only complete one feature in this session, as there will be more sessions later that continue to make progress.

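The selection rule in Step 4 can be sketched on the harness side in Python. This is only an illustration: it assumes `feature_list.json` is a JSON array ordered by priority (highest first, as the initializer prompt requests), and `next_feature` is a hypothetical helper name, not something this repository provides.

```python
from typing import Optional


def next_feature(features: list[dict]) -> Optional[tuple[int, dict]]:
    """Return (index, feature) for the first entry whose "passes" is not true.

    Assumption: the list is ordered by priority, highest first, so the
    first non-passing entry is the highest-priority remaining feature.
    """
    for index, feature in enumerate(features):
        if not feature.get("passes", False):
            return index, feature
    return None  # every feature is already marked as passing
```

Given the parsed contents of `feature_list.json`, the returned index identifies the one entry to work on for the session.
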
### STEP 5: IMPLEMENT THE FEATURE

Implement the chosen feature thoroughly:

1. Write the code (frontend and/or backend as needed)
2. Write a Playwright happy path test for the feature (see Step 6)
3. Run the test and fix any issues discovered
4. Verify all tests pass before moving on

### STEP 6: VERIFY WITH PLAYWRIGHT TESTS

**CRITICAL:** You MUST verify features by writing and running Playwright tests.

**Write Happy Path Tests:**

For each feature, write a Playwright test that covers the happy path - the main user flow that should work correctly. These tests are fast to run and provide quick feedback.

```bash
# Example: Create test file
# tests/[feature-name].spec.ts

# Run the specific test
npx playwright test tests/[feature-name].spec.ts

# Run with headed mode to see the browser (useful for debugging)
npx playwright test tests/[feature-name].spec.ts --headed
```

**Test Structure (example):**

```typescript
import { test, expect } from "@playwright/test";

test("user can send a message and receive response", async ({ page }) => {
  await page.goto("http://localhost:3000");

  // Happy path: main user flow
  await page.fill('[data-testid="message-input"]', "Hello world");
  await page.click('[data-testid="send-button"]');

  // Verify the expected outcome
  await expect(page.locator('[data-testid="message-list"]')).toContainText(
    "Hello world"
  );
});
```

**DO:**

- Write tests that cover the primary user workflow (happy path)
- Use `data-testid` attributes for reliable selectors
- Run tests frequently during development
- Keep tests fast and focused

**DON'T:**

- Only test with curl commands (backend testing alone is insufficient)
- Write overly complex tests with many edge cases initially
- Skip running tests before marking features as passing
- Mark tests passing without all Playwright tests green
- Increase any Playwright timeouts past 10s

### STEP 7: UPDATE feature_list.json (CAREFULLY!)

**YOU CAN ONLY MODIFY ONE FIELD: "passes"**

After thorough verification, change:

```json
"passes": false
```

to:

```json
"passes": true
```

**NEVER:**

- Remove tests
- Edit test descriptions
- Modify test steps
- Combine or consolidate tests
- Reorder tests

**ONLY CHANGE "passes" FIELD AFTER ALL PLAYWRIGHT TESTS PASS.**

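The one-field rule above can be illustrated with a small harness-side Python sketch. It assumes the feature list has been parsed into a list of dicts with a boolean `passes` key; `mark_passing` is a hypothetical helper, not part of this repository.

```python
import copy


def mark_passing(features: list[dict], index: int) -> list[dict]:
    """Return a copy of the list in which only features[index]["passes"] is True.

    The input list is left untouched, and no entry is added, removed,
    reordered, or otherwise edited.
    """
    updated = copy.deepcopy(features)
    updated[index]["passes"] = True
    return updated
```

Writing the result back with `json.dump(..., indent=2)` keeps the content identical but may change the file's whitespace, so the on-disk diff can be larger than the single field that changed.
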
### STEP 8: COMMIT YOUR PROGRESS

Make a descriptive git commit:

```bash
git add .
git commit -m "Implement [feature name] - verified with Playwright tests

- Added [specific changes]
- Added/updated Playwright tests in tests/
- All tests passing
- Updated feature_list.json: marked test #X as passing
"
git push origin main
```

### STEP 9: UPDATE PROGRESS NOTES

Update `claude-progress.txt` with:

- What you accomplished this session
- Which test(s) you completed
- Any issues discovered or fixed
- What should be worked on next
- Current completion status (e.g., "45/200 tests passing")

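The completion status line can be derived from the feature list rather than counted by hand. A minimal Python sketch, assuming the parsed `feature_list.json` is a list of dicts with a boolean `passes` key; `completion_status` is a hypothetical name used only for illustration.

```python
def completion_status(features: list[dict]) -> str:
    """Format the status line, e.g. "45/200 tests passing"."""
    passing = sum(1 for feature in features if feature.get("passes") is True)
    return f"{passing}/{len(features)} tests passing"
```
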
### STEP 10: END SESSION CLEANLY

Before context fills up:

1. Commit all working code
2. Update claude-progress.txt
3. Update feature_list.json if tests verified
4. Ensure no uncommitted changes
5. Leave app in working state (no broken features)

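Item 4 can be checked mechanically. The sketch below is a harness-side Python illustration that relies on `git status --porcelain`, which prints one line per changed or untracked path and nothing when the working tree is clean. The function names are hypothetical, and `working_tree_is_clean` assumes `git` is on the PATH and `repo_dir` is inside a repository.

```python
import subprocess


def is_clean(porcelain_output: str) -> bool:
    """True when `git status --porcelain` printed nothing."""
    return porcelain_output.strip() == ""


def working_tree_is_clean(repo_dir: str = ".") -> bool:
    """Run git in repo_dir and report whether there are uncommitted changes."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raise if git itself fails (e.g. not a repository)
    )
    return is_clean(result.stdout)
```

Untracked files also appear in the porcelain output, so a forgotten new file counts as "not clean" here.
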
---

## TESTING REQUIREMENTS

**ALL testing must use Playwright tests.**

**Setup (if not already done):**

```bash
# Install Playwright
npm install -D @playwright/test

# Install browsers
npx playwright install
```

**Writing Tests:**

Create tests in the `tests/` directory with `.spec.ts` extension.

```typescript
// tests/example.spec.ts
import { test, expect } from "@playwright/test";

test.describe("Feature Name", () => {
  test("happy path: user completes main workflow", async ({ page }) => {
    await page.goto("http://localhost:3000");

    // Interact with UI elements
    await page.click('button[data-testid="action"]');
    await page.fill('input[data-testid="input"]', "test value");

    // Assert expected outcomes
    await expect(page.locator('[data-testid="result"]')).toBeVisible();
  });
});
```

**Running Tests:**

```bash
# Run all tests (fast, headless)
npx playwright test

# Run specific test file
npx playwright test tests/feature.spec.ts

# Run with browser visible (for debugging)
npx playwright test --headed

# Run with UI mode (interactive debugging)
npx playwright test --ui
```

**Best Practices:**

- Add `data-testid` attributes to elements for reliable selectors
- Focus on happy path tests first - they're fast and catch most regressions
- Keep tests independent and isolated
- Write tests as you implement features, not after

---

## IMPORTANT REMINDERS

**Your Goal:** Production-quality application with all 200+ tests passing

**This Session's Goal:** Complete at least one feature perfectly

**Priority:** Fix broken tests before implementing new features

**Quality Bar:**

- Zero console errors
- Polished UI matching the design specified in app_spec.txt (use the landing page and the generate page as the reference for how the design should look)
- All features work end-to-end through the UI
- Fast, responsive, professional

**You have unlimited time.** Take as long as needed to get it right. The most important thing is that you
leave the code base in a clean state before terminating the session (Step 10).

---

Begin by running Step 1 (Get Your Bearings).
106
reference/prompts/initializer_prompt.md
Normal file
@@ -0,0 +1,106 @@
## YOUR ROLE - INITIALIZER AGENT (Session 1 of Many)

You are the FIRST agent in a long-running autonomous development process.
Your job is to set up the foundation for all future coding agents.

### FIRST: Read the Project Specification

Start by reading `app_spec.txt` in your working directory. This file contains
the complete specification for what you need to build. Read it carefully
before proceeding.

### CRITICAL FIRST TASK: Create feature_list.json

Based on `app_spec.txt`, create a file called `feature_list.json` with 200 detailed
end-to-end test cases. This file is the single source of truth for what
needs to be built.

**Format:**
```json
[
  {
    "category": "functional",
    "description": "Brief description of the feature and what this test verifies",
    "steps": [
      "Step 1: Navigate to relevant page",
      "Step 2: Perform action",
      "Step 3: Verify expected result"
    ],
    "passes": false
  },
  {
    "category": "style",
    "description": "Brief description of UI/UX requirement",
    "steps": [
      "Step 1: Navigate to page",
      "Step 2: Take screenshot",
      "Step 3: Verify visual requirements"
    ],
    "passes": false
  }
]
```

**Requirements for feature_list.json:**
- Minimum 200 features total with testing steps for each
- Both "functional" and "style" categories
- Mix of narrow tests (2-5 steps) and comprehensive tests (10+ steps)
- At least 25 tests MUST have 10+ steps each
- Order features by priority: fundamental features first
- ALL tests start with "passes": false
- Cover every feature in the spec exhaustively

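The countable requirements in this list can be checked mechanically once the file exists. The following Python sketch is an illustration only: `validate_feature_list` is a hypothetical helper, and it covers just the rules that can be verified without reading the spec (total count, categories, long tests, initial `passes` values).

```python
def validate_feature_list(features: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []

    if len(features) < 200:
        problems.append(f"expected at least 200 features, found {len(features)}")

    categories = {feature.get("category") for feature in features}
    for required in ("functional", "style"):
        if required not in categories:
            problems.append(f"no features in category {required!r}")

    # Comprehensive tests are those with 10 or more steps.
    long_tests = sum(1 for f in features if len(f.get("steps", [])) >= 10)
    if long_tests < 25:
        problems.append(f"expected at least 25 tests with 10+ steps, found {long_tests}")

    if any(f.get("passes") is not False for f in features):
        problems.append('every feature must start with "passes": false')

    return problems
```

Priority ordering and exhaustive coverage of the spec still need a human or agent review; this sketch does not attempt them.
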
**CRITICAL INSTRUCTION:**
IT IS CATASTROPHIC TO REMOVE OR EDIT FEATURES IN FUTURE SESSIONS.
Features can ONLY be marked as passing (change "passes": false to "passes": true).
Never remove features, never edit descriptions, never modify testing steps.
This ensures no functionality is missed.

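One way to enforce this rule between sessions is to compare the feature list before and after a session and flag anything other than a `passes` change. The Python sketch below is an illustration under assumptions: `illegal_changes` is a hypothetical helper, and both arguments are the parsed JSON arrays (for example, one read from the previous git commit and one from the working tree).

```python
def illegal_changes(old: list[dict], new: list[dict]) -> list[str]:
    """Describe every change between two versions other than a "passes" update."""
    if len(old) != len(new):
        # Features were added, removed, or merged.
        return [f"feature count changed from {len(old)} to {len(new)}"]

    problems = []
    for index, (before, after) in enumerate(zip(old, new)):
        frozen_before = {k: v for k, v in before.items() if k != "passes"}
        frozen_after = {k: v for k, v in after.items() if k != "passes"}
        if frozen_before != frozen_after:
            problems.append(f"feature {index}: a field other than 'passes' changed")
    return problems
```

Because entries are compared position by position, reordering is reported as well: the moved entries no longer match their old neighbours.
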
### SECOND TASK: Create init.sh

Create a script called `init.sh` that future agents can use to quickly
set up and run the development environment. The script should:

1. Install any required dependencies
2. Start any necessary servers or services
3. Print helpful information about how to access the running application

Base the script on the technology stack specified in `app_spec.txt`.

### THIRD TASK: Initialize Git

Create a git repository and make your first commit with:
- feature_list.json (complete with all 200+ features)
- init.sh (environment setup script)
- README.md (project overview and setup instructions)

Commit message: "Initial setup: feature_list.json, init.sh, and project structure"

### FOURTH TASK: Create Project Structure

Set up the basic project structure based on what's specified in `app_spec.txt`.
This typically includes directories for frontend, backend, and any other
components mentioned in the spec.

### OPTIONAL: Start Implementation

If you have time remaining in this session, you may begin implementing
the highest-priority features from feature_list.json. Remember:
- Work on ONE feature at a time
- Test thoroughly before marking "passes": true
- Commit your progress before session ends

### ENDING THIS SESSION

Before your context fills up:
1. Commit all work with descriptive messages
2. Create `claude-progress.txt` with a summary of what you accomplished
3. Ensure feature_list.json is complete and saved
4. Leave the environment in a clean, working state

The next agent will continue from here with a fresh context window.

---

**Remember:** You have unlimited time across many sessions. Focus on
quality over speed. Production-ready is the goal.
1
reference/requirements.txt
Normal file
@@ -0,0 +1 @@
claude-code-sdk>=0.0.25
370
reference/security.py
Normal file
@@ -0,0 +1,370 @@
"""
Security Hooks for Autonomous Coding Agent
==========================================

Pre-tool-use hooks that validate bash commands for security.
Uses an allowlist approach - only explicitly permitted commands can run.
"""

import os
import shlex


# Allowed commands for development tasks
# Minimal set needed for the autonomous coding demo
ALLOWED_COMMANDS = {
    # File inspection
    "ls",
    "cat",
    "head",
    "tail",
    "wc",
    "grep",
    # File operations (agent uses SDK tools for most file ops, but cp/mkdir needed occasionally)
    "cp",
    "mkdir",
    "chmod",  # For making scripts executable; validated separately
    # Directory
    "pwd",
    # Node.js development
    "npm",
    "node",
    # Version control
    "git",
    # Process management
    "ps",
    "lsof",
    "sleep",
    "pkill",  # For killing dev servers; validated separately
    # Script execution
    "init.sh",  # Init scripts; validated separately
    # JSON manipulation
    "jq",
    # Networking
    "curl",
    # Utility
    "xargs",
    "echo",
    "mv",
    "rm",
    "npx",
}

# Commands that need additional validation even when in the allowlist
COMMANDS_NEEDING_EXTRA_VALIDATION = {"pkill", "chmod", "init.sh"}


def split_command_segments(command_string: str) -> list[str]:
    """
    Split a compound command into individual command segments.

    Handles command chaining (&&, ||, ;) but not pipes (those are single commands).

    Args:
        command_string: The full shell command

    Returns:
        List of individual command segments
    """
    import re

    # Split on && and || so each segment can be validated on its own.
    # Note: this regex does not track quoting, so an && or || inside a
    # quoted string is treated as a separator too.
    segments = re.split(r"\s*(?:&&|\|\|)\s*", command_string)

    # Further split on semicolons
    result = []
    for segment in segments:
        sub_segments = re.split(r'(?<!["\'])\s*;\s*(?!["\'])', segment)
        for sub in sub_segments:
            sub = sub.strip()
            if sub:
                result.append(sub)

    return result


def extract_commands(command_string: str) -> list[str]:
    """
    Extract command names from a shell command string.

    Handles pipes, command chaining (&&, ||, ;), and subshells.
    Returns the base command names (without paths).

    Args:
        command_string: The full shell command

    Returns:
        List of command names found in the string
    """
    commands = []

    # shlex doesn't treat ; as a separator, so we need to pre-process
    import re

    # Split on semicolons that aren't inside quotes (simple heuristic)
    # This handles common cases like "echo hello; ls"
    segments = re.split(r'(?<!["\'])\s*;\s*(?!["\'])', command_string)

    for segment in segments:
        segment = segment.strip()
        if not segment:
            continue

        try:
            tokens = shlex.split(segment)
        except ValueError:
            # Malformed command (unclosed quotes, etc.)
            # Return empty to trigger block (fail-safe)
            return []

        if not tokens:
            continue

        # Track when we expect a command vs arguments
        expect_command = True

        for token in tokens:
            # Shell operators indicate a new command follows
            if token in ("|", "||", "&&", "&"):
                expect_command = True
                continue

            # Skip shell keywords that precede commands
            if token in (
                "if",
                "then",
                "else",
                "elif",
                "fi",
                "for",
                "while",
                "until",
                "do",
                "done",
                "case",
                "esac",
                "in",
                "!",
                "{",
                "}",
            ):
                continue

            # Skip flags/options
            if token.startswith("-"):
                continue

            # Skip variable assignments (VAR=value)
            if "=" in token and not token.startswith("="):
                continue

            if expect_command:
                # Extract the base command name (handle paths like /usr/bin/python)
                cmd = os.path.basename(token)
                commands.append(cmd)
                expect_command = False

    return commands


def validate_pkill_command(command_string: str) -> tuple[bool, str]:
    """
    Validate pkill commands - only allow killing dev-related processes.

    Uses shlex to parse the command, avoiding regex bypass vulnerabilities.

    Returns:
        Tuple of (is_allowed, reason_if_blocked)
    """
    # Allowed process names for pkill
    allowed_process_names = {
        "node",
        "npm",
        "npx",
        "vite",
        "next",
    }

    try:
        tokens = shlex.split(command_string)
    except ValueError:
        return False, "Could not parse pkill command"

    if not tokens:
        return False, "Empty pkill command"

    # Separate flags from arguments
    args = []
    for token in tokens[1:]:
        if not token.startswith("-"):
            args.append(token)

    if not args:
        return False, "pkill requires a process name"

    # The target is typically the last non-flag argument
    target = args[-1]

    # For -f flag (full command line match), extract the first word as process name
    # e.g., "pkill -f 'node server.js'" -> target is "node server.js", process is "node"
    if " " in target:
        target = target.split()[0]

    if target in allowed_process_names:
        return True, ""
    return False, f"pkill only allowed for dev processes: {allowed_process_names}"


def validate_chmod_command(command_string: str) -> tuple[bool, str]:
    """
    Validate chmod commands - only allow making files executable with +x.

    Returns:
        Tuple of (is_allowed, reason_if_blocked)
    """
    try:
        tokens = shlex.split(command_string)
    except ValueError:
        return False, "Could not parse chmod command"

    if not tokens or tokens[0] != "chmod":
        return False, "Not a chmod command"

    # Look for the mode argument
    # Valid modes: +x, u+x, a+x, etc. (anything ending with +x for execute permission)
    mode = None
    files = []

    for token in tokens[1:]:
        if token.startswith("-"):
            # Reject any flag, including -R (recursive chmod is not allowed)
            return False, "chmod flags are not allowed"
        elif mode is None:
            mode = token
        else:
            files.append(token)

    if mode is None:
        return False, "chmod requires a mode"

    if not files:
        return False, "chmod requires at least one file"

    # Only allow +x variants (making files executable)
    # This matches: +x, u+x, g+x, o+x, a+x, ug+x, etc.
    import re

    if not re.match(r"^[ugoa]*\+x$", mode):
        return False, f"chmod only allowed with +x mode, got: {mode}"

    return True, ""


def validate_init_script(command_string: str) -> tuple[bool, str]:
    """
    Validate init.sh script execution - only allow ./init.sh or a path
    ending in /init.sh.

    Returns:
        Tuple of (is_allowed, reason_if_blocked)
    """
    try:
        tokens = shlex.split(command_string)
    except ValueError:
        return False, "Could not parse init script command"

    if not tokens:
        return False, "Empty command"

    # The script is the first token; any arguments after it are not inspected
    script = tokens[0]

    # Allow ./init.sh or paths ending in /init.sh
    if script == "./init.sh" or script.endswith("/init.sh"):
        return True, ""

    return False, f"Only ./init.sh is allowed, got: {script}"


def get_command_for_validation(cmd: str, segments: list[str]) -> str:
    """
    Find the specific command segment that contains the given command.

    Args:
        cmd: The command name to find
        segments: List of command segments

    Returns:
        The segment containing the command, or empty string if not found
    """
    for segment in segments:
        segment_commands = extract_commands(segment)
        if cmd in segment_commands:
            return segment
    return ""


async def bash_security_hook(input_data, tool_use_id=None, context=None):
    """
    Pre-tool-use hook that validates bash commands using an allowlist.

    Only commands in ALLOWED_COMMANDS are permitted.

    Args:
        input_data: Dict containing tool_name and tool_input
        tool_use_id: Optional tool use ID
        context: Optional context

    Returns:
        Empty dict to allow, or {"decision": "block", "reason": "..."} to block
    """
    if input_data.get("tool_name") != "Bash":
        return {}

    command = input_data.get("tool_input", {}).get("command", "")
    if not command:
        return {}

    # Extract all commands from the command string
    commands = extract_commands(command)

    if not commands:
        # Could not parse - fail safe by blocking
        return {
            "decision": "block",
            "reason": f"Could not parse command for security validation: {command}",
        }

    # Split into segments for per-command validation
    segments = split_command_segments(command)

    # Check each command against the allowlist
    for cmd in commands:
        if cmd not in ALLOWED_COMMANDS:
            return {
                "decision": "block",
                "reason": f"Command '{cmd}' is not in the allowed commands list",
            }

        # Additional validation for sensitive commands
        if cmd in COMMANDS_NEEDING_EXTRA_VALIDATION:
            # Find the specific segment containing this command
            cmd_segment = get_command_for_validation(cmd, segments)
            if not cmd_segment:
                cmd_segment = command  # Fallback to full command

            if cmd == "pkill":
                allowed, reason = validate_pkill_command(cmd_segment)
                if not allowed:
                    return {"decision": "block", "reason": reason}
            elif cmd == "chmod":
                allowed, reason = validate_chmod_command(cmd_segment)
                if not allowed:
                    return {"decision": "block", "reason": reason}
            elif cmd == "init.sh":
                allowed, reason = validate_init_script(cmd_segment)
                if not allowed:
                    return {"decision": "block", "reason": reason}

    return {}
290
reference/test_security.py
Normal file
@@ -0,0 +1,290 @@
#!/usr/bin/env python3
"""
Security Hook Tests
===================

Tests for the bash command security validation logic.
Run with: python test_security.py
"""

import asyncio
import sys

from security import (
    bash_security_hook,
    extract_commands,
    validate_chmod_command,
    validate_init_script,
)


def test_hook(command: str, should_block: bool) -> bool:
    """Test a single command against the security hook."""
    input_data = {"tool_name": "Bash", "tool_input": {"command": command}}
    result = asyncio.run(bash_security_hook(input_data))
    was_blocked = result.get("decision") == "block"

    if was_blocked == should_block:
        status = "PASS"
    else:
        status = "FAIL"
        expected = "blocked" if should_block else "allowed"
        actual = "blocked" if was_blocked else "allowed"
        reason = result.get("reason", "")
        print(f" {status}: {command!r}")
        print(f" Expected: {expected}, Got: {actual}")
        if reason:
            print(f" Reason: {reason}")
        return False

    print(f" {status}: {command!r}")
    return True


def test_extract_commands():
    """Test the command extraction logic."""
    print("\nTesting command extraction:\n")
    passed = 0
    failed = 0

    test_cases = [
        ("ls -la", ["ls"]),
        ("npm install && npm run build", ["npm", "npm"]),
        ("cat file.txt | grep pattern", ["cat", "grep"]),
        ("/usr/bin/node script.js", ["node"]),
        ("VAR=value ls", ["ls"]),
        ("git status || git init", ["git", "git"]),
    ]

    for cmd, expected in test_cases:
        result = extract_commands(cmd)
        if result == expected:
            print(f" PASS: {cmd!r} -> {result}")
            passed += 1
        else:
            print(f" FAIL: {cmd!r}")
            print(f" Expected: {expected}, Got: {result}")
            failed += 1

    return passed, failed


def test_validate_chmod():
    """Test chmod command validation."""
    print("\nTesting chmod validation:\n")
    passed = 0
    failed = 0

    # Test cases: (command, should_be_allowed, description)
    test_cases = [
        # Allowed cases
        ("chmod +x init.sh", True, "basic +x"),
        ("chmod +x script.sh", True, "+x on any script"),
        ("chmod u+x init.sh", True, "user +x"),
        ("chmod a+x init.sh", True, "all +x"),
        ("chmod ug+x init.sh", True, "user+group +x"),
        ("chmod +x file1.sh file2.sh", True, "multiple files"),
        # Blocked cases
        ("chmod 777 init.sh", False, "numeric mode"),
        ("chmod 755 init.sh", False, "numeric mode 755"),
        ("chmod +w init.sh", False, "write permission"),
        ("chmod +r init.sh", False, "read permission"),
        ("chmod -x init.sh", False, "remove execute"),
        ("chmod -R +x dir/", False, "recursive flag"),
        ("chmod --recursive +x dir/", False, "long recursive flag"),
        ("chmod +x", False, "missing file"),
    ]

    for cmd, should_allow, description in test_cases:
        allowed, reason = validate_chmod_command(cmd)
        if allowed == should_allow:
            print(f" PASS: {cmd!r} ({description})")
            passed += 1
        else:
            expected = "allowed" if should_allow else "blocked"
            actual = "allowed" if allowed else "blocked"
            print(f" FAIL: {cmd!r} ({description})")
            print(f" Expected: {expected}, Got: {actual}")
            if reason:
                print(f" Reason: {reason}")
            failed += 1

    return passed, failed


def test_validate_init_script():
    """Test init.sh script execution validation."""
    print("\nTesting init.sh validation:\n")
    passed = 0
    failed = 0

    # Test cases: (command, should_be_allowed, description)
    test_cases = [
        # Allowed cases
        ("./init.sh", True, "basic ./init.sh"),
        ("./init.sh arg1 arg2", True, "with arguments"),
        ("/path/to/init.sh", True, "absolute path"),
        ("../dir/init.sh", True, "relative path with init.sh"),
        # Blocked cases
        ("./setup.sh", False, "different script name"),
        ("./init.py", False, "python script"),
        ("bash init.sh", False, "bash invocation"),
        ("sh init.sh", False, "sh invocation"),
        ("./malicious.sh", False, "malicious script"),
        ("./init.sh; rm -rf /", False, "command injection attempt"),
    ]

    for cmd, should_allow, description in test_cases:
        allowed, reason = validate_init_script(cmd)
        if allowed == should_allow:
            print(f" PASS: {cmd!r} ({description})")
            passed += 1
        else:
            expected = "allowed" if should_allow else "blocked"
            actual = "allowed" if allowed else "blocked"
            print(f" FAIL: {cmd!r} ({description})")
            print(f" Expected: {expected}, Got: {actual}")
            if reason:
                print(f" Reason: {reason}")
            failed += 1

    return passed, failed


def main():
    print("=" * 70)
    print(" SECURITY HOOK TESTS")
    print("=" * 70)

    passed = 0
    failed = 0

    # Test command extraction
    ext_passed, ext_failed = test_extract_commands()
    passed += ext_passed
    failed += ext_failed

    # Test chmod validation
    chmod_passed, chmod_failed = test_validate_chmod()
    passed += chmod_passed
    failed += chmod_failed

    # Test init.sh validation
    init_passed, init_failed = test_validate_init_script()
    passed += init_passed
    failed += init_failed

    # Commands that SHOULD be blocked
    print("\nCommands that should be BLOCKED:\n")
    dangerous = [
        # Not in allowlist - dangerous system commands
        "shutdown now",
        "reboot",
        # NOTE: "rm" is in ALLOWED_COMMANDS in security.py, so the hook as
        # written does not block this case.
        "rm -rf /",
        "dd if=/dev/zero of=/dev/sda",
        # Not in allowlist - common commands excluded from minimal set
        # ("curl" and "echo" are in ALLOWED_COMMANDS, so they are not listed here)
        "wget https://example.com",
        "python app.py",
        "touch file.txt",
        "kill 12345",
        "killall node",
        # pkill with non-dev processes
        "pkill bash",
        "pkill chrome",
        "pkill python",
        # Shell injection attempts
        "$(echo pkill) node",
        'eval "pkill node"',
        'bash -c "pkill node"',
        # chmod with disallowed modes
        "chmod 777 file.sh",
        "chmod 755 file.sh",
        "chmod +w file.sh",
        "chmod -R +x dir/",
        # Non-init.sh scripts
        "./setup.sh",
        "./malicious.sh",
        "bash script.sh",
    ]

    for cmd in dangerous:
        if test_hook(cmd, should_block=True):
            passed += 1
        else:
            failed += 1

    # Commands that SHOULD be allowed
    print("\nCommands that should be ALLOWED:\n")
    safe = [
        # File inspection
        "ls -la",
        "cat README.md",
        "head -100 file.txt",
        "tail -20 log.txt",
        "wc -l file.txt",
        "grep -r pattern src/",
        # File operations
        "cp file1.txt file2.txt",
        "mkdir newdir",
        "mkdir -p path/to/dir",
        # Directory
        "pwd",
        # Node.js development
        "npm install",
        "npm run build",
        "node server.js",
        # Version control
        "git status",
        "git commit -m 'test'",
        "git add . && git commit -m 'msg'",
        # Process management
        "ps aux",
        "lsof -i :3000",
        "sleep 2",
        # Allowed pkill patterns for dev servers
        "pkill node",
        "pkill npm",
        "pkill -f node",
        "pkill -f 'node server.js'",
        "pkill vite",
        # Chained commands
        "npm install && npm run build",
        "ls | grep test",
        # Full paths
        "/usr/local/bin/node app.js",
        # chmod +x (allowed)
        "chmod +x init.sh",
        "chmod +x script.sh",
        "chmod u+x init.sh",
        "chmod a+x init.sh",
        # init.sh execution (allowed)
        "./init.sh",
        "./init.sh --production",
        "/path/to/init.sh",
        # Combined chmod and init.sh
        "chmod +x init.sh && ./init.sh",
    ]

    for cmd in safe:
        if test_hook(cmd, should_block=False):
            passed += 1
        else:
            failed += 1

    # Summary
    print("\n" + "-" * 70)
    print(f" Results: {passed} passed, {failed} failed")
    print("-" * 70)

    if failed == 0:
        print("\n ALL TESTS PASSED")
        return 0
    else:
        print(f"\n {failed} TEST(S) FAILED")
        return 1


if __name__ == "__main__":
    sys.exit(main())