claude-task-master/autonomous-tdd-git-workflow.md at bed63fa99a248d7aa259ada52dee56029cc7520a

Ralph Khreish a50e654e7b Phase 0: TDD Autopilot Dry-Run Foundation (#1282)

Co-authored-by: Claude <noreply@anthropic.com>

2025-10-16 22:32:21 +02:00

Summary

Put the existing git and test workflows on rails: a repeatable, automated process that can run autonomously, with guardrails and a compact TUI for visibility.
Flow: for a selected task, create a branch named with the tag + task id → generate tests for the first subtask (red) using the Surgical Test Generator → implement code (green) → verify tests → commit → repeat per subtask → final verify → push → open PR against the default branch.
Build on existing rules: .cursor/rules/git_workflow.mdc, .cursor/rules/test_workflow.mdc, .claude/agents/surgical-test-generator.md, and existing CLI/core services.

Goals

Deterministic, resumable automation to execute the TDD loop per subtask with minimal human intervention.
Strong guardrails: never commit to the default branch; only commit when tests pass; enforce status transitions; persist logs/state for debuggability.
Visibility: a compact terminal UI (like lazygit) to pick tag, view tasks, and start work; right-side pane opens an executor terminal (via tmux) for agent coding.
Extensible: framework-agnostic test generation via the Surgical Test Generator; detect and use the repo’s test command for execution with coverage thresholds.
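
The two commit guardrails above (never commit to the default branch, commit only when tests pass) can be expressed as one pre-commit check. This is a minimal sketch under stated assumptions, not Task Master's implementation: `CommitContext`, `GuardResult`, and `checkCommitAllowed` are hypothetical names introduced for illustration.

```typescript
// Hypothetical types and names; not part of the Task Master codebase.
interface CommitContext {
  currentBranch: string;
  defaultBranch: string;
  testsPassed: boolean;
}

interface GuardResult {
  allowed: boolean;
  reason: string;
}

// Refuse to commit on the default branch or while any test is failing.
function checkCommitAllowed(ctx: CommitContext): GuardResult {
  if (ctx.currentBranch === ctx.defaultBranch) {
    return {
      allowed: false,
      reason: `refusing to commit on default branch "${ctx.defaultBranch}"`,
    };
  }
  if (!ctx.testsPassed) {
    return { allowed: false, reason: "tests are not green" };
  }
  return { allowed: true, reason: "ok" };
}
```

Returning a reason alongside the verdict lets the caller write it to the run log, which fits the goal of persisting state for debuggability.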

Non‑Goals (initial)

Full multi-language runner parity beyond detection and executing the project’s test command.
Complex GUI; start with CLI/TUI + tmux pane. IDE/extension can hook into the same state later.
Rich executor selection UX (codex/gemini/claude): we’ll prompt per run; defaults can come later.

Success Criteria

One command can autonomously complete a task's subtasks via TDD and open a PR when done.
All commits made on a branch that includes the tag and task id (see Branch Naming); no commits to the default branch directly.
Every subtask iteration: failing tests added first (red), then code added to pass them (green), commit only after green.
End-to-end logs + artifacts stored in .taskmaster/reports/runs//.
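
The red, green, commit ordering required above can be enforced with a small transition function. The sketch below is illustrative only: the phase names and `nextTddPhase` are assumptions, and it escalates by throwing when the order is violated, where a real run would likely pause and record state.

```typescript
// Hypothetical phase names for one subtask iteration.
type TddPhase = "red" | "green" | "commit" | "done";

// Test outcome observed after the current phase ran.
type TestOutcome = "failing" | "passing";

// Returns the next phase, or throws when the TDD order would be violated.
function nextTddPhase(phase: TddPhase, outcome: TestOutcome): TddPhase {
  if (phase === "red") {
    // Newly generated tests must fail before any implementation is written.
    if (outcome !== "failing") throw new Error("red phase expects failing tests");
    return "green";
  }
  if (phase === "green") {
    // Stay in green until the implementation makes the tests pass.
    return outcome === "passing" ? "commit" : "green";
  }
  if (phase === "commit") {
    // Commit only after green.
    if (outcome !== "passing") throw new Error("commit requires passing tests");
    return "done";
  }
  return "done";
}
```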

Success Metrics (Phase 1)

Adoption: 80% of tasks in a pilot repo completed via tm autopilot
Safety: 0 commits to default branch; 100% of commits have green tests
Efficiency: Average time from task start to PR < 30 min for simple subtasks
Reliability: < 5% of runs require manual intervention (timeout/conflicts)
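
The safety, efficiency, and reliability thresholds above could be checked against per-run summaries. The following is a rough sketch: the `RunSummary` fields are assumptions about what a run report might record, and adoption is left out because it needs task counts from outside the run data.

```typescript
// Hypothetical per-run summary; every field name here is an assumption.
interface RunSummary {
  minutesToPr: number;
  commitsTotal: number;
  commitsWithGreenTests: number;
  commitsOnDefaultBranch: number;
  neededManualIntervention: boolean;
}

interface Phase1MetricsReport {
  safetyOk: boolean;
  efficiencyOk: boolean;
  reliabilityOk: boolean;
}

// Applies the Phase 1 thresholds: 0 default-branch commits and 100% green
// commits, average time to PR under 30 minutes, under 5% manual interventions.
function evaluatePhase1Metrics(runs: RunSummary[]): Phase1MetricsReport {
  if (runs.length === 0) throw new Error("no runs to evaluate");
  const total = (pick: (r: RunSummary) => number): number =>
    runs.reduce((acc, r) => acc + pick(r), 0);
  const manualRuns = runs.filter((r) => r.neededManualIntervention).length;
  return {
    safetyOk:
      total((r) => r.commitsOnDefaultBranch) === 0 &&
      total((r) => r.commitsWithGreenTests) === total((r) => r.commitsTotal),
    efficiencyOk: total((r) => r.minutesToPr) / runs.length < 30,
    reliabilityOk: manualRuns / runs.length < 0.05,
  };
}
```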

User Stories

As a developer, I can run tm autopilot and watch a structured, safe workflow execute.
As a reviewer, I can inspect commits per subtask, and a PR summarizing the work when the task completes.
As an operator, I can see the current step, active subtask, test status, and logs in a compact CLI view and read a final run report.

Example Workflow Traces

Happy Path: Complete a 3-subtask feature

# Developer starts
$ tm autopilot 42
→ Checks preflight: ✓ clean tree, ✓ npm test detected
→ Creates branch: analytics/task-42-user-metrics
→ Subtask 42.1: "Add metrics schema"
  RED: generates test_metrics_schema.test.js → 3 failures
  GREEN: implements schema.js → all pass
  COMMIT: "feat(metrics): add metrics schema (task 42.1)"
→ Subtask 42.2: "Add collection endpoint"
  RED: generates test_metrics_endpoint.test.js → 5 failures
  GREEN: implements api/metrics.js → all pass
  COMMIT: "feat(metrics): add collection endpoint (task 42.2)"
→ Subtask 42.3: "Add dashboard widget"
  RED: generates test_metrics_widget.test.js → 4 failures
  GREEN: implements components/MetricsWidget.jsx → all pass
  COMMIT: "feat(metrics): add dashboard widget (task 42.3)"
→ Final: all 3 subtasks complete
  ✓ Run full test suite → all pass
  ✓ Coverage check → 85% (meets 80% threshold)
  PUSH: confirms with user → pushed to origin
  PR: opens #123 "Task #42 [analytics]: User metrics tracking"

✓ Task 42 complete. PR: https://github.com/org/repo/pull/123
  Run report: .taskmaster/reports/runs/2025-01-15-142033/
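
The commit subjects in this trace share one shape: a conventional-commit type and scope, the subtask title in lowercase-initial form, and the subtask id in parentheses. A small formatter along these lines could produce them. The function name and parameters are hypothetical, and the pattern is inferred from the trace only.

```typescript
// Builds a subject in the shape shown in the trace, e.g.
//   feat(metrics): add metrics schema (task 42.1)
// The name and parameters are illustrative, not an existing Task Master API.
function formatSubtaskCommit(
  type: string,
  scope: string,
  title: string,
  subtaskId: string,
): string {
  const trimmed = title.trim();
  // Lowercase the first letter to match the subjects in the trace.
  const subject = trimmed.charAt(0).toLowerCase() + trimmed.slice(1);
  return `${type}(${scope}): ${subject} (task ${subtaskId})`;
}
```

For example, `formatSubtaskCommit("feat", "metrics", "Add metrics schema", "42.1")` returns `feat(metrics): add metrics schema (task 42.1)`.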

Error Recovery: Failing tests timeout

$ tm autopilot 42
→ Subtask 42.2 GREEN phase: attempt 1 fails (2 tests still red)
→ Subtask 42.2 GREEN phase: attempt 2 fails (1 test still red)
→ Subtask 42.2 GREEN phase: attempt 3 fails (1 test still red)

⚠️  Paused: Could not achieve green state after 3 attempts
📋 State saved to: .taskmaster/reports/runs/2025-01-15-142033/
    Last error: "POST /api/metrics returns 500 instead of 201"

Next steps:
  - Review diff: git diff HEAD
  - Inspect logs: cat .taskmaster/reports/runs/2025-01-15-142033/log.jsonl
  - Check test output: cat .taskmaster/reports/runs/2025-01-15-142033/test-results/subtask-42.2-green-attempt3.json
  - Resume after manual fix: tm autopilot --resume

# Developer manually fixes the issue, then:
$ tm autopilot --resume
→ Resuming subtask 42.2 GREEN phase
  GREEN: all tests pass
  COMMIT: "feat(metrics): add collection endpoint (task 42.2)"
→ Continuing to subtask 42.3...
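
The pause after three failed GREEN attempts suggests a bounded retry loop. Here is a minimal synchronous sketch, assuming a hypothetical `attempt` callback that stands in for "ask the executor to implement, then run the tests". A real implementation would likely be asynchronous and would persist state before pausing.

```typescript
// Hypothetical result of one GREEN attempt.
interface AttemptResult {
  failingTests: number;
  lastError: string | undefined;
}

type GreenOutcome =
  | { state: "green"; attempts: number }
  | { state: "paused"; attempts: number; lastError: string | undefined };

// Runs `attempt` up to `maxAttempts` times. Returns "green" as soon as no
// test fails; otherwise returns "paused" with the last recorded error so the
// caller can save state and wait for `--resume`.
function runGreenPhase(
  attempt: (attemptNumber: number) => AttemptResult,
  maxAttempts: number = 3,
): GreenOutcome {
  let lastError: string | undefined = undefined;
  for (let n = 1; n <= maxAttempts; n++) {
    const result = attempt(n);
    if (result.failingTests === 0) {
      return { state: "green", attempts: n };
    }
    lastError = result.lastError;
  }
  return { state: "paused", attempts: maxAttempts, lastError };
}
```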

Dry Run: Preview before execution

$ tm autopilot 42 --dry-run
Autopilot Plan for Task #42 [analytics]: User metrics tracking
─────────────────────────────────────────────────────────────
Preflight:
  ✓ Working tree is clean
  ✓ Test command detected: npm test
  ✓ Tools available: git, gh, node, npm
  ✓ Current branch: main (will create new branch)

Branch & Tag:
  → Create branch: analytics/task-42-user-metrics
  → Set active tag: analytics

Subtasks (3 pending):
  1. 42.1: Add metrics schema
     - RED: generate tests in src/__tests__/schema.test.js
     - GREEN: implement src/schema.js
     - COMMIT: "feat(metrics): add metrics schema (task 42.1)"

  2. 42.2: Add collection endpoint [depends on 42.1]
     - RED: generate tests in src/api/__tests__/metrics.test.js
     - GREEN: implement src/api/metrics.js
     - COMMIT: "feat(metrics): add collection endpoint (task 42.2)"

  3. 42.3: Add dashboard widget [depends on 42.2]
     - RED: generate tests in src/components/__tests__/MetricsWidget.test.jsx
     - GREEN: implement src/components/MetricsWidget.jsx
     - COMMIT: "feat(metrics): add dashboard widget (task 42.3)"

Finalization:
  → Run full test suite with coverage
  → Push branch to origin (will confirm)
  → Create PR targeting main

Run without --dry-run to execute.
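
One common way to support a preview like this is to model each action as a description plus a side effect, and to skip the side effect in dry-run mode. The sketch below illustrates that idea only. `PlannedStep` and `executeSteps` are hypothetical names, and this is not a description of how `--dry-run` is implemented.

```typescript
// Hypothetical step shape: what the step does, and the side effect itself.
interface PlannedStep {
  description: string;
  run: () => void;
}

// In dry-run mode only the descriptions are collected; otherwise each step
// is executed in order and marked as done in the returned log.
function executeSteps(steps: PlannedStep[], dryRun: boolean): string[] {
  const log: string[] = [];
  for (const step of steps) {
    if (dryRun) {
      log.push(`→ ${step.description}`);
    } else {
      step.run();
      log.push(`✓ ${step.description}`);
    }
  }
  return log;
}
```

Because both modes walk the same list of steps, the preview cannot drift from what a real run would do.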

High‑Level Workflow

Pre‑flight
- Verify clean working tree or confirm staging/commit policy (configurable).
- Detect repo type and the project’s test command (e.g., npm test, pnpm test, pytest, go test).
- Validate tools: git, gh (optional for PR), node/npm, and (if used) claude CLI.
- Load TaskMaster state and selected task; if no subtasks exist, automatically run “expand” before working.
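
Test-command detection in pre-flight could start from marker files in the repository root. The mapping below is an assumption based on common conventions, not Task Master's detection logic. `detectTestCommand` is a hypothetical name, and `go test ./...` is used as the usual way to run all Go packages.

```typescript
// Marker files checked in order; the first match wins. pnpm-lock.yaml comes
// before package.json because pnpm projects contain both.
const TEST_COMMAND_MARKERS: Array<[string, string]> = [
  ["pnpm-lock.yaml", "pnpm test"],
  ["package.json", "npm test"],
  ["pytest.ini", "pytest"],
  ["pyproject.toml", "pytest"],
  ["go.mod", "go test ./..."],
];

// `rootFiles` is the list of file names found in the repository root.
// Returns null when nothing matches, so the caller can ask the user.
function detectTestCommand(rootFiles: string[]): string | null {
  for (const [marker, command] of TEST_COMMAND_MARKERS) {
    if (rootFiles.indexOf(marker) !== -1) {
      return command;
    }
  }
  return null;
}
```
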
Branch & Tag Setup
- Checkout default branch and update (optional), then create a branch using Branch Naming (below).
- Map branch ↔ tag via existing tag management; explicitly set active tag to the branch’s tag.
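
The example branch `analytics/task-42-user-metrics` suggests a `tag/task-<id>-<slug>` shape. The sketch below builds such a name. The pattern is inferred from that one example, `slugifyForBranch` and `buildBranchName` are hypothetical helpers, and the Branch Naming section remains the authority on the actual rules. The slug is passed in as a short title because the example branch uses "user-metrics" rather than the full task title.

```typescript
// Lowercase, collapse runs of non-alphanumerics into "-", trim stray dashes.
function slugifyForBranch(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Pattern inferred from the example "analytics/task-42-user-metrics".
function buildBranchName(tag: string, taskId: number, shortTitle: string): string {
  return `${slugifyForBranch(tag)}/task-${taskId}-${slugifyForBranch(shortTitle)}`;
}
```

For example, `buildBranchName("analytics", 42, "User metrics")` returns `analytics/task-42-user-metrics`.
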
Subtask Loop (for each pending/in-progress subtask in dependency order)
- Select next eligible subtask using tm-core TaskService getNextTask() and subtask eligibility logic.
- Red: generate or update failing tests for the subtask
  - Use the Surgical Test Generator system prompt (.claude/agents/surgical-test-generator.md) to produce high-signal tests following project conventions.
  - Run tests to confirm red; record results. If not red (already passing), skip to next subtask or escalate.
- Green: implement code to pass tests
  - Use executor to implement changes (initial: claude CLI prompt with focused context).
  - Re-run tests until green or timeout/backoff policy triggers.
- Commit: when green
  - Commit tests + code with conventional commit message. Optionally update subtask status to done.
  - Persist run step metadata/logs.
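
Selecting the next eligible subtask in dependency order can be sketched as a filter over subtask records. This is not the logic of tm-core's `getNextTask()`. The `SubtaskInfo` shape, the status values, and the preference for in-progress subtasks (so that an interrupted run picks up where it stopped) are assumptions of this example.

```typescript
// Hypothetical status values and record shape.
type SubtaskStatus = "pending" | "in-progress" | "done";

interface SubtaskInfo {
  id: string;
  status: SubtaskStatus;
  dependencies: string[];
}

// Returns the first unfinished subtask whose dependencies are all done,
// preferring one that is already in progress; null when none is eligible.
function pickNextSubtask(subtasks: SubtaskInfo[]): SubtaskInfo | null {
  const doneIds = subtasks.filter((s) => s.status === "done").map((s) => s.id);
  const eligible = subtasks.filter(
    (s) =>
      s.status !== "done" &&
      s.dependencies.every((dep) => doneIds.indexOf(dep) !== -1),
  );
  const inProgress = eligible.filter((s) => s.status === "in-progress");
  const chosen = inProgress.length > 0 ? inProgress[0] : eligible[0];
  return chosen === undefined ? null : chosen;
}
```
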
Finalization
- Run full test suite and coverage (if configured); optionally lint/format.
- Commit any final adjustments.
- Push branch (ask user to confirm); create PR (via gh pr create) targeting the default branch. Title format: Task # []:
