---
title: "AI-Generated Testing: Why Most Approaches Fail"
description: How Playwright-Utils, TEA workflows, and Playwright MCPs solve AI test quality problems
---

AI-generated tests frequently fail in practice because they lack systematic quality standards. This document explains the problem and presents a solution that combines three components: Playwright-Utils, TEA (Test Architect), and Playwright MCPs.

:::note[Source]
This article is adapted from *The Testing Meta Most Teams Have Not Caught Up To Yet* by Murat K Ozcan.
:::

## The Problem with AI-Generated Tests

When teams use AI to generate tests without structure, they often produce what can be called "slop factory" outputs:

| Issue | Description |
| --- | --- |
| Redundant coverage | Multiple tests covering the same functionality |
| Incorrect assertions | Tests that pass but don't actually verify behavior |
| Flaky tests | Non-deterministic tests that randomly pass or fail |
| Unreviewable diffs | Generated code too verbose or inconsistent to review |

The core problem is that prompt-driven test generation leans into nondeterminism, undermining the very determinism that testing exists to protect.

:::caution[The Paradox]
AI excels at generating code quickly, but testing requires precision and consistency. Without guardrails, AI-generated tests amplify the chaos they're meant to prevent.
:::
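
To make those failure modes concrete, here is a hypothetical example of "slop factory" output (the route and element names are invented for illustration): the test runs green, yet none of its assertions verify the behavior it claims to cover.

```ts
import { test, expect } from '@playwright/test';

// Hypothetical AI-generated "slop" test: it passes, but proves nothing.
test('user can submit the form', async ({ page }) => {
  await page.goto('/checkout'); // assumes a baseURL is configured

  // Incorrect assertion: a Locator object is always truthy, so this
  // passes even if no button exists on the page.
  expect(page.locator('button')).toBeTruthy();

  // Flakiness: a fixed sleep instead of waiting for an observable state change.
  await page.waitForTimeout(3000);

  // Tautological assertion: always true, verifies no behavior at all.
  expect(true).toBe(true);
});
```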

## The Solution: A Three-Part Stack

The solution combines three components that work together to enforce quality:

### Playwright-Utils

Playwright-Utils bridges the gap between Cypress ergonomics and Playwright's capabilities by standardizing commonly reinvented primitives as utility functions.

| Utility | Purpose |
| --- | --- |
| `api-request` | API calls with schema validation |
| `auth-session` | Authentication handling |
| `intercept-network-call` | Network mocking and interception |
| `recurse` | Retry logic and polling |
| `log` | Structured logging |
| `network-recorder` | Record and replay network traffic |
| `burn-in` | Smart test selection for CI |
| `network-error-monitor` | HTTP error detection |
| `file-utils` | CSV/PDF handling |

These utilities eliminate the need to reinvent authentication, API calls, retries, and logging for every project.
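
As a rough sketch of what these utilities standardize (this is not the library's actual API; the helper names, signatures, and the `/api/orders` endpoint are assumptions), the example below hand-rolls the kind of polling and schema-validated request helpers that `recurse` and `api-request` are described as replacing.

```ts
import { test, expect, APIRequestContext } from '@playwright/test';
import { z } from 'zod';

// The polling helper teams reinvent per project: retry an async check
// until its result satisfies a predicate or the attempts run out.
async function pollUntil<T>(
  fn: () => Promise<T>,
  predicate: (value: T) => boolean,
  { attempts = 10, delayMs = 500 } = {},
): Promise<T> {
  let last!: T;
  for (let i = 0; i < attempts; i++) {
    last = await fn();
    if (predicate(last)) return last;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error(`Condition not met after ${attempts} attempts: ${JSON.stringify(last)}`);
}

// Schema for the hypothetical order endpoint used below.
const OrderSchema = z.object({ id: z.string(), status: z.string() });

// API call plus schema validation in one place, instead of ad hoc per test.
async function getOrder(request: APIRequestContext, id: string) {
  const res = await request.get(`/api/orders/${id}`); // assumes a baseURL is configured
  expect(res.ok()).toBeTruthy();
  return OrderSchema.parse(await res.json());
}

test('order eventually completes', async ({ request }) => {
  const order = await pollUntil(
    () => getOrder(request, 'order-123'),
    (o) => o.status === 'completed',
  );
  expect(order.status).toBe('completed');
});
```

Centralizing these primitives in one package gives every project the same retry semantics and validation behavior instead of subtly different local copies.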

### TEA (Test Architect Agent)

TEA is a quality operating model packaged as eight executable workflows spanning test design, CI/CD gates, and release readiness. It encodes test architecture expertise into repeatable processes.

| Workflow | Purpose |
| --- | --- |
| `*test-design` | Risk-based test planning per epic |
| `*framework` | Scaffold production-ready test infrastructure |
| `*ci` | CI pipeline with selective testing |
| `*atdd` | Acceptance test-driven development |
| `*automate` | Prioritized test automation |
| `*test-review` | Test quality audits (0-100 score) |
| `*nfr-assess` | Non-functional requirements assessment |
| `*trace` | Coverage traceability and gate decisions |

:::tip[Key Insight]
TEA doesn't just generate tests; it provides a complete quality operating model with workflows for planning, execution, and release gates.
:::

### Playwright MCPs

Model Context Protocol (MCP) servers for Playwright enable real-time verification during test generation. Instead of inferring selectors and behavior from documentation, MCPs allow agents to:

- Run flows and confirm the DOM against the accessibility tree
- Validate network responses in real-time
- Discover actual functionality through interactive exploration
- Verify generated tests against live applications
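
The end result is a test grounded in the application's real accessibility tree and network traffic rather than guessed selectors. A minimal sketch of that outcome (the roles, labels, and `/api/search` route are hypothetical):

```ts
import { test, expect } from '@playwright/test';

test('search returns results', async ({ page }) => {
  await page.goto('/'); // assumes a baseURL is configured

  // Role-based locators mirror the accessibility tree an MCP-driven agent
  // can inspect on the live app, instead of guessed CSS selectors.
  await page.getByRole('searchbox', { name: 'Search products' }).fill('laptop');

  // Validate the network response the interaction actually triggers.
  const responsePromise = page.waitForResponse(
    (res) => res.url().includes('/api/search') && res.status() === 200,
  );
  await page.getByRole('button', { name: 'Search' }).click();
  await responsePromise;

  // Assert on observable, user-facing behavior.
  await expect(page.getByRole('listitem').first()).toContainText('laptop');
});
```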

## How They Work Together

The three components form a quality pipeline:

| Stage | Component | Action |
| --- | --- | --- |
| Standards | Playwright-Utils | Provides production-ready patterns and utilities |
| Process | TEA Workflows | Enforces systematic test planning and review |
| Verification | Playwright MCPs | Validates generated tests against live applications |

**Before (AI-only):** 20 tests with redundant coverage, incorrect assertions, and flaky behavior.

**After (Full Stack):** Risk-based selection, verified selectors, validated behavior, reviewable code.

## Why This Matters

Traditional AI testing approaches fail because they:

- **Lack quality standards**: No consistent patterns or utilities
- **Skip planning**: Jump straight to test generation without risk assessment
- **Can't verify**: Generate tests without validating against actual behavior
- **Don't review**: No systematic audit of generated test quality

The three-part stack addresses each gap:

| Gap | Solution |
| --- | --- |
| No standards | Playwright-Utils provides production-ready patterns |
| No planning | TEA `*test-design` workflow creates risk-based test plans |
| No verification | Playwright MCPs validate against live applications |
| No review | TEA `*test-review` audits quality with scoring |

This approach is sometimes called *context engineering*: loading domain-specific standards into the AI's context automatically rather than relying on prompts alone. TEA's `tea-index.csv` manifest loads the relevant knowledge fragments so the AI doesn't have to relearn testing patterns each session.
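
As an illustrative sketch of the context-engineering idea (not TEA's actual implementation; the manifest columns, fragment paths, and workflow names below are assumptions), a loader could select only the fragments relevant to the active workflow and inject them into the session context:

```ts
import { readFileSync } from 'node:fs';

// Hypothetical manifest row: which knowledge fragment applies to which workflow.
interface FragmentEntry {
  workflow: string; // e.g. "test-design" (assumed column)
  path: string;     // e.g. "fragments/risk-based-planning.md" (assumed column)
}

// Naive parse of a two-column CSV manifest; the real manifest format may differ.
function loadManifest(csvPath: string): FragmentEntry[] {
  return readFileSync(csvPath, 'utf8')
    .trim()
    .split('\n')
    .slice(1) // skip the header row
    .map((line) => {
      const [workflow, path] = line.split(',').map((cell) => cell.trim());
      return { workflow, path };
    });
}

// Assemble the context for one workflow instead of re-teaching patterns per session.
function buildContext(manifest: FragmentEntry[], workflow: string): string {
  return manifest
    .filter((entry) => entry.workflow === workflow)
    .map((entry) => readFileSync(entry.path, 'utf8'))
    .join('\n\n');
}

const context = buildContext(loadManifest('tea-index.csv'), 'test-design');
console.log(`Loaded ${context.length} characters of testing standards`);
```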