Brian Madison 8788c1d20f checklist standardization and improvement with llm eliciatation

2025-06-08 20:34:07 -05:00

Infrastructure Change Validation Checklist

This checklist serves as a comprehensive framework for validating infrastructure changes before deployment to production. The DevOps/Platform Engineer should systematically work through each item, ensuring the infrastructure is secure, compliant, resilient, and properly implemented according to organizational standards.

[[LLM: INITIALIZATION INSTRUCTIONS - INFRASTRUCTURE VALIDATION

Before proceeding with this checklist, ensure you have access to:

platform-architecture.md or infrastructure-architecture.md (check docs/platform-architecture.md)
Infrastructure as Code files (Terraform, CloudFormation, Bicep, etc.)
CI/CD pipeline configurations
Security and compliance requirements
Network diagrams and configurations
Monitoring and alerting specifications

IMPORTANT: Infrastructure failures can cause complete outages. This checklist must be thorough.

VALIDATION PRINCIPLES:

Security First - Every decision should consider security implications
Automation - Manual processes are error-prone and don't scale
Resilience - Assume everything will fail and plan accordingly
Compliance - Regulatory requirements are non-negotiable
Cost Awareness - Over-provisioning wastes money, under-provisioning causes outages

EXECUTION MODE: Ask the user if they want to work through the checklist:

Section by section (interactive mode) - Deep dive into each area
All at once (comprehensive mode) - Complete analysis with summary report

REMEMBER: Production infrastructure supports real users and business operations. Mistakes here have immediate, visible impact.]]

1. SECURITY & COMPLIANCE

[[LLM: Security breaches destroy trust and businesses. For each item:

Verify implementation, not just documentation
Check for common vulnerabilities (default passwords, open ports, etc.)
Ensure compliance requirements are actually met, not just considered
Look for defense in depth - multiple layers of security
Consider the blast radius if this security control fails]]

1.1 Access Management

RBAC principles applied with least privilege access
Service accounts have minimal required permissions
Secrets management solution properly implemented
IAM policies and roles documented and reviewed
Access audit mechanisms configured
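
One way to spot-check the least-privilege items above is to scan policy documents for wildcard grants. The sketch below assumes an AWS-style JSON policy shape (Statement, Effect, Action, Resource) purely for illustration; the function name find_wildcard_statements is hypothetical, and other platforms will need a different parser.

```python
from typing import Any


def find_wildcard_statements(policy: dict[str, Any]) -> list[dict[str, Any]]:
    """Return Allow statements that grant a wildcard action or resource.

    Assumes an AWS-style policy document; this shape is an illustration,
    not something the checklist prescribes.
    """
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear unwrapped
        statements = [statements]

    flagged = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        # Both fields may be a single string or a list of strings.
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        broad_action = any(a == "*" or a.endswith(":*") for a in actions)
        broad_resource = "*" in resources  # exact "*" only, not path globs
        if broad_action or broad_resource:
            flagged.append(stmt)
    return flagged
```

A non-empty result is a prompt for review rather than an automatic failure; some wildcard grants are legitimate but should be justified in the IAM documentation.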

1.2 Data Protection

Data at rest encryption enabled for all applicable services
Data in transit encryption (TLS 1.2+) enforced
Sensitive data identified and protected appropriately
Backup encryption configured where required
Data access audit trails implemented where required
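
For client code written in Python, the TLS 1.2+ requirement above can be enforced with the standard ssl module. This is a minimal sketch; the helper name make_tls12_client_context is hypothetical, and server-side or load-balancer enforcement has to be configured separately on the platform.

```python
import ssl


def make_tls12_client_context() -> ssl.SSLContext:
    """Build a client context that refuses protocol versions below TLS 1.2."""
    # create_default_context() enables certificate and hostname verification.
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx
```

The returned context can be passed to standard-library clients that accept one, for example urllib.request.urlopen(url, context=ctx).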

1.3 Network Security

Network security groups configured with minimal required access
Private endpoints used for PaaS services where available
Public-facing services protected with WAF policies
Network traffic flows documented and secured
Network segmentation properly implemented

1.4 Compliance Requirements

Regulatory compliance requirements verified and met
Security scanning integrated into pipeline
Compliance evidence collection automated where possible
Privacy requirements addressed in infrastructure design
Security monitoring and alerting enabled

2. INFRASTRUCTURE AS CODE

[[LLM: IaC prevents configuration drift and enables disaster recovery. Verify:

EVERYTHING is in code - no "just this once" manual changes
Code quality matches application code standards
State management won't cause conflicts or data loss
Changes can be rolled back safely
New team members can understand and modify the infrastructure]]

2.1 IaC Implementation

All resources defined in IaC (Terraform/Bicep/ARM)
IaC code follows organizational standards and best practices
No manual configuration changes permitted
Dependencies explicitly defined and documented
Modules and resource naming follow conventions

2.2 IaC Quality & Management

IaC code reviewed by at least one other engineer
State files securely stored and backed up
Version control best practices followed
IaC changes tested in non-production environment
Documentation for IaC updated

2.3 Resource Organization

Resources organized in appropriate resource groups
Tags applied consistently per tagging strategy
Resource locks applied where appropriate
Naming conventions followed consistently
Resource dependencies explicitly managed
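
Tag consistency is easy to check mechanically once resources are exported as data. The sketch below assumes each resource is a dictionary with a name and an optional tags mapping; the required tag set and the function name missing_tags are hypothetical and should be replaced by the organization's own tagging strategy.

```python
# Hypothetical required tags; substitute the organization's tagging strategy.
REQUIRED_TAGS = {"owner", "environment", "cost-center"}


def missing_tags(resources: list[dict]) -> dict[str, set[str]]:
    """Map each resource name to the required tags it lacks."""
    report: dict[str, set[str]] = {}
    for res in resources:
        # Tag keys are compared case-insensitively; "tags" may be absent or None.
        present = {key.lower() for key in (res.get("tags") or {})}
        absent = REQUIRED_TAGS - present
        if absent:
            report[res["name"]] = absent
    return report
```

An empty report means every exported resource carries all required tags.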

3. RESILIENCE & AVAILABILITY

[[LLM: Downtime costs money and reputation. Check:

What happens when each component fails?
Are we meeting our SLA commitments?
Has resilience been tested, not just designed?
Can the system handle expected peak load?
Are failure modes graceful or catastrophic?]]

3.1 High Availability

Resources deployed across appropriate availability zones
SLAs for each component documented and verified
Load balancing configured properly
Failover mechanisms tested and verified
Single points of failure identified and mitigated
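
When verifying component SLAs, it helps to compute what they imply end to end. The sketch below assumes the components sit in series on the request path and fail independently, so their availabilities multiply; real systems with redundancy or correlated failures need a different model. Function names are hypothetical.

```python
import math


def serial_availability(component_slas: list[float]) -> float:
    """Composite availability of independent components in series."""
    return math.prod(component_slas)


def annual_downtime_minutes(availability: float) -> float:
    """Expected downtime per 365-day year for a given availability."""
    return (1.0 - availability) * 365 * 24 * 60


# Example: a gateway, an application tier, and a database, each with its own SLA.
composite = serial_availability([0.9995, 0.999, 0.9999])
print(f"composite availability: {composite:.5f}")
print(f"expected downtime: {annual_downtime_minutes(composite):.0f} min/year")
```

The composite figure is always lower than the weakest individual SLA, which is why each component's commitment needs to be documented rather than only the headline number.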

3.2 Fault Tolerance

Auto-scaling configured where appropriate
Health checks implemented for all services
Circuit breakers implemented where necessary
Retry policies configured for transient failures
Graceful degradation mechanisms implemented
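
A retry policy for transient failures is commonly implemented as exponential backoff with jitter. The following is a minimal sketch, not a prescribed implementation: TransientError is a hypothetical marker exception, and the default attempt count and delays are illustrative values.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying TransientError with capped exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # budget exhausted: surface the failure to the caller
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: a random delay up to the ceiling avoids retry storms.
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

Only errors known to be transient are retried here; retrying every exception can mask real defects and amplify load during an outage.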

3.3 Recovery Metrics & Testing

Recovery time objectives (RTOs) verified
Recovery point objectives (RPOs) verified
Resilience testing completed and documented
Chaos engineering principles applied where appropriate
Recovery procedures documented and tested
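
An RPO can be partially verified from backup history: the worst-case data loss is bounded by the longest gap between consecutive successful backups, including the gap from the latest backup to now. The sketch below assumes a list of completion timestamps is available from the backup system; the function names are hypothetical.

```python
from datetime import datetime, timedelta


def worst_backup_gap(backup_times: list[datetime], now: datetime) -> timedelta:
    """Longest interval without a completed backup, up to the current time."""
    if not backup_times:
        raise ValueError("no backups recorded")
    points = sorted(backup_times) + [now]
    return max(later - earlier for earlier, later in zip(points, points[1:]))


def meets_rpo(backup_times: list[datetime], now: datetime, rpo: timedelta) -> bool:
    """True when no observed gap exceeds the recovery point objective."""
    return worst_backup_gap(backup_times, now) <= rpo
```

This only shows that backups were taken often enough. It says nothing about whether they restore, which still requires the tested recovery procedures listed above.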

4. BACKUP & DISASTER RECOVERY

[[LLM: Backups are worthless if they don't restore. Validate:

Have restores been tested recently?
Do backup windows meet business needs?
Are backups stored in a different failure domain?
Can we meet our RTO/RPO commitments?
Who has tested the disaster recovery runbook?]]

4.1 Backup Strategy

Backup strategy defined and implemented
Backup retention periods aligned with requirements
Backup recovery tested and validated
Point-in-time recovery configured where needed
Backup access controls implemented

4.2 Disaster Recovery

DR plan documented and accessible
DR runbooks created and tested
Cross-region recovery strategy implemented (if required)
Regular DR drills scheduled
Dependencies considered in DR planning

4.3 Recovery Procedures

System state recovery procedures documented
Data recovery procedures documented
Application recovery procedures aligned with infrastructure
Recovery roles and responsibilities defined
Communication plan for recovery scenarios established

5. MONITORING & OBSERVABILITY

[[LLM: You can't fix what you can't see. Ensure:

Every critical metric has monitoring
Alerts fire BEFORE users complain
Logs are searchable and retained appropriately
Dashboards show what actually matters
Someone knows how to interpret the data]]

5.1 Monitoring Implementation

Monitoring coverage for all critical components
Appropriate metrics collected and dashboarded
Log aggregation implemented
Distributed tracing implemented (if applicable)
User experience and synthetic monitoring configured

5.2 Alerting & Response

Alerts configured for critical thresholds
Alert routing and escalation paths defined
Service health integration configured
On-call procedures documented
Incident response playbooks created

5.3 Operational Visibility

Custom queries/dashboards created for key scenarios
Resource utilization tracking configured
Cost monitoring implemented
Performance baselines established
Operational runbooks available for common issues

6. PERFORMANCE & OPTIMIZATION

[[LLM: Performance impacts user experience and costs. Check:

Has performance been tested under realistic load?
Are we over-provisioned (wasting money)?
Are we under-provisioned (risking outages)?
Do we know our breaking point?
Is autoscaling configured correctly?]]

6.1 Performance Testing

Performance testing completed and baseline established
Resource sizing appropriate for workload
Performance bottlenecks identified and addressed
Latency requirements verified
Throughput requirements verified

6.2 Resource Optimization

Cost optimization opportunities identified
Auto-scaling rules validated
Resource reservation used where appropriate
Storage tier selection optimized
Idle/unused resources identified for cleanup

6.3 Efficiency Mechanisms

Caching strategy implemented where appropriate
CDN/edge caching configured for content
Network latency optimized
Database performance tuned
Compute resource efficiency validated

7. OPERATIONS & GOVERNANCE

[[LLM: Good operations prevent 3am emergencies. Verify:

Can a new team member understand the system?
Are runbooks tested and current?
Do we know who owns what?
Are costs tracked and controlled?
Will auditors be satisfied?]]

7.1 Documentation

Change documentation updated
Runbooks created or updated
Architecture diagrams updated
Configuration values documented
Service dependencies mapped and documented

7.2 Governance Controls

Cost controls implemented
Resource quota limits configured
Policy compliance verified
Audit logging enabled
Management access reviewed

7.3 Knowledge Transfer

Cross-team impacts documented and communicated
Required training/knowledge transfer completed
Architectural decision records updated
Post-implementation review scheduled
Operations team handover completed

8. CI/CD & DEPLOYMENT

[[LLM: Deployment failures impact everyone. Ask:

Can we deploy without downtime?
Can we rollback quickly if needed?
Are deployments repeatable and reliable?
Do we test infrastructure changes?
Is the pipeline itself secure?]]

8.1 Pipeline Configuration

CI/CD pipelines configured and tested
Environment promotion strategy defined
Deployment notifications configured
Pipeline security scanning enabled
Artifact management properly configured

8.2 Deployment Strategy

Rollback procedures documented and tested
Zero-downtime deployment strategy implemented
Deployment windows identified and scheduled
Progressive deployment approach used (if applicable)
Feature flags implemented where appropriate

8.3 Verification & Validation

Post-deployment verification tests defined
Smoke tests automated
Configuration validation automated
Integration tests with dependent systems executed
Canary/blue-green deployment configured (if applicable)

9. NETWORKING & CONNECTIVITY

[[LLM: Network issues are hard to debug. Validate:

Is network segmentation appropriate?
Are we exposing more than necessary?
Can traffic flow where it needs to?
Are we protected from common attacks?
Do we have visibility into network issues?]]

9.1 Network Design

VNet/subnet design follows least-privilege principles
Network security group rules audited
Public IP addresses minimized and justified
DNS configuration verified
Network diagram updated and accurate

9.2 Connectivity

VNet peering configured correctly
Service endpoints configured where needed
Private link/private endpoints implemented
External connectivity requirements verified
Load balancer configuration verified

9.3 Traffic Management

Inbound/outbound traffic flows documented
Firewall rules reviewed and minimized
Traffic routing optimized
Network monitoring configured
DDoS protection implemented where needed
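
Part of reviewing and minimizing firewall rules can be automated by flagging inbound rules that are open to the whole internet. The sketch below assumes rules have been exported as dictionaries with name, source (CIDR), and port keys, and that only port 443 is meant to be public; both the rule shape and the allow-list are assumptions for illustration.

```python
import ipaddress


def overly_open_rules(
    rules: list[dict],
    allowed_public_ports: frozenset[int] = frozenset({443}),
) -> list[dict]:
    """Return inbound rules that admit any source on a non-public port."""
    flagged = []
    for rule in rules:
        network = ipaddress.ip_network(rule["source"], strict=False)
        # A prefix length of 0 covers 0.0.0.0/0 and ::/0, i.e. every address.
        if network.prefixlen == 0 and rule["port"] not in allowed_public_ports:
            flagged.append(rule)
    return flagged
```

Each flagged rule should either be narrowed to known source ranges or carry a documented justification.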

10. COMPLIANCE & DOCUMENTATION

[[LLM: Compliance failures can shut down operations. Ask:

Are we meeting all regulatory requirements?
Can we prove compliance to auditors?
Is our documentation actually useful?
Do teams know about these changes?
Will future engineers understand our decisions?]]

10.1 Compliance Verification

Required compliance evidence collected
Non-functional requirements verified
License compliance verified
Third-party dependencies documented
Security posture reviewed

10.2 Documentation Completeness

All documentation updated
Architecture diagrams updated
Technical debt documented (if any accepted)
Cost estimates updated and approved
Capacity planning documented

10.3 Cross-Team Collaboration

Development team impact assessed and communicated
Operations team handover completed
Security team reviews completed
Business stakeholders informed of changes
Feedback loops established for continuous improvement

11. BMAD WORKFLOW INTEGRATION

[[LLM: Infrastructure must support the BMAD development workflow. Check:

Can all dev agents work with this infrastructure?
Does it align with architecture decisions?
Are product requirements actually met?
Can developers be productive?
Are we creating or removing blockers?]]

11.1 Development Agent Alignment

Infrastructure changes support Frontend Dev (Mira) and Fullstack Dev (Enrique) requirements
Backend requirements from Backend Dev (Lily) and Fullstack Dev (Enrique) accommodated
Local development environment compatibility verified for all dev agents
Infrastructure changes support automated testing frameworks
Development agent feedback incorporated into infrastructure design

11.2 Product Alignment

Infrastructure changes mapped to PRD requirements maintained by Product Owner
Non-functional requirements from PRD verified in implementation
Infrastructure capabilities and limitations communicated to Product teams
Infrastructure release timeline aligned with product roadmap
Technical constraints documented and shared with Product Owner

11.3 Architecture Alignment

Infrastructure implementation validated against architecture documentation
Architecture Decision Records (ADRs) reflected in infrastructure
Technical debt identified by Architect addressed or documented
Infrastructure changes support documented design patterns
Performance requirements from architecture verified in implementation

12. ARCHITECTURE DOCUMENTATION VALIDATION

[[LLM: Good architecture docs prevent repeated mistakes. Verify:

Is the documentation complete and current?
Can someone new understand the system?
Are decisions explained with rationale?
Do diagrams match reality?
Is evolution possible without major rewrites?]]

12.1 Completeness Assessment

All required sections of architecture template completed
Architecture decisions documented with clear rationales
Technical diagrams included for all major components
Integration points with application architecture defined
Non-functional requirements addressed with specific solutions

12.2 Consistency Verification

Architecture aligns with broader system architecture
Terminology used consistently throughout documentation
Component relationships clearly defined
Environment differences explicitly documented
No contradictions between different sections

12.3 Stakeholder Usability

Documentation accessible to both technical and non-technical stakeholders
Complex concepts explained with appropriate analogies or examples
Implementation guidance clear for development teams
Operations considerations explicitly addressed
Future evolution pathways documented

13. CONTAINER PLATFORM VALIDATION

[[LLM: Container platforms are complex and have many failure modes. Ask:

Is the cluster secure by default?
Can it handle expected workload?
Are workloads isolated appropriately?
Do we have visibility into container health?
Can we recover from node failures?]]

13.1 Cluster Configuration & Security

Container orchestration platform properly installed and configured
Cluster nodes configured with appropriate resource allocation and security policies
Control plane high availability and security hardening implemented
API server access controls and authentication mechanisms configured
Cluster networking properly configured with security policies

13.2 RBAC & Access Control

Role-Based Access Control (RBAC) implemented with least privilege principles
Service accounts configured with minimal required permissions
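Pod security standards (Pod Security Admission or an equivalent policy engine; PodSecurityPolicy was removed in Kubernetes 1.25) and security contexts properly configured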
Network policies implemented for micro-segmentation
Secrets management integration configured and validated

13.3 Workload Management & Resource Control

Resource quotas and limits configured per namespace/tenant requirements
Horizontal and vertical pod autoscaling configured and tested
Cluster autoscaling configured for node management
Workload scheduling policies and node affinity rules implemented
Container image security scanning and policy enforcement configured
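
Resource limits can be checked before deployment by inspecting workload manifests. The sketch below assumes a Kubernetes-style pod spec already loaded into a dictionary (containers, each with resources.limits), and assumes team policy requires both CPU and memory limits; the function name is hypothetical.

```python
def containers_missing_limits(pod_spec: dict) -> list[str]:
    """Names of containers that lack a CPU or memory limit."""
    missing = []
    for container in pod_spec.get("containers", []):
        limits = (container.get("resources") or {}).get("limits") or {}
        # Assumed policy: every container declares both limits.
        if "cpu" not in limits or "memory" not in limits:
            missing.append(container["name"])
    return missing
```

Namespace-level quotas still need to be verified on the cluster itself; this check only covers what individual workloads declare.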

13.4 Container Platform Operations

Container platform monitoring and observability configured
Container workload logging aggregation implemented
Platform health checks and performance monitoring operational
Backup and disaster recovery procedures for cluster state configured
Operational runbooks and troubleshooting guides created

14. GITOPS WORKFLOWS VALIDATION

[[LLM: GitOps enables reliable deployments. Validate:

Is everything truly declarative?
Can we audit all changes?
Are environments properly isolated?
Can we rollback quickly?
Is drift detected and corrected?]]

14.1 GitOps Operator & Configuration

GitOps operators properly installed and configured
Application and configuration sync controllers operational
Multi-cluster management configured (if required)
Sync policies, retry mechanisms, and conflict resolution configured
Automated pruning and drift detection operational
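
Drift detection amounts to comparing the state declared in Git with the state observed on the platform. The sketch below illustrates the idea on flat key-value configuration only; real GitOps operators compare full resource manifests and handle defaulted fields, which this does not attempt. The names detect_drift and MISSING are hypothetical.

```python
MISSING = object()  # sentinel for a key that exists on only one side


def detect_drift(desired: dict, actual: dict) -> dict[str, tuple]:
    """Map each differing key to a (desired, actual) pair."""
    drift = {}
    for key in desired.keys() | actual.keys():
        want = desired.get(key, MISSING)
        have = actual.get(key, MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift
```

Keys whose desired value is MISSING exist only in the live environment and are candidates for pruning; keys whose actual value is MISSING have not been applied yet.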

14.2 Repository Structure & Management

Repository structure follows GitOps best practices
Configuration templating and parameterization properly implemented
Environment-specific configuration overlays configured
Configuration validation and policy enforcement implemented
Version control and branching strategies properly defined

14.3 Environment Promotion & Automation

Environment promotion pipelines operational (dev → staging → prod)
Automated testing and validation gates configured
Approval workflows and change management integration implemented
Automated rollback mechanisms configured and tested
Promotion notifications and audit trails operational

14.4 GitOps Security & Compliance

GitOps security best practices and access controls implemented
Policy enforcement for configurations and deployments operational
Secret management integration with GitOps workflows configured
Security scanning for configuration changes implemented
Audit logging and compliance monitoring configured

15. SERVICE MESH VALIDATION

[[LLM: Service meshes add complexity but enable advanced patterns. Check:

Is the overhead justified by benefits?
Is service communication secure?
Can we debug service issues?
Are failure modes handled gracefully?
Do developers understand the mesh?]]

15.1 Service Mesh Architecture & Installation

Service mesh control plane properly installed and configured
Data plane (sidecars/proxies) deployed and configured correctly
Service mesh components integrated with container platform
Service mesh networking and connectivity validated
Resource allocation and performance tuning for mesh components validated

15.2 Traffic Management & Communication

Traffic routing rules and policies configured and tested
Load balancing strategies and failover mechanisms operational
Traffic splitting for canary deployments and A/B testing configured
Circuit breakers and retry policies implemented and validated
Timeout and rate limiting policies configured
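
Service meshes usually provide circuit breaking as configuration, but the behavior to validate is easier to reason about with a small model. The sketch below is an illustrative in-process circuit breaker, not any mesh's implementation: it opens after a number of consecutive failures, rejects calls while open, and allows a trial call once the reset timeout has passed. Class names and default values are hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, operation: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed trial call, or reaching the threshold, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

When validating mesh policies, the same three behaviors are worth testing explicitly: the trip threshold, fast rejection while open, and recovery after the timeout.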

15.3 Service Mesh Security

Mutual TLS (mTLS) implemented for service-to-service communication
Service-to-service authorization policies configured
Identity and access management integration operational
Network security policies and micro-segmentation implemented
Security audit logging for service mesh events configured

15.4 Service Discovery & Observability

Service discovery mechanisms and service registry integration operational
Advanced load balancing algorithms and health checking configured
Service mesh observability (metrics, logs, traces) implemented
Distributed tracing for service communication operational
Service dependency mapping and topology visualization available

16. DEVELOPER EXPERIENCE PLATFORM VALIDATION

[[LLM: Developer productivity depends on platform usability. Ask:

Can developers self-serve effectively?
Are golden paths actually helpful?
Is onboarding smooth and quick?
Do developers have the tools they need?
Are we measuring developer satisfaction?]]

16.1 Self-Service Infrastructure

Self-service provisioning for development environments operational
Automated resource provisioning and management configured
Namespace/project provisioning with proper resource limits implemented
Self-service database and storage provisioning available
Automated cleanup and resource lifecycle management operational

16.2 Developer Tooling & Templates

Golden path templates for common application patterns available and tested
Project scaffolding and boilerplate generation operational
Template versioning and update mechanisms configured
Template customization and parameterization working correctly
Template compliance and security scanning implemented

16.3 Platform APIs & Integration

Platform APIs for infrastructure interaction operational and documented
API authentication and authorization properly configured
API documentation and developer resources available and current
Workflow automation and integration capabilities tested
API rate limiting and usage monitoring configured

16.4 Developer Experience & Documentation

Comprehensive developer onboarding documentation available
Interactive tutorials and getting-started guides functional
Developer environment setup automation operational
Access provisioning and permissions management streamlined
Troubleshooting guides and FAQ resources current and accessible

16.5 Productivity & Analytics

Development tool integrations (IDEs, CLI tools) operational
Developer productivity dashboards and metrics implemented
Development workflow optimization tools available
Platform usage monitoring and analytics configured
User feedback collection and analysis mechanisms operational

FINAL INFRASTRUCTURE VALIDATION

[[LLM: COMPREHENSIVE INFRASTRUCTURE REPORT GENERATION

Generate a detailed infrastructure validation report:

Executive Summary
- Overall readiness for production (GO/NO-GO)
- Critical risks identified
- Security posture assessment
- Compliance status
- Estimated reliability (number of nines of uptime)
Risk Analysis by Category
- CRITICAL: Production blockers
- HIGH: Should fix before production
- MEDIUM: Fix within 30 days
- LOW: Consider for future improvements
Technical Debt Assessment
- Shortcuts taken and their impact
- Future scaling concerns
- Maintenance burden created
- Cost implications
Operational Readiness
- Can the ops team support this?
- Are runbooks complete?
- Is monitoring sufficient?
- Can we meet SLAs?
Security & Compliance Summary
- Security controls effectiveness
- Compliance gaps
- Attack surface analysis
- Data protection status
Platform-Specific Findings
- Container platform readiness
- GitOps maturity
- Service mesh complexity
- Developer experience gaps
Recommendations
- Must-fix before production
- Should-fix for stability
- Consider for optimization
- Future roadmap items

After presenting the report, ask if the user wants:

Deep dive into any failed sections
Risk mitigation strategies
Implementation prioritization help
Specific remediation guidance]]

Prerequisites Verified

All checklist sections reviewed (1-16)
No outstanding critical or high-severity issues
All infrastructure changes tested in non-production environment
Rollback plan documented and tested
Required approvals obtained
Infrastructure changes verified against architectural decisions documented by Architect agent
Development environment impacts identified and mitigated
Infrastructure changes mapped to relevant user stories and epics
Release coordination planned with development teams
Local development environment compatibility verified
Platform component integration validated
Cross-platform functionality tested and verified
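
The GO/NO-GO call in the report can be derived mechanically from the findings. The sketch below follows the prerequisite that no critical or high-severity issues remain outstanding; the Severity enum and the readiness function are hypothetical names, and a real report would carry finding descriptions as well as counts.

```python
from collections import Counter
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"  # production blockers
    HIGH = "high"          # should fix before production
    MEDIUM = "medium"      # fix within 30 days
    LOW = "low"            # consider for future improvements


def readiness(findings: list[Severity]) -> tuple[str, Counter]:
    """Return the verdict and a per-severity count of outstanding findings."""
    counts = Counter(findings)
    # Any outstanding critical or high finding blocks the release.
    blocked = counts[Severity.CRITICAL] > 0 or counts[Severity.HIGH] > 0
    return ("NO-GO" if blocked else "GO", counts)
```

Medium and low findings do not block the verdict here, but they should still appear in the recommendations with owners and target dates.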
