checklist standardization and improvement with llm elicitation

Brian Madison
2025-06-08 20:34:07 -05:00
parent 460c47f5c8
commit 8788c1d20f
24 changed files with 2419 additions and 884 deletions


@@ -2,8 +2,45 @@
This checklist serves as a comprehensive framework for validating infrastructure changes before deployment to production. The DevOps/Platform Engineer should systematically work through each item, ensuring the infrastructure is secure, compliant, resilient, and properly implemented according to organizational standards.
[[LLM: INITIALIZATION INSTRUCTIONS - INFRASTRUCTURE VALIDATION
Before proceeding with this checklist, ensure you have access to:
1. platform-architecture.md or infrastructure-architecture.md (check docs/platform-architecture.md)
2. Infrastructure as Code files (Terraform, CloudFormation, Bicep, etc.)
3. CI/CD pipeline configurations
4. Security and compliance requirements
5. Network diagrams and configurations
6. Monitoring and alerting specifications
IMPORTANT: Infrastructure failures can cause complete outages. This checklist must be thorough.
VALIDATION PRINCIPLES:
1. Security First - Every decision should consider security implications
2. Automation - Manual processes are error-prone and don't scale
3. Resilience - Assume everything will fail and plan accordingly
4. Compliance - Regulatory requirements are non-negotiable
5. Cost Awareness - Over-provisioning wastes money, under-provisioning causes outages
EXECUTION MODE:
Ask the user if they want to work through the checklist:
- Section by section (interactive mode) - Deep dive into each area
- All at once (comprehensive mode) - Complete analysis with summary report
REMEMBER: Production infrastructure supports real users and business operations. Mistakes here have immediate, visible impact.]]
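As a minimal sketch of the first prerequisite above, the following Python helper looks for one of the two architecture documents named in the initialization instructions. The function name `find_architecture_doc` and the choice to search `docs/` before the repository root are assumptions made for illustration; they are not part of the checklist itself.

```python
from pathlib import Path
from typing import Optional

# File names come from the checklist; the search order is an assumption.
ARCHITECTURE_DOC_NAMES = ("platform-architecture.md", "infrastructure-architecture.md")


def find_architecture_doc(repo_root: Path) -> Optional[Path]:
    """Return the first architecture doc found under docs/ or the repo root, else None."""
    for directory in (repo_root / "docs", repo_root):
        for name in ARCHITECTURE_DOC_NAMES:
            candidate = directory / name
            if candidate.is_file():
                return candidate
    return None
```

A caller could run `find_architecture_doc(Path("."))` before starting the checklist and stop early if it returns `None`.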
## 1. SECURITY & COMPLIANCE
[[LLM: Security breaches destroy trust and businesses. For each item:
1. Verify implementation, not just documentation
2. Check for common vulnerabilities (default passwords, open ports, etc.)
3. Ensure compliance requirements are actually met, not just considered
4. Look for defense in depth - multiple layers of security
5. Consider the blast radius if this security control fails]]
### 1.1 Access Management
- [ ] RBAC principles applied with least privilege access
@@ -38,6 +75,14 @@ This checklist serves as a comprehensive framework for validating infrastructure
## 2. INFRASTRUCTURE AS CODE
[[LLM: IaC prevents configuration drift and enables disaster recovery. Verify:
1. EVERYTHING is in code - no "just this once" manual changes
2. Code quality matches application code standards
3. State management won't cause conflicts or data loss
4. Changes can be rolled back safely
5. New team members can understand and modify the infrastructure]]
### 2.1 IaC Implementation
- [ ] All resources defined in IaC (Terraform/Bicep/ARM)
@@ -64,6 +109,14 @@ This checklist serves as a comprehensive framework for validating infrastructure
## 3. RESILIENCE & AVAILABILITY
[[LLM: Downtime costs money and reputation. Check:
1. What happens when each component fails?
2. Are we meeting our SLA commitments?
3. Has resilience been tested, not just designed?
4. Can the system handle expected peak load?
5. Are failure modes graceful or catastrophic?]]
### 3.1 High Availability
- [ ] Resources deployed across appropriate availability zones
@@ -90,6 +143,14 @@ This checklist serves as a comprehensive framework for validating infrastructure
## 4. BACKUP & DISASTER RECOVERY
[[LLM: Backups are worthless if they don't restore. Validate:
1. Have restores been tested recently?
2. Do backup windows meet business needs?
3. Are backups stored in a different failure domain?
4. Can we meet our RTO/RPO commitments?
5. Who has tested the disaster recovery runbook?]]
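The RTO/RPO question above reduces to two comparisons, sketched here in Python. The function names and the idea of feeding in a measured restore duration from a restore test are illustrative assumptions; real targets and timestamps would come from the organization's own backup tooling.

```python
from datetime import datetime, timedelta


def rpo_met(last_backup: datetime, rpo: timedelta, now: datetime) -> bool:
    """True if the newest backup is recent enough that data loss stays within the RPO."""
    return now - last_backup <= rpo


def rto_met(measured_restore: timedelta, rto: timedelta) -> bool:
    """True if a tested restore finished within the recovery time objective."""
    return measured_restore <= rto
```

For example, with a 4-hour RPO, a backup taken 5 hours ago fails `rpo_met`, even if the backup job itself reported success.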
### 4.1 Backup Strategy
- [ ] Backup strategy defined and implemented
@@ -116,6 +177,14 @@ This checklist serves as a comprehensive framework for validating infrastructure
## 5. MONITORING & OBSERVABILITY
[[LLM: You can't fix what you can't see. Ensure:
1. Every critical metric has monitoring
2. Alerts fire BEFORE users complain
3. Logs are searchable and retained appropriately
4. Dashboards show what actually matters
5. Someone knows how to interpret the data]]
### 5.1 Monitoring Implementation
- [ ] Monitoring coverage for all critical components
@@ -142,6 +211,14 @@ This checklist serves as a comprehensive framework for validating infrastructure
## 6. PERFORMANCE & OPTIMIZATION
[[LLM: Performance impacts user experience and costs. Check:
1. Has performance been tested under realistic load?
2. Are we over-provisioned (wasting money)?
3. Are we under-provisioned (risking outages)?
4. Do we know our breaking point?
5. Is autoscaling configured correctly?]]
### 6.1 Performance Testing
- [ ] Performance testing completed and baseline established
@@ -168,6 +245,14 @@ This checklist serves as a comprehensive framework for validating infrastructure
## 7. OPERATIONS & GOVERNANCE
[[LLM: Good operations prevent 3am emergencies. Verify:
1. Can a new team member understand the system?
2. Are runbooks tested and current?
3. Do we know who owns what?
4. Are costs tracked and controlled?
5. Will auditors be satisfied?]]
### 7.1 Documentation
- [ ] Change documentation updated
@@ -194,6 +279,14 @@ This checklist serves as a comprehensive framework for validating infrastructure
## 8. CI/CD & DEPLOYMENT
[[LLM: Deployment failures impact everyone. Check:
1. Can we deploy without downtime?
2. Can we roll back quickly if needed?
3. Are deployments repeatable and reliable?
4. Do we test infrastructure changes?
5. Is the pipeline itself secure?]]
### 8.1 Pipeline Configuration
- [ ] CI/CD pipelines configured and tested
@@ -220,6 +313,14 @@ This checklist serves as a comprehensive framework for validating infrastructure
## 9. NETWORKING & CONNECTIVITY
[[LLM: Network issues are hard to debug. Validate:
1. Is network segmentation appropriate?
2. Are we exposing more than necessary?
3. Can traffic flow where it needs to?
4. Are we protected from common attacks?
5. Do we have visibility into network issues?]]
### 9.1 Network Design
- [ ] VNet/subnet design follows least-privilege principles
@@ -246,6 +347,14 @@ This checklist serves as a comprehensive framework for validating infrastructure
## 10. COMPLIANCE & DOCUMENTATION
[[LLM: Compliance failures can shut down operations. Check:
1. Are we meeting all regulatory requirements?
2. Can we prove compliance to auditors?
3. Is our documentation actually useful?
4. Do teams know about these changes?
5. Will future engineers understand our decisions?]]
### 10.1 Compliance Verification
- [ ] Required compliance evidence collected
@@ -272,6 +381,14 @@ This checklist serves as a comprehensive framework for validating infrastructure
## 11. BMAD WORKFLOW INTEGRATION
[[LLM: Infrastructure must support the BMAD development workflow. Check:
1. Can all dev agents work with this infrastructure?
2. Does it align with architecture decisions?
3. Are product requirements actually met?
4. Can developers be productive?
5. Are we creating or removing blockers?]]
### 11.1 Development Agent Alignment
- [ ] Infrastructure changes support Frontend Dev (Mira) and Fullstack Dev (Enrique) requirements
@@ -298,6 +415,14 @@ This checklist serves as a comprehensive framework for validating infrastructure
## 12. ARCHITECTURE DOCUMENTATION VALIDATION
[[LLM: Good architecture docs prevent repeated mistakes. Verify:
1. Is the documentation complete and current?
2. Can someone new understand the system?
3. Are decisions explained with rationale?
4. Do diagrams match reality?
5. Is evolution possible without major rewrites?]]
### 12.1 Completeness Assessment
- [ ] All required sections of architecture template completed
@@ -324,6 +449,14 @@ This checklist serves as a comprehensive framework for validating infrastructure
## 13. CONTAINER PLATFORM VALIDATION
[[LLM: Container platforms are complex with many failure modes. Check:
1. Is the cluster secure by default?
2. Can it handle expected workload?
3. Are workloads isolated appropriately?
4. Do we have visibility into container health?
5. Can we recover from node failures?]]
### 13.1 Cluster Configuration & Security
- [ ] Container orchestration platform properly installed and configured
@@ -358,6 +491,14 @@ This checklist serves as a comprehensive framework for validating infrastructure
## 14. GITOPS WORKFLOWS VALIDATION
[[LLM: GitOps enables reliable deployments. Validate:
1. Is everything truly declarative?
2. Can we audit all changes?
3. Are environments properly isolated?
4. Can we roll back quickly?
5. Is drift detected and corrected?]]
### 14.1 GitOps Operator & Configuration
- [ ] GitOps operators properly installed and configured
@@ -392,6 +533,14 @@ This checklist serves as a comprehensive framework for validating infrastructure
## 15. SERVICE MESH VALIDATION
[[LLM: Service meshes add complexity but enable advanced patterns. Check:
1. Is the overhead justified by benefits?
2. Is service communication secure?
3. Can we debug service issues?
4. Are failure modes handled gracefully?
5. Do developers understand the mesh?]]
### 15.1 Service Mesh Architecture & Installation
- [ ] Service mesh control plane properly installed and configured
@@ -426,6 +575,14 @@ This checklist serves as a comprehensive framework for validating infrastructure
## 16. DEVELOPER EXPERIENCE PLATFORM VALIDATION
[[LLM: Developer productivity depends on platform usability. Check:
1. Can developers self-serve effectively?
2. Are golden paths actually helpful?
3. Is onboarding smooth and quick?
4. Do developers have the tools they need?
5. Are we measuring developer satisfaction?]]
### 16.1 Self-Service Infrastructure
- [ ] Self-service provisioning for development environments operational
@@ -468,6 +625,68 @@ This checklist serves as a comprehensive framework for validating infrastructure
---
## FINAL INFRASTRUCTURE VALIDATION
[[LLM: COMPREHENSIVE INFRASTRUCTURE REPORT GENERATION
Generate a detailed infrastructure validation report:
1. Executive Summary
   - Overall readiness for production (GO/NO-GO)
   - Critical risks identified
   - Security posture assessment
   - Compliance status
   - Estimated reliability (9s of uptime)
2. Risk Analysis by Category
   - CRITICAL: Production blockers
   - HIGH: Should fix before production
   - MEDIUM: Fix within 30 days
   - LOW: Consider for future improvements
3. Technical Debt Assessment
   - Shortcuts taken and their impact
   - Future scaling concerns
   - Maintenance burden created
   - Cost implications
4. Operational Readiness
   - Can the ops team support this?
   - Are runbooks complete?
   - Is monitoring sufficient?
   - Can we meet SLAs?
5. Security & Compliance Summary
   - Security controls effectiveness
   - Compliance gaps
   - Attack surface analysis
   - Data protection status
6. Platform-Specific Findings
   - Container platform readiness
   - GitOps maturity
   - Service mesh complexity
   - Developer experience gaps
7. Recommendations
   - Must-fix before production
   - Should-fix for stability
   - Consider for optimization
   - Future roadmap items
After presenting the report, ask if the user wants:
- Deep dive into any failed sections
- Risk mitigation strategies
- Implementation prioritization help
- Specific remediation guidance]]
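Two items in the report instructions above lend themselves to a small worked example: the GO/NO-GO decision and the "9s of uptime" estimate. The Python sketch below assumes that any CRITICAL finding forces NO-GO, which follows from the checklist labelling CRITICAL items as production blockers. The function names and the `(severity, description)` tuple shape are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

# Severity labels as listed in the report instructions.
SEVERITIES = ("CRITICAL", "HIGH", "MEDIUM", "LOW")


def go_no_go(findings: Iterable[Tuple[str, str]]) -> str:
    """Return "NO-GO" if any finding is CRITICAL (a production blocker), else "GO"."""
    counts = Counter(severity for severity, _description in findings)
    unknown = set(counts) - set(SEVERITIES)
    if unknown:
        raise ValueError(f"unknown severities: {sorted(unknown)}")
    return "NO-GO" if counts["CRITICAL"] > 0 else "GO"


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Convert an availability percentage into allowed downtime per 365-day year."""
    return (1 - availability_percent / 100) * 365 * 24 * 60
```

Under these assumptions, "three nines" (99.9%) allows roughly 525.6 minutes of downtime per year, or about 8.8 hours. A team might choose a stricter rule, such as also blocking on HIGH findings; the sketch only encodes the minimum implied by the severity definitions.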
### Prerequisites Verified
- [ ] All checklist sections reviewed (1-16)
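One way to confirm that every section has been reviewed is to count checked and unchecked boxes per `##` section of the checklist file. The following Python sketch assumes the checklist uses `## ` headings for top-level sections and `- [ ]` / `- [x]` for items, as in the excerpts in this diff; the function name `checklist_progress` is hypothetical.

```python
import re
from typing import Dict, Tuple

HEADING = re.compile(r"^## (.+)$")           # top-level sections only; "### " does not match
CHECKBOX = re.compile(r"^\s*- \[( |x|X)\] ")  # Markdown task-list item


def checklist_progress(markdown: str) -> Dict[str, Tuple[int, int]]:
    """Map each '## ' section title to (checked_items, total_items)."""
    progress: Dict[str, Tuple[int, int]] = {}
    section = None
    for line in markdown.splitlines():
        heading = HEADING.match(line)
        if heading:
            section = heading.group(1).strip()
            progress.setdefault(section, (0, 0))
            continue
        box = CHECKBOX.match(line)
        if box and section is not None:
            done, total = progress[section]
            checked = 1 if box.group(1) != " " else 0
            progress[section] = (done + checked, total + 1)
    return progress
```

Sections where the checked count is lower than the total are the ones still awaiting review.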