checklist standardization and improvement with llm elicitation
@@ -2,8 +2,45 @@
This checklist serves as a comprehensive framework for validating infrastructure changes before deployment to production. The DevOps/Platform Engineer should systematically work through each item, ensuring the infrastructure is secure, compliant, resilient, and properly implemented according to organizational standards.

[[LLM: INITIALIZATION INSTRUCTIONS - INFRASTRUCTURE VALIDATION

Before proceeding with this checklist, ensure you have access to:

1. platform-architecture.md or infrastructure-architecture.md (check docs/platform-architecture.md)
2. Infrastructure as Code files (Terraform, CloudFormation, Bicep, etc.)
3. CI/CD pipeline configurations
4. Security and compliance requirements
5. Network diagrams and configurations
6. Monitoring and alerting specifications

IMPORTANT: Infrastructure failures can cause complete outages. This checklist must be thorough.

VALIDATION PRINCIPLES:

1. Security First - Every decision should consider security implications
2. Automation - Manual processes are error-prone and don't scale
3. Resilience - Assume everything will fail and plan accordingly
4. Compliance - Regulatory requirements are non-negotiable
5. Cost Awareness - Over-provisioning wastes money, under-provisioning causes outages

EXECUTION MODE:
Ask the user if they want to work through the checklist:

- Section by section (interactive mode) - Deep dive into each area
- All at once (comprehensive mode) - Complete analysis with summary report

REMEMBER: Production infrastructure supports real users and business operations. Mistakes here have immediate, visible impact.]]

## 1. SECURITY & COMPLIANCE

[[LLM: Security breaches destroy trust and businesses. For each item:

1. Verify implementation, not just documentation
2. Check for common vulnerabilities (default passwords, open ports, etc.)
3. Ensure compliance requirements are actually met, not just considered
4. Look for defense in depth - multiple layers of security
5. Consider the blast radius if this security control fails]]

### 1.1 Access Management

- [ ] RBAC principles applied with least privilege access
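The "open ports" check in the Section 1 prompt can be made concrete with a small script. The following is only a sketch: the `IngressRule` shape, the `SENSITIVE_PORTS` set, and the sample rules are hypothetical and do not correspond to any particular cloud provider's schema.

```python
from dataclasses import dataclass


# Hypothetical, simplified view of an ingress rule; real provider schemas differ.
@dataclass(frozen=True)
class IngressRule:
    name: str
    cidr: str
    port: int


# Assumed list of ports that should not be reachable from any address.
SENSITIVE_PORTS = {22, 3389, 3306, 5432, 6379}
WORLD_CIDRS = {"0.0.0.0/0", "::/0"}


def find_exposed_rules(rules):
    """Return the rules that open a sensitive port to every source address."""
    return [
        rule
        for rule in rules
        if rule.cidr in WORLD_CIDRS and rule.port in SENSITIVE_PORTS
    ]


if __name__ == "__main__":
    sample = [
        IngressRule("web", "0.0.0.0/0", 443),
        IngressRule("ssh-anywhere", "0.0.0.0/0", 22),
        IngressRule("db-internal", "10.0.0.0/16", 5432),
    ]
    for rule in find_exposed_rules(sample):
        print(f"exposed: {rule.name} ({rule.cidr} -> port {rule.port})")
```

A check like this only covers one layer; the prompt's "defense in depth" item still calls for reviewing the controls behind the network boundary.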
@@ -38,6 +75,14 @@ This checklist serves as a comprehensive framework for validating infrastructure
## 2. INFRASTRUCTURE AS CODE

[[LLM: IaC prevents configuration drift and enables disaster recovery. Verify:

1. EVERYTHING is in code - no "just this once" manual changes
2. Code quality matches application code standards
3. State management won't cause conflicts or data loss
4. Changes can be rolled back safely
5. New team members can understand and modify the infrastructure]]

### 2.1 IaC Implementation

- [ ] All resources defined in IaC (Terraform/Bicep/ARM)
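Configuration drift, as mentioned in the Section 2 prompt, amounts to a difference between what the code declares and what is actually deployed. A minimal sketch of that comparison follows; the attribute names and the flat dictionary representation are assumptions for illustration, not the output format of Terraform, Bicep, or ARM.

```python
# Marker used when an attribute exists on only one side of the comparison.
_ABSENT = "<absent>"


def detect_drift(declared, observed):
    """Map each drifted attribute to a (declared, observed) pair.

    An attribute counts as drifted when its value differs or when it is
    present on only one side.
    """
    drift = {}
    for key in sorted(declared.keys() | observed.keys()):
        want = declared.get(key, _ABSENT)
        have = observed.get(key, _ABSENT)
        if want != have:
            drift[key] = (want, have)
    return drift


if __name__ == "__main__":
    # Hypothetical resource attributes.
    declared = {"instance_type": "m5.large", "encrypted": True}
    observed = {"instance_type": "m5.xlarge", "encrypted": True, "public_ip": "203.0.113.7"}
    for key, (want, have) in detect_drift(declared, observed).items():
        print(f"{key}: declared={want!r} observed={have!r}")
```

In practice the IaC tool's own plan or what-if output is the authoritative drift report; the sketch only shows the idea being validated.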
@@ -64,6 +109,14 @@ This checklist serves as a comprehensive framework for validating infrastructure
## 3. RESILIENCE & AVAILABILITY

[[LLM: Downtime costs money and reputation. Check:

1. What happens when each component fails?
2. Are we meeting our SLA commitments?
3. Has resilience been tested, not just designed?
4. Can the system handle expected peak load?
5. Are failure modes graceful or catastrophic?]]

### 3.1 High Availability

- [ ] Resources deployed across appropriate availability zones
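Whether an SLA commitment is being met is easier to judge once the availability target is converted into a downtime budget. The sketch below does that arithmetic; the function name and the 30-day default period are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: float = 30.0) -> float:
    """Return the downtime budget, in minutes, for an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows about {budget:.1f} minutes of downtime")
```

For example, a 99.9% target over 30 days leaves roughly 43.2 minutes of downtime, which can be compared against measured outage time for the same period.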
@@ -90,6 +143,14 @@ This checklist serves as a comprehensive framework for validating infrastructure
## 4. BACKUP & DISASTER RECOVERY

[[LLM: Backups are worthless if they don't restore. Validate:

1. Have restores been tested recently?
2. Do backup windows meet business needs?
3. Are backups stored in a different failure domain?
4. Can we meet our RTO/RPO commitments?
5. Who has tested the disaster recovery runbook?]]

### 4.1 Backup Strategy

- [ ] Backup strategy defined and implemented
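Two of the Section 4 questions, whether the RPO can be met and whether restores were tested recently, reduce to simple time comparisons. The sketch below assumes backup completion times and the last restore test date are available as `datetime` values; the function names and thresholds are hypothetical.

```python
from datetime import datetime, timedelta


def backup_gaps(backup_times, rpo):
    """Return (earlier, later) pairs of consecutive backups spaced wider than the RPO."""
    ordered = sorted(backup_times)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > rpo]


def restore_test_is_stale(last_restore_test, now, max_age):
    """Return True when the last successful restore test is older than max_age."""
    return now - last_restore_test > max_age


if __name__ == "__main__":
    start = datetime(2024, 1, 1)
    backups = [start, start + timedelta(hours=6), start + timedelta(hours=20)]
    for earlier, later in backup_gaps(backups, rpo=timedelta(hours=12)):
        print(f"RPO exceeded between {earlier} and {later}")
```

A gap wider than the RPO means data written in that window could not have been recovered to within the committed objective.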
@@ -116,6 +177,14 @@ This checklist serves as a comprehensive framework for validating infrastructure
## 5. MONITORING & OBSERVABILITY

[[LLM: You can't fix what you can't see. Ensure:

1. Every critical metric has monitoring
2. Alerts fire BEFORE users complain
3. Logs are searchable and retained appropriately
4. Dashboards show what actually matters
5. Someone knows how to interpret the data]]

### 5.1 Monitoring Implementation

- [ ] Monitoring coverage for all critical components
@@ -142,6 +211,14 @@ This checklist serves as a comprehensive framework for validating infrastructure
## 6. PERFORMANCE & OPTIMIZATION

[[LLM: Performance impacts user experience and costs. Check:

1. Has performance been tested under realistic load?
2. Are we over-provisioned (wasting money)?
3. Are we under-provisioned (risking outages)?
4. Do we know our breaking point?
5. Is autoscaling configured correctly?]]

### 6.1 Performance Testing

- [ ] Performance testing completed and baseline established
@@ -168,6 +245,14 @@ This checklist serves as a comprehensive framework for validating infrastructure
## 7. OPERATIONS & GOVERNANCE

[[LLM: Good operations prevent 3am emergencies. Verify:

1. Can a new team member understand the system?
2. Are runbooks tested and current?
3. Do we know who owns what?
4. Are costs tracked and controlled?
5. Will auditors be satisfied?]]

### 7.1 Documentation

- [ ] Change documentation updated
@@ -194,6 +279,14 @@ This checklist serves as a comprehensive framework for validating infrastructure
## 8. CI/CD & DEPLOYMENT

[[LLM: Deployment failures impact everyone. Ensure:

1. Can we deploy without downtime?
2. Can we rollback quickly if needed?
3. Are deployments repeatable and reliable?
4. Do we test infrastructure changes?
5. Is the pipeline itself secure?]]

### 8.1 Pipeline Configuration

- [ ] CI/CD pipelines configured and tested
@@ -220,6 +313,14 @@ This checklist serves as a comprehensive framework for validating infrastructure
## 9. NETWORKING & CONNECTIVITY

[[LLM: Network issues are hard to debug. Validate:

1. Is network segmentation appropriate?
2. Are we exposing more than necessary?
3. Can traffic flow where it needs to?
4. Are we protected from common attacks?
5. Do we have visibility into network issues?]]

### 9.1 Network Design

- [ ] VNet/subnet design follows least-privilege principles
@@ -246,6 +347,14 @@ This checklist serves as a comprehensive framework for validating infrastructure
## 10. COMPLIANCE & DOCUMENTATION

[[LLM: Compliance failures can shut down operations. Ensure:

1. Are we meeting all regulatory requirements?
2. Can we prove compliance to auditors?
3. Is our documentation actually useful?
4. Do teams know about these changes?
5. Will future engineers understand our decisions?]]

### 10.1 Compliance Verification

- [ ] Required compliance evidence collected
@@ -272,6 +381,14 @@ This checklist serves as a comprehensive framework for validating infrastructure
## 11. BMAD WORKFLOW INTEGRATION

[[LLM: Infrastructure must support the BMAD development workflow. Check:

1. Can all dev agents work with this infrastructure?
2. Does it align with architecture decisions?
3. Are product requirements actually met?
4. Can developers be productive?
5. Are we creating or removing blockers?]]

### 11.1 Development Agent Alignment

- [ ] Infrastructure changes support Frontend Dev (Mira) and Fullstack Dev (Enrique) requirements
@@ -298,6 +415,14 @@ This checklist serves as a comprehensive framework for validating infrastructure
## 12. ARCHITECTURE DOCUMENTATION VALIDATION

[[LLM: Good architecture docs prevent repeated mistakes. Verify:

1. Is the documentation complete and current?
2. Can someone new understand the system?
3. Are decisions explained with rationale?
4. Do diagrams match reality?
5. Is evolution possible without major rewrites?]]

### 12.1 Completeness Assessment

- [ ] All required sections of architecture template completed
@@ -324,6 +449,14 @@ This checklist serves as a comprehensive framework for validating infrastructure
## 13. CONTAINER PLATFORM VALIDATION

[[LLM: Container platforms are complex with many failure modes. Ensure:

1. Is the cluster secure by default?
2. Can it handle expected workload?
3. Are workloads isolated appropriately?
4. Do we have visibility into container health?
5. Can we recover from node failures?]]

### 13.1 Cluster Configuration & Security

- [ ] Container orchestration platform properly installed and configured
@@ -358,6 +491,14 @@ This checklist serves as a comprehensive framework for validating infrastructure
## 14. GITOPS WORKFLOWS VALIDATION

[[LLM: GitOps enables reliable deployments. Validate:

1. Is everything truly declarative?
2. Can we audit all changes?
3. Are environments properly isolated?
4. Can we rollback quickly?
5. Is drift detected and corrected?]]

### 14.1 GitOps Operator & Configuration

- [ ] GitOps operators properly installed and configured
@@ -392,6 +533,14 @@ This checklist serves as a comprehensive framework for validating infrastructure
## 15. SERVICE MESH VALIDATION

[[LLM: Service meshes add complexity but enable advanced patterns. Check:

1. Is the overhead justified by benefits?
2. Is service communication secure?
3. Can we debug service issues?
4. Are failure modes handled gracefully?
5. Do developers understand the mesh?]]

### 15.1 Service Mesh Architecture & Installation

- [ ] Service mesh control plane properly installed and configured
@@ -426,6 +575,14 @@ This checklist serves as a comprehensive framework for validating infrastructure
## 16. DEVELOPER EXPERIENCE PLATFORM VALIDATION

[[LLM: Developer productivity depends on platform usability. Ensure:

1. Can developers self-serve effectively?
2. Are golden paths actually helpful?
3. Is onboarding smooth and quick?
4. Do developers have the tools they need?
5. Are we measuring developer satisfaction?]]

### 16.1 Self-Service Infrastructure

- [ ] Self-service provisioning for development environments operational
@@ -468,6 +625,68 @@ This checklist serves as a comprehensive framework for validating infrastructure
---

## FINAL INFRASTRUCTURE VALIDATION

[[LLM: COMPREHENSIVE INFRASTRUCTURE REPORT GENERATION

Generate a detailed infrastructure validation report:

1. Executive Summary

- Overall readiness for production (GO/NO-GO)
- Critical risks identified
- Security posture assessment
- Compliance status
- Estimated reliability (9s of uptime)

2. Risk Analysis by Category

- CRITICAL: Production blockers
- HIGH: Should fix before production
- MEDIUM: Fix within 30 days
- LOW: Consider for future improvements

3. Technical Debt Assessment

- Shortcuts taken and their impact
- Future scaling concerns
- Maintenance burden created
- Cost implications

4. Operational Readiness

- Can the ops team support this?
- Are runbooks complete?
- Is monitoring sufficient?
- Can we meet SLAs?

5. Security & Compliance Summary

- Security controls effectiveness
- Compliance gaps
- Attack surface analysis
- Data protection status

6. Platform-Specific Findings

- Container platform readiness
- GitOps maturity
- Service mesh complexity
- Developer experience gaps

7. Recommendations
- Must-fix before production
- Should-fix for stability
- Consider for optimization
- Future roadmap items

After presenting the report, ask if the user wants:

- Deep dive into any failed sections
- Risk mitigation strategies
- Implementation prioritization help
- Specific remediation guidance]]

### Prerequisites Verified

- [ ] All checklist sections reviewed (1-16)
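
The report prompt's risk categories and GO/NO-GO summary can be expressed as a small aggregation step. This is a sketch under stated assumptions: the `Severity` enum mirrors the four categories in the prompt, while the rule that only CRITICAL findings force NO-GO, the `summarize` function, and the sample findings are illustrative choices, not part of the checklist.

```python
from collections import Counter
from enum import Enum


class Severity(Enum):
    CRITICAL = "CRITICAL"  # Production blockers
    HIGH = "HIGH"          # Should fix before production
    MEDIUM = "MEDIUM"      # Fix within 30 days
    LOW = "LOW"            # Consider for future improvements


def summarize(findings):
    """Return (verdict, counts) for a list of (Severity, description) findings.

    Assumption: any CRITICAL finding yields NO-GO; other levels are reported
    but do not block on their own.
    """
    counts = Counter(severity for severity, _ in findings)
    verdict = "NO-GO" if counts[Severity.CRITICAL] > 0 else "GO"
    return verdict, counts


if __name__ == "__main__":
    # Hypothetical findings for illustration only.
    findings = [
        (Severity.HIGH, "Restore procedure has not been tested"),
        (Severity.CRITICAL, "SSH open to 0.0.0.0/0 on bastion"),
        (Severity.LOW, "Dashboard naming is inconsistent"),
    ]
    verdict, counts = summarize(findings)
    print(verdict, {severity.value: counts[severity] for severity in Severity})
```

A team may reasonably choose a stricter rule, such as blocking on HIGH findings as well; the threshold is a policy decision the checklist leaves open.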