CloudOps & Infrastructure Lifecycle
The objective for technology is to make it easy for your pods to continuously develop and release digital and AI innovations.
AI agents build governed. Humans ship trusted. 80% autonomy, 100% accountability.
This lifecycle follows the 5-stage canonical model per ADR-020. The previous 6-phase Infra model (Discover/Design/Validate/Deploy/Operate/Optimize) maps to the 5 canonical stages as follows: Discover→Discover, Design→Design, Validate→Build (pre-deploy test sub-step), Deploy→Deploy, Operate→Support & Scale, Optimize→Support & Scale.
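The mapping can also be kept as a lookup table, which helps when relabelling older runbooks or tickets. This is a minimal Python sketch; the names `LEGACY_TO_CANONICAL`, `CANONICAL_STAGES`, and `to_canonical` are illustrative and are not defined by ADR-020.

```python
# Legacy 6-phase Infra model -> 5-stage canonical model (mapping as stated above).
LEGACY_TO_CANONICAL = {
    "Discover": "Discover",
    "Design": "Design",
    "Validate": "Build",  # pre-deploy test sub-step
    "Deploy": "Deploy",
    "Operate": "Support & Scale",
    "Optimize": "Support & Scale",
}

# dict.fromkeys keeps first-seen order, so this yields the 5 stages in sequence.
CANONICAL_STAGES = list(dict.fromkeys(LEGACY_TO_CANONICAL.values()))


def to_canonical(legacy_phase: str) -> str:
    """Return the canonical stage for a legacy phase name (case-insensitive)."""
    key = legacy_phase.strip().capitalize()
    if key not in LEGACY_TO_CANONICAL:
        raise ValueError(f"unknown legacy phase: {legacy_phase!r}")
    return LEGACY_TO_CANONICAL[key]
```

Both Operate and Optimize collapse into Support & Scale, so the mapping is not reversible; keep the legacy label alongside the canonical one if the distinction still matters.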
Golden Path: From Discovery to Optimized Infrastructure
Stage 1: Discover (10 seconds)
Who: infrastructure-engineer queries org-wide. HITL reviews inventory.
What: Org-wide resource discovery via Config Aggregator — all accounts, one query.
Why: Querying org-wide before per-account avoids the NARROW_SEARCH_SCOPE anti-pattern. Config Aggregator is the P1 path.
What-if skip: SINGLE_ACCOUNT_ASSUMPTION — missed resources in backup accounts.
How
/inventory:discover
Output
- Asset register across 67+ accounts
- Resource counts by type and account
- Discovery completed in under 10 seconds
Quality Gate: All accounts visible. Config Aggregator responsive.
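As an illustration of the "resource counts by type and account" output and the "all accounts visible" gate, the sketch below post-processes query results. It assumes rows arrive as JSON strings, which is how AWS Config advanced queries return them, and that each row carries `accountId` and `resourceType`. The function names and sample rows are hypothetical; this is not the implementation behind `/inventory:discover`.

```python
import json
from collections import Counter
from typing import Iterable


def count_resources(rows: Iterable[str]) -> Counter:
    """Count resources per (accountId, resourceType) from JSON-string rows."""
    counts: Counter = Counter()
    for raw in rows:
        item = json.loads(raw)
        counts[(item["accountId"], item["resourceType"])] += 1
    return counts


def missing_accounts(counts: Counter, expected: set) -> set:
    """Accounts expected in the org but absent from the query results."""
    seen = {account for account, _resource_type in counts}
    return set(expected) - seen


# Hypothetical rows, for illustration only.
rows = [
    '{"accountId": "111111111111", "resourceType": "AWS::EC2::Instance"}',
    '{"accountId": "111111111111", "resourceType": "AWS::EC2::Instance"}',
    '{"accountId": "222222222222", "resourceType": "AWS::S3::Bucket"}',
]
counts = count_resources(rows)
print(counts[("111111111111", "AWS::EC2::Instance")])  # 2
print(missing_accounts(counts, {"111111111111", "222222222222", "333333333333"}))
```

An account that appears in `missing_accounts` may simply hold no recorded resources, so treat the result as a prompt for HITL review rather than proof of a gap.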
Stage 2: Design (1-2 hours)
Who: cloud-architect designs. infrastructure-engineer implements IaC modules.
What: IaC architecture with CDK or Terraform. Docker-first supply chain enforcement.
Why: IaC makes infrastructure reproducible. Docker-first (nnthanh101/* only) ensures supply chain integrity.
What-if skip: Snowflake servers, BARE_METAL_TOOLS violations.
How
/terraform:test # 3-tier testing: functional, integration, E2E
/cdk:synth # CDK synthesis with cdk-nag security checks
Output
- Validated IaC modules (Terraform HCL or CDK TypeScript)
- 3-tier test results (functional/integration/E2E)
- Architecture decision records (ADRs)
Quality Gate: All tests pass. Registry compliance verified.
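The registry rule (nnthanh101/* only) amounts to an allowlist check on image references. Here is a minimal sketch, assuming references are written as `nnthanh101/name[:tag]`, optionally prefixed with `docker.io/`. The helper names and example image names are hypothetical, and the real enforcement script may apply different rules.

```python
ALLOWED_NAMESPACE = "nnthanh101"  # from the Docker-first rule above


def is_allowed_image(image: str) -> bool:
    """True if the image reference sits under the allowed namespace.

    Accepts "nnthanh101/name[:tag]" and "docker.io/nnthanh101/name[:tag]".
    Any other registry host or namespace is rejected.
    """
    ref = image.strip()
    if ref.startswith("docker.io/"):
        ref = ref[len("docker.io/"):]
    namespace, sep, name = ref.partition("/")
    return namespace == ALLOWED_NAMESPACE and sep == "/" and bool(name)


def compliance_score(images: list) -> float:
    """Percentage of images passing the allowlist (100.0 for an empty list)."""
    if not images:
        return 100.0
    passed = sum(is_allowed_image(image) for image in images)
    return 100.0 * passed / len(images)
```

For example, `compliance_score(["nnthanh101/terraform:1.9", "python:3.12"])` yields 50.0, which would fail a gate that requires 100%.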
Stage 3: Build (IaC authoring + pre-deploy validation)
Who: qa-engineer validates IaC. infrastructure-engineer fixes findings and writes tests.
What: IaC tests (terraform test + checkov + tfsec). Pre-deploy cost estimation + security scanning + registry compliance. Code review sub-step [A-gated].
Why: Testing catches security issues before terraform plan. Cost estimation prevents surprise bills. Registry compliance prevents supply-chain violations.
What-if skip: Unchecked deployments, surprise bills, compliance failures, BARE_METAL_TOOLS violations.
How
/terraform:test # 3-tier testing: functional, integration, E2E
/terraform:cost # Infracost pre-deploy estimation
/devcontainer:validate-registry # Docker registry compliance
Output
- 3-tier test results (functional/integration/E2E)
- Cost estimate within budget constraints
- Security scan clean (checkov + tfsec + trivy)
- Registry compliance score (nnthanh101/* only)
Quality Gate: All tests pass. Cost within budget. Zero CRITICAL/HIGH. Registry 100% compliant.
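The four conditions of this gate can be evaluated together so that a failing run reports every reason at once. The sketch below is one way to express that; `BuildReport` and its fields are hypothetical names, and the cost and budget are assumed to share the same currency and period.

```python
from dataclasses import dataclass, field


@dataclass
class BuildReport:
    tests_passed: bool
    estimated_cost: float  # same currency and period as budget (assumption)
    budget: float
    findings: dict = field(default_factory=dict)  # severity -> count
    registry_compliance: float = 0.0  # percent, 0-100


def gate_failures(report: BuildReport) -> list:
    """Return the reasons the gate fails; an empty list means it passes."""
    reasons = []
    if not report.tests_passed:
        reasons.append("tests failing")
    if report.estimated_cost > report.budget:
        reasons.append("cost over budget")
    blocking = sum(report.findings.get(sev, 0) for sev in ("CRITICAL", "HIGH"))
    if blocking:
        reasons.append(f"{blocking} CRITICAL/HIGH findings")
    if report.registry_compliance < 100.0:
        reasons.append("registry not 100% compliant")
    return reasons
```

MEDIUM and LOW findings do not block in this sketch, matching the "Zero CRITICAL/HIGH" wording of the gate.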
Stage 4: Deploy (HITL gate)
Who: infrastructure-engineer + kubernetes-engineer prepare. HITL approves and commits.
What: GitOps deployment via ArgoCD or ECS with health checks.
Why: Agents prepare, humans decide, humans commit. Principle I: no agent executes terraform apply.
What-if skip: Manual deployments, no rollback, environment drift.
How
/kubernetes:deploy # ArgoCD application sync (agents prepare)
# HITL runs: terraform apply # Human executes after reviewing plan
Output
- Zero-downtime deployment with automated rollback
- Health checks passing across all services
- Deployment evidence in tmp/
Quality Gate: HITL reviews terraform plan. Health checks green.
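Principle I can be backed by a simple command guard on the agent side. The sketch below is an assumption about how such a guard might look, not a description of the existing tooling: `agent_may_run` is a hypothetical name, only `apply` comes from the text above, and `destroy` is added as a further assumption.

```python
import shlex

# "apply" comes from Principle I above; "destroy" is an added assumption.
HITL_ONLY_SUBCOMMANDS = {"apply", "destroy"}


def agent_may_run(command: str) -> bool:
    """False for terraform commands reserved for the human in the loop."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] != "terraform":
        return True  # this guard only covers terraform invocations
    # The subcommand is the first non-flag token after "terraform";
    # global options such as -chdir=... come before it.
    subcommand = next((t for t in tokens[1:] if not t.startswith("-")), "")
    return subcommand not in HITL_ONLY_SUBCOMMANDS
```

With this guard, `terraform plan -out=tfplan` is allowed for agents while `terraform -chdir=envs/prod apply` is not. Allowing every non-terraform command is a simplification; a real guard would likely use an allowlist instead.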
Stage 5: Support & Scale (ongoing)
Who: sre-engineer runs READONLY operations (monitor sub-step [A-readonly]). finops-engineer analyzes costs. HITL triages findings and approves remediations ([HITL-decide]).
What: Inventory cross-validation, health event triage, certificate monitoring, decommission scoring, DORA metrics, cost optimization.
Why: Continuous validation prevents drift. Evidence-based decommission eliminates zombie resources. DORA tracks infrastructure velocity.
What-if skip: Invisible drift, expired certificates, zombie resources, unchecked cloud spend.
Operate sub-step (monitor [A-readonly])
/inventory:lz-cross-validate # 4-way cross-validation pipeline
/cloudops:weekly-cert-report # Certificate expiry monitoring
Output: Cross-validated asset register with accuracy deltas. Certificate expiry dashboard (30/60/90 day triage).
Quality Gate: Cross-validation accuracy >=99.5%. Zero expired-in-use certs.
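The 30/60/90 day triage reduces to bucketing each certificate by days until expiry. This is a minimal sketch; the bucket labels and the choice to count a certificate expiring today as inside the 30-day window are assumptions, not taken from the report itself.

```python
from datetime import date


def triage_bucket(not_after: date, today: date) -> str:
    """Bucket a certificate by days until expiry: expired, 30, 60, 90, or ok."""
    days_left = (not_after - today).days
    if days_left < 0:
        return "expired"
    for window in (30, 60, 90):
        if days_left <= window:
            return f"<={window}d"
    return "ok"
```

Any in-use certificate that lands in the `expired` bucket breaches the "Zero expired-in-use certs" gate; the other buckets set renewal priority.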
Optimize sub-step (remediate [HITL-decide])
/finops:decommission-inventory # Scream-test scored decommission candidates
/metrics:update-dora # Infrastructure DORA metrics
Output: Decommission candidates with S1-S5 scream-test scores. DORA metrics updated. Cost savings identified.
Quality Gate: Score >=70 flagged for scream test. DORA visible. HITL approves decommission actions.
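The threshold in this gate can be applied as a filter over already-computed scores. The sketch assumes each candidate has a single total score on the same scale as the threshold. How the S1-S5 components combine into that total is not stated here, so it is left out, and `flag_for_scream_test` is a hypothetical name.

```python
SCREAM_TEST_THRESHOLD = 70  # from the quality gate above


def flag_for_scream_test(scores: dict) -> list:
    """Resource IDs whose score meets the threshold, highest score first."""
    flagged = [rid for rid, score in scores.items() if score >= SCREAM_TEST_THRESHOLD]
    return sorted(flagged, key=lambda rid: scores[rid], reverse=True)
```

The function only produces a candidate list; the decommission action itself still waits for HITL approval.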
LEAN/5S Applied to CloudOps
| Principle | Application | Evidence |
|---|---|---|
| Sort | Config Aggregator replaces per-account search | /inventory:discover under 10s |
| Set in Order | 3-tier IaC testing (functional/integration/E2E) | /terraform:test pipeline |
| Shine | Docker-first enforcement — no bare-metal tools | enforce-container-first.sh |
| Standardize | nnthanh101/* registry only — supply chain verified | enforce-docker-registry.sh |
| Sustain | DORA metrics track infrastructure velocity | /metrics:update-dora |
By Persona
Solo CloudOps Engineer
Path: /inventory:discover → /security:cert-inventory → /finops:decommission-inventory
Time to Value: Org-wide inventory in under 10 seconds.
Infrastructure Team Lead
Path: /terraform:test → /devcontainer:validate-registry → /metrics:update-dora
Time to Value: Governed IaC pipeline in 1 day.
Enterprise Cloud Architect
Path: /cdk:synth → /terraform:cost → /inventory:lz-cross-validate
Time to Value: Architecture validation with evidence in 1 hour.
Common Mistakes (Anti-Patterns)
| Mistake | Why It Fails | Fix |
|---|---|---|
| NARROW_SEARCH_SCOPE | Per-account search misses resources | Config Aggregator org-wide first |
| SINGLE_ACCOUNT_ASSUMPTION | Trusting task-provided account ID | Phase 0 org-wide discovery |
| BARE_METAL_TOOLS | Running terraform on host | enforce-container-first.sh |
| REBOOT_FIRST_DECOMMISSION_SECOND | Treats the symptom, not the disease | Scream-test score before maintenance |
| EXISTENCE_WITHOUT_ACTIVITY | Finding resources without checking activity | Validate last-modified/flow-logs |
| PUSH_WITHOUT_LOCAL_CI | Pushing workflows without local validation | task ci:lint-workflows first |
Quick Reference: Command Cheat Sheet
# Discover (org-wide)
/inventory:discover
# Design + Test IaC
/terraform:test
/cdk:synth
# Validate cost + registry
/terraform:cost
/devcontainer:validate-registry
# Operate
/inventory:lz-cross-validate
/cloudops:weekly-cert-report
# Optimize
/finops:decommission-inventory
/metrics:update-dora
Agent Team
| Agent | Role in This Path | Stage | Talent Bench |
|---|---|---|---|
| infrastructure-engineer | CDK/Terraform code generation + IaC execution | Build/Deploy | Profile |
| kubernetes-engineer | K3s/K8s cluster lifecycle + container orchestration | Deploy/Support & Scale | Profile |
| devops-security-engineer | IaC security scanning (checkov, tfsec) + hardening | Build | Profile |
| cloud-architect | Multi-cloud architecture design + cost progression validation | Design/Support & Scale | Profile |
| qa-automation-engineer | 3-tier testing (snapshot→LocalStack→AWS) + regression validation | Build/Deploy | Profile |
7 Skills Coverage
| Skill | Coverage in This Path | Implementation |
|---|---|---|
| S1 System Design | CDK→Terraform→K8s progression from design to production | IaC architecture, module patterns, cloud-agnostic abstraction |
| S2 Tool Design | Terraform schemas + CDK construct validation + kubectl manifest schemas | Tool integration, input validation, error messages |
| S3 Retrieval | AWS/Azure provider docs, CloudFormation resource docs via context7 | Documentation lookups, API contract discovery, schema reference |
| S4 Reliability | terraform state locking + retry config on API calls + dry-run validation | State consistency, mutation safety, preview before apply |
| S5 Security | checkov/tfsec security scanning (SAST), docker-registry validation (supply chain), RBAC enforcement | Shift-left security, registry allowlist, access control |
| S6 Evaluation | 3-tier testing: snapshot tests (fast), LocalStack (integration), real AWS (E2E) | Progressive validation, early failure detection, production confidence |
| S7 Product Thinking | Cost progression visualization: LOCAL ($0) → DEV ($20) → STAGING ($80) → PROD ($180), team self-service infrastructure as products | Cost awareness, team autonomy, business impact (TTM reduction) |
Last Updated: March 2026 | Status: Active | Maintenance: infrastructure-engineer