CloudOps & Infrastructure Lifecycle
The objective of technology is to make it easy for your pods to continuously develop and release digital and AI innovations.
AI agents build governed. Humans ship trusted. 80% autonomy, 100% accountability.
Golden Path: From Discovery to Optimized Infrastructure
Phase 1: Discover (10 seconds)
Who: infrastructure-engineer queries org-wide. HITL reviews inventory.
What: Org-wide resource discovery via Config Aggregator — all accounts, one query.
Why: Querying org-wide before per-account prevents the NARROW_SEARCH_SCOPE anti-pattern. Config Aggregator is the P1 path.
What-if skip: SINGLE_ACCOUNT_ASSUMPTION — missed resources in backup accounts.
How
/inventory:discover
Output
- Asset register across 67+ accounts
- Resource counts by type and account
- Discovery completed in under 10 seconds
Quality Gate: All accounts visible. Config Aggregator responsive.
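The internals of /inventory:discover are not shown here, so the following is only a minimal sketch of how an org-wide query through an AWS Config aggregator could feed the "resource counts by type and account" output. It assumes boto3 is installed and that an aggregator exists; the name `org-aggregator` and the helper names `fetch_results` and `summarize` are hypothetical.

```python
import json
from collections import Counter

# Hypothetical aggregator name; substitute the one configured in your organization.
AGGREGATOR_NAME = "org-aggregator"

# AWS Config advanced query: one row per resource across all aggregated accounts.
QUERY = "SELECT accountId, resourceType, resourceId"


def fetch_results(aggregator_name=AGGREGATOR_NAME, query=QUERY):
    """Page through aggregator query results (read-only call, needs AWS credentials)."""
    import boto3  # imported lazily so summarize() can be used without AWS access

    client = boto3.client("config")
    kwargs = {"Expression": query, "ConfigurationAggregatorName": aggregator_name}
    results = []
    while True:
        page = client.select_aggregate_resource_config(**kwargs)
        results.extend(page["Results"])
        token = page.get("NextToken")
        if not token:
            return results
        kwargs["NextToken"] = token


def summarize(results):
    """Count resources by (account, type); each result is a JSON-encoded string."""
    counts = Counter()
    for raw in results:
        row = json.loads(raw)
        counts[(row["accountId"], row["resourceType"])] += 1
    return counts
```

Calling `summarize(fetch_results())` would produce the per-account, per-type counts. Because `summarize` works on plain strings, it can also be exercised offline against saved query output.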
Phase 2: Design (1-2 hours)
Who: cloud-architect designs. infrastructure-engineer implements IaC modules.
What: IaC architecture with CDK or Terraform. Docker-first supply chain enforcement.
Why: IaC makes infrastructure reproducible. Docker-first (nnthanh101/* only) ensures supply chain integrity.
What-if skip: Snowflake servers, BARE_METAL_TOOLS violations.
How
/terraform:test # 3-tier testing: functional, integration, E2E
/cdk:synth # CDK synthesis with cdk-nag security checks
Output
- Validated IaC modules (Terraform HCL or CDK TypeScript)
- 3-tier test results (functional/integration/E2E)
- Architecture decision records (ADRs)
Quality Gate: All tests pass. Registry compliance verified.
Phase 3: Validate (30 min)
Who: qa-engineer validates. infrastructure-engineer fixes findings.
What: Pre-deploy cost estimation + security scanning + registry compliance.
Why: Testing catches security issues before terraform plan. Cost estimation prevents surprise bills.
What-if skip: Unchecked deployments, surprise bills, compliance failures.
How
/terraform:cost # Infracost pre-deploy estimation
/devcontainer:validate-registry # Docker registry compliance
Output
- Cost estimate within budget constraints
- Security scan clean (checkov + tfsec + trivy)
- Registry compliance score (nnthanh101/* only)
Quality Gate: Cost within budget. Zero CRITICAL/HIGH. Registry 100% compliant.
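The three conditions of this quality gate can be expressed as a single check. The sketch below is illustrative rather than the actual gate implementation: `ValidationReport`, `registry_compliance`, and `gate` are hypothetical names, and it assumes image references are given as short names such as `nnthanh101/terraform:latest`, with no registry host prefix.

```python
from dataclasses import dataclass, field

ALLOWED_PREFIX = "nnthanh101/"  # registry namespace named in this guide
BLOCKING_SEVERITIES = {"CRITICAL", "HIGH"}


@dataclass
class ValidationReport:
    monthly_cost: float                            # pre-deploy cost estimate
    budget: float                                  # agreed budget ceiling
    findings: list = field(default_factory=list)   # severity strings from scanners
    images: list = field(default_factory=list)     # container image references


def registry_compliance(images):
    """Return the percentage of images that come from the allowed namespace."""
    if not images:
        return 100.0
    compliant = sum(1 for image in images if image.startswith(ALLOWED_PREFIX))
    return 100.0 * compliant / len(images)


def gate(report):
    """Return a list of failure reasons; an empty list means the gate passes."""
    failures = []
    if report.monthly_cost > report.budget:
        failures.append("cost over budget")
    blocking = [s for s in report.findings if s.upper() in BLOCKING_SEVERITIES]
    if blocking:
        failures.append(f"{len(blocking)} CRITICAL/HIGH findings")
    if registry_compliance(report.images) < 100.0:
        failures.append("registry not 100% compliant")
    return failures
```

Returning the reasons instead of a bare boolean gives the engineer fixing the findings a concrete list to work from.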
Phase 4: Deploy (HITL gate)
Who: infrastructure-engineer + kubernetes-engineer prepare. HITL approves and commits.
What: GitOps deployment via ArgoCD or ECS with health checks.
Why: Agents prepare, humans decide, humans commit. Principle I: no agent executes terraform apply.
What-if skip: Manual deployments, no rollback, environment drift.
How
/kubernetes:deploy # ArgoCD application sync (agents prepare)
# HITL runs: terraform apply # Human executes after reviewing plan
Output
- Zero-downtime deployment with automated rollback
- Health checks passing across all services
- Deployment evidence in tmp/
Quality Gate: HITL reviews terraform plan. Health checks green.
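To make the HITL plan review faster, an agent could condense the plan into a short summary before handing it over. The sketch below assumes the plan has been exported with `terraform show -json` and reads the `resource_changes` list from that JSON; the function name `summarize_plan` is hypothetical, and nothing here applies changes.

```python
from collections import Counter


def summarize_plan(plan):
    """Summarize a Terraform JSON plan for human review.

    Returns (action counts, addresses of resources that would be deleted
    or replaced). Resources whose only action is "no-op" are skipped.
    """
    counts = Counter()
    destructive = []
    for change in plan.get("resource_changes", []):
        actions = change["change"]["actions"]
        if actions == ["no-op"]:
            continue
        for action in actions:
            counts[action] += 1
        # A replacement shows up as both "delete" and "create".
        if "delete" in actions:
            destructive.append(change["address"])
    return counts, destructive
```

A typical flow would be `terraform plan -out plan.out`, then `terraform show -json plan.out > plan.json`, then `summarize_plan(json.load(open("plan.json")))`. The human still reads the full plan and runs `terraform apply`; the summary only highlights where to look first.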
Phase 5: Operate (ongoing)
Who: sre-automation-specialist runs READONLY operations. HITL triages findings.
What: Inventory cross-validation, health event triage, certificate monitoring.
Why: Continuous validation prevents drift. Health events triaged proactively.
What-if skip: Invisible drift, expired certificates, unmanaged resources.
How
/inventory:lz-cross-validate # 4-way cross-validation pipeline
/cloudops:weekly-cert-report # Certificate expiry monitoring
Output
- Cross-validated asset register with accuracy deltas
- Certificate expiry dashboard (30/60/90 day triage)
- Health event investigation evidence
Quality Gate: Cross-validation accuracy >=99.5%. Zero expired-in-use certs.
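The 30/60/90 day triage and the accuracy threshold can be illustrated with two small helpers. This is a sketch under assumptions: the guide does not define how the 4-way pipeline computes accuracy, so `cross_validation_accuracy` simply measures what share of a reference inventory also appears in a second source, and both function names are hypothetical.

```python
from datetime import date


def triage_bucket(expires_on, today):
    """Place a certificate into an expiry bucket relative to `today`."""
    days_left = (expires_on - today).days
    if days_left < 0:
        return "expired"
    if days_left <= 30:
        return "30-day"
    if days_left <= 60:
        return "60-day"
    if days_left <= 90:
        return "90-day"
    return "ok"


def cross_validation_accuracy(reference_ids, observed_ids):
    """Percentage of reference resources that were also seen by a second source."""
    reference = set(reference_ids)
    if not reference:
        return 100.0
    matched = reference & set(observed_ids)
    return 100.0 * len(matched) / len(reference)
```

With these helpers, the gate passes when the accuracy is at least 99.5 and no certificate that is still in use falls into the `expired` bucket.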
Phase 6: Optimize (per sprint)
Who: gitops-cost-optimizer analyzes. HITL decides decommission actions.
What: Decommission unused resources, rightsize, track infrastructure DORA metrics.
Why: Evidence-based decommission. REBOOT_FIRST_DECOMMISSION_SECOND anti-pattern eliminated.
What-if skip: Zombie resources persist, cloud spend grows unchecked.
How
/finops:decommission-inventory # Scream-test scored decommission candidates
/metrics:update-dora # Infrastructure DORA metrics
Output
- Decommission candidates with S1-S5 scream-test scores
- DORA metrics updated (deploy frequency, lead time, CFR, MTTR)
- Cost savings identified and attributed
Quality Gate: Candidates scoring >=70 are flagged for a scream test. DORA metrics visible.
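The four DORA metrics listed above can be derived from a simple log of deployments. The sketch below is one possible calculation, not the logic behind /metrics:update-dora: the `Deployment` record and `dora_metrics` function are hypothetical, lead time is taken as the median commit-to-deploy time, and MTTR as the mean time from a failed deployment to its restoration.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean, median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set once a failed deploy is recovered


def _hours(delta):
    return delta.total_seconds() / 3600


def dora_metrics(deployments, period_days):
    """Compute deploy frequency, lead time, CFR, and MTTR for one reporting period."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    lead_hours = [_hours(d.deployed_at - d.committed_at) for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_hours = [
        _hours(d.restored_at - d.deployed_at)
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploy_frequency_per_day": len(deployments) / period_days,
        "lead_time_hours_median": median(lead_hours),
        "change_failure_rate": len(failures) / len(deployments),
        "mttr_hours": mean(restore_hours) if restore_hours else None,
    }
```

Running this once per sprint over the deployments in that window yields the figures to publish on the DORA dashboard.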
LEAN/5S Applied to CloudOps
| Principle | Application | Evidence |
|---|---|---|
| Sort | Config Aggregator replaces per-account search | /inventory:discover under 10s |
| Set in Order | 3-tier IaC testing (functional/integration/E2E) | /terraform:test pipeline |
| Shine | Docker-first enforcement — no bare-metal tools | enforce-container-first.sh |
| Standardize | nnthanh101/* registry only — supply chain verified | enforce-docker-registry.sh |
| Sustain | DORA metrics track infrastructure velocity | /metrics:update-dora |
By Persona
Solo CloudOps Engineer
Path: /inventory:discover → /security:cert-inventory → /finops:decommission-inventory
Time to Value: Org-wide inventory in under 10 seconds.
Infrastructure Team Lead
Path: /terraform:test → /devcontainer:validate-registry → /metrics:update-dora
Time to Value: Governed IaC pipeline in 1 day.
Enterprise Cloud Architect
Path: /cdk:synth → /terraform:cost → /inventory:lz-cross-validate
Time to Value: Architecture validation with evidence in 1 hour.
Common Mistakes (Anti-Patterns)
| Mistake | Why It Fails | Fix |
|---|---|---|
| NARROW_SEARCH_SCOPE | Per-account search misses resources | Config Aggregator org-wide first |
| SINGLE_ACCOUNT_ASSUMPTION | Trusting task-provided account ID | Phase 1 org-wide discovery |
| BARE_METAL_TOOLS | Running terraform on host | enforce-container-first.sh |
| REBOOT_FIRST_DECOMMISSION_SECOND | Treating symptom not disease | Scream-test score before maintenance |
| EXISTENCE_WITHOUT_ACTIVITY | Finding resources without checking activity | Validate last-modified/flow-logs |
| PUSH_WITHOUT_LOCAL_CI | Pushing workflows without local validation | task ci:lint-workflows first |
Quick Reference: Command Cheat Sheet
# Discover (org-wide)
/inventory:discover
# Design + Test IaC
/terraform:test
/cdk:synth
# Validate cost + registry
/terraform:cost
/devcontainer:validate-registry
# Operate
/inventory:lz-cross-validate
/cloudops:weekly-cert-report
# Optimize
/finops:decommission-inventory
/metrics:update-dora
Last Updated: March 2026 | Status: Active | Maintenance: infrastructure-engineer