CloudOps & Infrastructure Lifecycle

The technology objective is to make it easy for your pods to continuously develop and release digital and AI innovations.

AI agents build governed. Humans ship trusted. 80% autonomy, 100% accountability.

This lifecycle follows the 5-stage canonical model per ADR-020. The previous 6-phase Infra model (Discover/Design/Validate/Deploy/Operate/Optimize) maps onto the five canonical stages as follows: Discover→Discover, Design→Design, Validate→Build (pre-deploy test sub-step), Deploy→Deploy, Operate→Support & Scale, Optimize→Support & Scale.
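The mapping can also be written as a lookup table, which is handy when migrating older runbooks or dashboards that still use the 6-phase names. This is a minimal sketch: the names `LEGACY_TO_CANONICAL`, `CANONICAL_STAGES`, and `to_canonical` are illustrative and not part of any documented tooling.

```python
# Mapping from the previous 6-phase Infra model to the 5-stage canonical model,
# transcribed from the paragraph above. All identifiers here are illustrative.
LEGACY_TO_CANONICAL = {
    "Discover": "Discover",
    "Design": "Design",
    "Validate": "Build",  # pre-deploy test sub-step
    "Deploy": "Deploy",
    "Operate": "Support & Scale",
    "Optimize": "Support & Scale",
}

CANONICAL_STAGES = ["Discover", "Design", "Build", "Deploy", "Support & Scale"]


def to_canonical(legacy_phase: str) -> str:
    """Return the canonical stage for a legacy phase name, or raise on unknown input."""
    try:
        return LEGACY_TO_CANONICAL[legacy_phase]
    except KeyError:
        raise ValueError(f"unknown legacy phase: {legacy_phase!r}") from None
```

Note that the mapping is many-to-one: both Operate and Optimize land in Support & Scale, so it cannot be inverted without extra context.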

Golden Path: From Discovery to Optimized Infrastructure


Stage 1: Discover (10 seconds)

Who: infrastructure-engineer queries org-wide. HITL reviews inventory.

What: Org-wide resource discovery via Config Aggregator — all accounts, one query.

Why: Searching org-wide before drilling into individual accounts prevents the NARROW_SEARCH_SCOPE anti-pattern. Config Aggregator is the P1 path.

What-if skip: SINGLE_ACCOUNT_ASSUMPTION — missed resources in backup accounts.

How

/inventory:discover

Output

  • Asset register across 67+ accounts
  • Resource counts by type and account
  • Discovery completed in under 10 seconds

Quality Gate: All accounts visible. Config Aggregator responsive.
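To illustrate what a single org-wide query could look like outside the `/inventory:discover` command, here is a hedged sketch built on the AWS Config `select_aggregate_resource_config` API. It is not the command's actual implementation. The function name `count_resources` and the aggregator name in the usage comment are assumptions; the client is passed in so the logic can be exercised without AWS credentials.

```python
import json
from collections import Counter


def count_resources(config_client, aggregator_name: str) -> Counter:
    """Count resources by (accountId, resourceType) with one aggregate query.

    config_client is expected to behave like boto3.client("config").
    aggregator_name is the name of your Config Aggregator (an assumption here).
    """
    counts: Counter = Counter()
    kwargs = {
        "Expression": "SELECT accountId, resourceType, resourceId",
        "ConfigurationAggregatorName": aggregator_name,
    }
    while True:
        page = config_client.select_aggregate_resource_config(**kwargs)
        for raw in page.get("Results", []):
            item = json.loads(raw)  # each result is a JSON-encoded string
            counts[(item["accountId"], item["resourceType"])] += 1
        token = page.get("NextToken")
        if not token:
            return counts
        kwargs["NextToken"] = token  # follow pagination until exhausted


# Usage (requires boto3 and AWS credentials; the aggregator name is hypothetical):
#   import boto3
#   counts = count_resources(boto3.client("config"), "my-org-aggregator")
```

Grouping by account as well as type makes the "All accounts visible" gate easy to check: any expected account ID that is missing from the keys is a visibility gap.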


Stage 2: Design (1-2 hours)

Who: cloud-architect designs. infrastructure-engineer implements IaC modules.

What: IaC architecture with CDK or Terraform. Docker-first supply chain enforcement.

Why: IaC makes infrastructure reproducible. Docker-first (nnthanh101/* only) ensures supply chain integrity.

What-if skip: Snowflake servers, BARE_METAL_TOOLS violations.

How

/terraform:test    # 3-tier testing: functional, integration, E2E
/cdk:synth         # CDK synthesis with cdk-nag security checks

Output

  • Validated IaC modules (Terraform HCL or CDK TypeScript)
  • 3-tier test results (functional/integration/E2E)
  • Architecture decision records (ADRs)

Quality Gate: All tests pass. Registry compliance verified.
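The registry rule (nnthanh101/* only) can be sketched as a small allowlist check over image references. This is an illustration, not the logic of `/devcontainer:validate-registry` or `enforce-docker-registry.sh`. It assumes `nnthanh101` is a Docker Hub namespace and that images on any other registry host are rejected even if the namespace matches. The function names are hypothetical.

```python
def is_allowed_image(image_ref: str, namespace: str = "nnthanh101") -> bool:
    """Return True if image_ref is namespace/repo[:tag][@digest] on Docker Hub."""
    name = image_ref.split("@", 1)[0]        # drop an optional digest
    parts = name.split("/")
    parts[-1] = parts[-1].split(":", 1)[0]   # drop an optional tag
    if len(parts) == 3 and parts[0] in ("docker.io", "index.docker.io"):
        parts = parts[1:]                    # explicit Docker Hub host
    return len(parts) == 2 and parts[0] == namespace and parts[1] != ""


def compliance_score(image_refs: list) -> float:
    """Percentage of references that pass the allowlist (100.0 for an empty list)."""
    if not image_refs:
        return 100.0
    allowed = sum(1 for ref in image_refs if is_allowed_image(ref))
    return 100.0 * allowed / len(image_refs)
```

For example, `nnthanh101/terraform:1.9` passes, while `hashicorp/terraform:latest`, `ubuntu`, and `ghcr.io/nnthanh101/runner` do not under these assumptions.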


Stage 3: Build (IaC authoring + pre-deploy validation)

Who: qa-engineer validates IaC. infrastructure-engineer fixes findings and writes tests.

What: IaC tests (terraform test + checkov + tfsec). Pre-deploy cost estimation + security scanning + registry compliance. Code review sub-step [A-gated].

Why: Testing catches security issues before terraform plan. Cost estimation prevents surprise bills. Registry compliance prevents supply-chain violations.

What-if skip: Unchecked deployments, surprise bills, compliance failures, BARE_METAL_TOOLS violations.

How

/terraform:test                    # 3-tier testing: functional, integration, E2E
/terraform:cost                    # Infracost pre-deploy estimation
/devcontainer:validate-registry    # Docker registry compliance

Output

  • 3-tier test results (functional/integration/E2E)
  • Cost estimate within budget constraints
  • Security scan clean (checkov + tfsec + trivy)
  • Registry compliance score (nnthanh101/* only)

Quality Gate: All tests pass. Cost within budget. Zero CRITICAL/HIGH. Registry 100% compliant.
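The four conditions in this quality gate combine naturally into a single pass/fail check. The sketch below assumes the evidence has already been collected from the test, cost, scan, and registry steps. The `BuildEvidence` fields and the `build_gate` function are hypothetical names, not outputs of the commands above.

```python
from dataclasses import dataclass, field


@dataclass
class BuildEvidence:
    """Hypothetical summary of Stage 3 outputs; field names are assumptions."""
    tests_failed: int
    cost_estimate: float            # same currency and period as budget
    budget: float
    findings_by_severity: dict = field(default_factory=dict)  # e.g. {"HIGH": 1}
    registry_compliance_pct: float = 100.0


def build_gate(e: BuildEvidence) -> list:
    """Return the list of gate failures; an empty list means the gate passes."""
    failures = []
    if e.tests_failed > 0:
        failures.append(f"{e.tests_failed} test(s) failed")
    if e.cost_estimate > e.budget:
        failures.append(f"cost {e.cost_estimate} exceeds budget {e.budget}")
    blocking = sum(e.findings_by_severity.get(s, 0) for s in ("CRITICAL", "HIGH"))
    if blocking > 0:
        failures.append(f"{blocking} CRITICAL/HIGH finding(s)")
    if e.registry_compliance_pct < 100.0:
        failures.append(f"registry compliance {e.registry_compliance_pct}% < 100%")
    return failures
```

Returning every failure rather than stopping at the first gives the reviewer the full picture in one pass, which suits the code review sub-step.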


Stage 4: Deploy (HITL gate)

Who: infrastructure-engineer + kubernetes-engineer prepare. HITL approves and commits.

What: GitOps deployment via ArgoCD or ECS with health checks.

Why: Agents prepare, humans decide, humans commit. Principle I: no agent executes terraform apply.

What-if skip: Manual deployments, no rollback, environment drift.

How

/kubernetes:deploy              # ArgoCD application sync (agents prepare)
# HITL runs: terraform apply    # Human executes after reviewing plan

Output

  • Zero-downtime deployment with automated rollback
  • Health checks passing across all services
  • Deployment evidence in tmp/

Quality Gate: HITL reviews terraform plan. Health checks green.
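One way to picture the "no agent executes terraform apply" rule is a guard that inspects a command line before it runs. This is only a sketch of the idea, not how the HITL gate is actually enforced. The `actor_kind` label ("agent" or "human") and both function names are assumptions.

```python
import os
import shlex


def is_terraform_apply(command: str) -> bool:
    """Return True if the command line directly invokes `terraform apply`."""
    tokens = shlex.split(command)
    if not tokens or os.path.basename(tokens[0]) != "terraform":
        return False
    # Global options such as -chdir=DIR may precede the subcommand.
    subcommands = [t for t in tokens[1:] if not t.startswith("-")]
    return bool(subcommands) and subcommands[0] == "apply"


def authorize(actor_kind: str, command: str) -> None:
    """Raise PermissionError when a non-human actor tries to run terraform apply.

    actor_kind is a hypothetical label supplied by the caller.
    """
    if actor_kind != "human" and is_terraform_apply(command):
        raise PermissionError("terraform apply is reserved for the HITL approver")
```

A string check like this does not catch indirect invocations (for example a shell wrapper that calls terraform), so it complements rather than replaces credential-level controls.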


Stage 5: Support & Scale (ongoing)

Who: sre-engineer runs READONLY operations (monitor sub-step [A-readonly]). finops-engineer analyzes costs. HITL triages findings and approves remediations ([HITL-decide]).

What: Inventory cross-validation, health event triage, certificate monitoring, decommission scoring, DORA metrics, cost optimization.

Why: Continuous validation prevents drift. Evidence-based decommission eliminates zombie resources. DORA tracks infrastructure velocity.

What-if skip: Invisible drift, expired certificates, zombie resources, unchecked cloud spend.

Operate sub-step (monitor [A-readonly])

/inventory:lz-cross-validate     # 4-way cross-validation pipeline
/cloudops:weekly-cert-report     # Certificate expiry monitoring

Output: Cross-validated asset register with accuracy deltas. Certificate expiry dashboard (30/60/90 day triage).

Quality Gate: Cross-validation accuracy >=99.5%. Zero expired-in-use certs.
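The 30/60/90 day triage and the "zero expired-in-use certs" gate can be sketched as follows. This assumes certificates are available as records with an expiry timestamp and an in-use flag. The `Cert` type, the bucket labels, and the function names are illustrative and do not reflect the output format of `/cloudops:weekly-cert-report`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Cert:
    """Hypothetical certificate record."""
    name: str
    not_after: datetime  # expiry timestamp, timezone-aware
    in_use: bool


def triage_bucket(cert: Cert, now: datetime) -> str:
    """Classify a certificate as expired, 30-day, 60-day, 90-day, or ok."""
    if cert.not_after <= now:
        return "expired"
    remaining = cert.not_after - now
    for limit in (30, 60, 90):
        if remaining <= timedelta(days=limit):
            return f"{limit}-day"
    return "ok"


def expired_in_use(certs: list, now: datetime) -> list:
    """Names of certificates that are expired and still in use (gate requires none)."""
    return [c.name for c in certs if c.in_use and triage_bucket(c, now) == "expired"]


# Usage: pass datetime.now(timezone.utc) as `now` for a live report.
```

Passing `now` explicitly keeps the triage deterministic, so the same report can be regenerated for a past date when reviewing evidence.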

Optimize sub-step (remediate [HITL-decide])

/finops:decommission-inventory   # Scream-test scored decommission candidates
/metrics:update-dora             # Infrastructure DORA metrics

Output: Decommission candidates with S1-S5 scream-test scores. DORA metrics updated. Cost savings identified.

Quality Gate: Score >=70 flagged for scream test. DORA visible. HITL approves decommission actions.
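As a rough illustration of the ">=70 flagged for scream test" rule, the sketch below treats S1-S5 as five boolean signals with weights summing to 100. That interpretation, the equal weights, and all names are assumptions made for the example; the real scoring behind `/finops:decommission-inventory` may differ. Only the threshold of 70 comes from the quality gate above.

```python
# Hypothetical weights: five signals, 20 points each. Not the documented scoring.
SIGNAL_WEIGHTS = {"S1": 20, "S2": 20, "S3": 20, "S4": 20, "S5": 20}
SCREAM_TEST_THRESHOLD = 70  # from the quality gate


def decommission_score(signals: dict) -> int:
    """Sum the weights of the signals that fired for one resource."""
    return sum(w for name, w in SIGNAL_WEIGHTS.items() if signals.get(name, False))


def flag_for_scream_test(candidates: dict) -> list:
    """Return resource IDs whose score meets the threshold, sorted for stable output.

    candidates maps a resource ID to its {signal name: bool} dict.
    """
    return sorted(
        rid for rid, signals in candidates.items()
        if decommission_score(signals) >= SCREAM_TEST_THRESHOLD
    )
```

The function only flags candidates. Consistent with the gate, the decision to run a scream test or decommission anything stays with the HITL approver.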


LEAN/5S Applied to CloudOps

| Principle | Application | Evidence |
|---|---|---|
| Sort | Config Aggregator replaces per-account search | /inventory:discover under 10s |
| Set in Order | 3-tier IaC testing (functional/integration/E2E) | /terraform:test pipeline |
| Shine | Docker-first enforcement: no bare-metal tools | enforce-container-first.sh |
| Standardize | nnthanh101/* registry only: supply chain verified | enforce-docker-registry.sh |
| Sustain | DORA metrics track infrastructure velocity | /metrics:update-dora |

By Persona

Solo CloudOps Engineer

Path: /inventory:discover → /security:cert-inventory → /finops:decommission-inventory

Time to Value: Org-wide inventory in under 10 seconds.

Infrastructure Team Lead

Path: /terraform:test → /devcontainer:validate-registry → /metrics:update-dora

Time to Value: Governed IaC pipeline in 1 day.

Enterprise Cloud Architect

Path: /cdk:synth → /terraform:cost → /inventory:lz-cross-validate

Time to Value: Architecture validation with evidence in 1 hour.


Common Mistakes (Anti-Patterns)

| Mistake | Why It Fails | Fix |
|---|---|---|
| NARROW_SEARCH_SCOPE | Per-account search misses resources | Config Aggregator org-wide first |
| SINGLE_ACCOUNT_ASSUMPTION | Trusting task-provided account ID | Phase 0 org-wide discovery |
| BARE_METAL_TOOLS | Running terraform on host | enforce-container-first.sh |
| REBOOT_FIRST_DECOMMISSION_SECOND | Treating symptom not disease | Scream-test score before maintenance |
| EXISTENCE_WITHOUT_ACTIVITY | Finding resources without checking activity | Validate last-modified/flow-logs |
| PUSH_WITHOUT_LOCAL_CI | Pushing workflows without local validation | task ci:lint-workflows first |

Quick Reference: Command Cheat Sheet

# Discover (org-wide)
/inventory:discover

# Design + Test IaC
/terraform:test
/cdk:synth

# Validate cost + registry
/terraform:cost
/devcontainer:validate-registry

# Operate
/inventory:lz-cross-validate
/cloudops:weekly-cert-report

# Optimize
/finops:decommission-inventory
/metrics:update-dora

Agent Team

| Agent | Role in This Path | Stage | Talent Bench |
|---|---|---|---|
| infrastructure-engineer | CDK/Terraform code generation + IaC execution | Build/Deploy | Profile |
| kubernetes-engineer | K3s/K8s cluster lifecycle + container orchestration | Deploy/Support & Scale | Profile |
| devops-security-engineer | IaC security scanning (checkov, tfsec) + hardening | Build | Profile |
| cloud-architect | Multi-cloud architecture design + cost progression validation | Design/Support & Scale | Profile |
| qa-automation-engineer | 3-tier testing (snapshot→LocalStack→AWS) + regression validation | Build/Deploy | Profile |

7 Skills Coverage

| Skill | Coverage in This Path | Implementation |
|---|---|---|
| S1 System Design | CDK→Terraform→K8s progression from design to production | IaC architecture, module patterns, cloud-agnostic abstraction |
| S2 Tool Design | Terraform schemas + CDK construct validation + kubectl manifest schemas | Tool integration, input validation, error messages |
| S3 Retrieval | AWS/Azure provider docs, CloudFormation resource docs via context7 | Documentation lookups, API contract discovery, schema reference |
| S4 Reliability | terraform state locking + retry config on API calls + dry-run validation | State consistency, mutation safety, preview before apply |
| S5 Security | checkov/tfsec security scanning (SAST), docker-registry validation (supply chain), RBAC enforcement | Shift-left security, registry allowlist, access control |
| S6 Evaluation | 3-tier testing: snapshot tests (fast), LocalStack (integration), real AWS (E2E) | Progressive validation, early failure detection, production confidence |
| S7 Product Thinking | Cost progression visualization: LOCAL ($0) → DEV ($20) → STAGING ($80) → PROD ($180), team self-service infrastructure as products | Cost awareness, team autonomy, business impact (TTM reduction) |

Last Updated: March 2026 | Status: Active | Maintenance: infrastructure-engineer