
CloudOps & Infrastructure Lifecycle

The technology objective is to make it easy for your pods to continuously develop and release digital and AI innovations.

AI agents build governed. Humans ship trusted. 80% autonomy, 100% accountability.

Golden Path: From Discovery to Optimized Infrastructure


Phase 1: Discover (10 seconds)

Who: infrastructure-engineer queries org-wide. HITL (human-in-the-loop) reviews the inventory.

What: Org-wide resource discovery via Config Aggregator — all accounts, one query.

Why: Querying org-wide before per-account prevents the NARROW_SEARCH_SCOPE anti-pattern. Config Aggregator is the P1 path.

What-if skip: SINGLE_ACCOUNT_ASSUMPTION — missed resources in backup accounts.

How

/inventory:discover

Output

  • Asset register across 67+ accounts
  • Resource counts by type and account
  • Discovery completed in under 10 seconds

Quality Gate: All accounts visible. Config Aggregator responsive.
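
As a rough illustration of how "resource counts by type and account" could be derived from an org-wide query, here is a minimal sketch. It assumes result rows shaped like AWS Config advanced-query results (JSON strings carrying accountId and resourceType). The function names and the aggregator name passed in are hypothetical; this is not the implementation behind /inventory:discover.

```python
import json
from collections import Counter


def summarize_inventory(rows):
    """Count resources per account and per type from JSON-string result rows."""
    by_account = Counter()
    by_type = Counter()
    for raw in rows:
        item = json.loads(raw)
        by_account[item["accountId"]] += 1
        by_type[item["resourceType"]] += 1
    return {
        "accounts": dict(by_account),
        "types": dict(by_type),
        "total": sum(by_account.values()),
    }


def fetch_rows(aggregator_name):
    """Page through one org-wide Config Aggregator query.

    Requires boto3 and AWS credentials; aggregator_name is whatever
    aggregator your organization has configured (an assumption here).
    """
    import boto3  # imported lazily so summarize_inventory works offline

    client = boto3.client("config")
    expression = "SELECT accountId, resourceType, resourceId"
    rows, token = [], None
    while True:
        kwargs = {
            "Expression": expression,
            "ConfigurationAggregatorName": aggregator_name,
        }
        if token:
            kwargs["NextToken"] = token
        page = client.select_aggregate_resource_config(**kwargs)
        rows.extend(page["Results"])
        token = page.get("NextToken")
        if not token:
            return rows
```

Keeping the summary step separate from the AWS call means the counting logic can be checked against saved query output without touching any account.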


Phase 2: Design (1-2 hours)

Who: cloud-architect designs. infrastructure-engineer implements IaC modules.

What: IaC architecture with CDK or Terraform. Docker-first supply chain enforcement.

Why: IaC makes infrastructure reproducible. Docker-first (nnthanh101/* only) ensures supply chain integrity.

What-if skip: Snowflake servers, BARE_METAL_TOOLS violations.

How

/terraform:test    # 3-tier testing: functional, integration, E2E
/cdk:synth         # CDK synthesis with cdk-nag security checks

Output

  • Validated IaC modules (Terraform HCL or CDK TypeScript)
  • 3-tier test results (functional/integration/E2E)
  • Architecture decision records (ADRs)

Quality Gate: All tests pass. Registry compliance verified.
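
The "nnthanh101/* only" rule can be pictured as a prefix check over image references. The sketch below is an illustration under assumptions: the function names, the tolerated Docker Hub host prefixes, and the example image names are hypothetical, and it does not reproduce what enforce-docker-registry.sh or /devcontainer:validate-registry actually check.

```python
ALLOWED_PREFIX = "nnthanh101/"  # namespace stated in this document


def is_compliant(image):
    """Return True when the image reference sits under the allowed namespace."""
    name = image.strip()
    # Assumption: an explicit Docker Hub host prefix is equivalent to none.
    for host in ("docker.io/", "index.docker.io/"):
        if name.startswith(host):
            name = name[len(host):]
            break
    return name.startswith(ALLOWED_PREFIX)


def compliance_score(images):
    """Percentage of image references that are compliant (100.0 for no images)."""
    if not images:
        return 100.0
    compliant = sum(1 for image in images if is_compliant(image))
    return 100.0 * compliant / len(images)
```

Including the trailing slash in the prefix matters: it keeps a look-alike namespace such as nnthanh101-mirror/ from passing the check.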


Phase 3: Validate (30 min)

Who: qa-engineer validates. infrastructure-engineer fixes findings.

What: Pre-deploy cost estimation + security scanning + registry compliance.

Why: Scanning catches security issues before terraform plan is run. Cost estimation prevents surprise bills.

What-if skip: Unchecked deployments, surprise bills, compliance failures.

How

/terraform:cost                  # Infracost pre-deploy estimation
/devcontainer:validate-registry  # Docker registry compliance

Output

  • Cost estimate within budget constraints
  • Security scan clean (checkov + tfsec + trivy)
  • Registry compliance score (nnthanh101/* only)

Quality Gate: Cost within budget. Zero CRITICAL/HIGH. Registry 100% compliant.
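
The three gate conditions above can be expressed as one pass/fail check. This is a sketch only: the ValidationResult fields and the evaluate_gate name are hypothetical, and the inputs are assumed to have been extracted already from the cost estimate, the scanners, and the registry check.

```python
from dataclasses import dataclass, field


@dataclass
class ValidationResult:
    monthly_cost: float                           # estimated cost, any currency
    budget: float                                 # budget in the same currency
    findings: dict = field(default_factory=dict)  # severity -> finding count
    registry_score: float = 100.0                 # percent of compliant images


def evaluate_gate(result):
    """Return a list of failure reasons; an empty list means the gate passes."""
    failures = []
    if result.monthly_cost > result.budget:
        failures.append("cost exceeds budget")
    blocking = sum(result.findings.get(sev, 0) for sev in ("CRITICAL", "HIGH"))
    if blocking > 0:
        failures.append(f"{blocking} CRITICAL/HIGH finding(s)")
    if result.registry_score < 100.0:
        failures.append("registry compliance below 100%")
    return failures
```

Returning reasons rather than a bare boolean gives the engineer fixing findings a concrete list to work through.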


Phase 4: Deploy (HITL gate)

Who: infrastructure-engineer + kubernetes-engineer prepare. HITL approves and commits.

What: GitOps deployment via ArgoCD or ECS with health checks.

Why: Agents prepare, humans decide, humans commit. Principle I: no agent executes terraform apply.

What-if skip: Manual deployments, no rollback, environment drift.

How

/kubernetes:deploy             # ArgoCD application sync (agents prepare)
# HITL runs: terraform apply   # Human executes after reviewing plan

Output

  • Zero-downtime deployment with automated rollback
  • Health checks passing across all services
  • Deployment evidence in tmp/

Quality Gate: HITL reviews terraform plan. Health checks green.
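
One way to picture the "no agent executes terraform apply" principle is a guard that classifies a command line before an agent is allowed to run it. The sketch below is hypothetical: the requires_human name is invented for illustration, and treating destroy as human-only is an added assumption, since the document names only apply.

```python
import shlex

# "apply" comes from this document; "destroy" is an added assumption.
HUMAN_ONLY_SUBCOMMANDS = {"apply", "destroy"}


def requires_human(command):
    """Return True when the command is a terraform call reserved for HITL."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] != "terraform":
        return False
    # The first non-flag token after "terraform" is the subcommand;
    # this skips global options such as -chdir=infra.
    for token in tokens[1:]:
        if not token.startswith("-"):
            return token in HUMAN_ONLY_SUBCOMMANDS
    return False
```

A guard like this only inspects direct terraform invocations; wrappers and shell pipelines would need their own handling.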


Phase 5: Operate (ongoing)

Who: sre-automation-specialist runs READONLY operations. HITL triages findings.

What: Inventory cross-validation, health event triage, certificate monitoring.

Why: Continuous validation prevents drift. Health events triaged proactively.

What-if skip: Invisible drift, expired certificates, unmanaged resources.

How

/inventory:lz-cross-validate     # 4-way cross-validation pipeline
/cloudops:weekly-cert-report     # Certificate expiry monitoring

Output

  • Cross-validated asset register with accuracy deltas
  • Certificate expiry dashboard (30/60/90 day triage)
  • Health event investigation evidence

Quality Gate: Cross-validation accuracy >=99.5%. Zero expired-in-use certs.
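
The 30/60/90 day triage can be sketched as a bucketing function over expiry dates. The bucket labels and the function name are hypothetical, and how /cloudops:weekly-cert-report actually groups certificates is not specified here.

```python
from datetime import date


def triage_bucket(expiry, today):
    """Place a certificate in an expiry bucket relative to today."""
    days_left = (expiry - today).days
    if days_left < 0:
        return "expired"
    for limit in (30, 60, 90):
        if days_left <= limit:
            return f"<={limit}d"
    return "ok"


def triage(certs, today):
    """Group certificate names by bucket; certs maps name -> expiry date."""
    buckets = {}
    for name, expiry in certs.items():
        buckets.setdefault(triage_bucket(expiry, today), []).append(name)
    return buckets
```

Passing today in explicitly, rather than calling date.today() inside, keeps the report reproducible for a given run date.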


Phase 6: Optimize (per sprint)

Who: gitops-cost-optimizer analyzes. HITL decides decommission actions.

What: Decommission unused resources, rightsize, track infrastructure DORA metrics.

Why: Evidence-based decommission. REBOOT_FIRST_DECOMMISSION_SECOND anti-pattern eliminated.

What-if skip: Zombie resources persist, cloud spend grows unchecked.

How

/finops:decommission-inventory   # Scream-test scored decommission candidates
/metrics:update-dora             # Infrastructure DORA metrics

Output

  • Decommission candidates with S1-S5 scream-test scores
  • DORA metrics updated (deploy frequency, lead time, CFR, MTTR)
  • Cost savings identified and attributed

Quality Gate: Candidates scoring >=70 are flagged for a scream test. DORA metrics visible.
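
The four DORA metrics listed in the output can be computed from a list of deployment records. This is a minimal sketch using the standard definitions; the Deployment fields and the dora_metrics name are hypothetical, and the data source used by /metrics:update-dora is not described in this document.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set once a failed deploy is fixed


def dora_metrics(deployments, window_days):
    """Compute deploy frequency, lead time, CFR, and MTTR over a time window."""
    count = len(deployments)
    if count == 0:
        return None
    lead_total = sum(
        (d.deployed_at - d.committed_at for d in deployments), timedelta()
    )
    failures = [d for d in deployments if d.failed]
    restored = [d for d in failures if d.restored_at is not None]
    restore_total = sum(
        (d.restored_at - d.deployed_at for d in restored), timedelta()
    )
    return {
        "deploy_frequency_per_day": count / window_days,
        "lead_time": lead_total / count,
        "change_failure_rate": len(failures) / count,
        # MTTR is undefined until at least one failure has been restored.
        "mttr": restore_total / len(restored) if restored else None,
    }
```

Here MTTR is measured from the failed deployment to its restoration, which is one common convention; teams that measure from incident detection would substitute that timestamp.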


LEAN/5S Applied to CloudOps

| Principle | Application | Evidence |
|---|---|---|
| Sort | Config Aggregator replaces per-account search | /inventory:discover under 10s |
| Set in Order | 3-tier IaC testing (functional/integration/E2E) | /terraform:test pipeline |
| Shine | Docker-first enforcement: no bare-metal tools | enforce-container-first.sh |
| Standardize | nnthanh101/* registry only: supply chain verified | enforce-docker-registry.sh |
| Sustain | DORA metrics track infrastructure velocity | /metrics:update-dora |

By Persona

Solo CloudOps Engineer

Path: /inventory:discover → /security:cert-inventory → /finops:decommission-inventory

Time to Value: Org-wide inventory in under 10 seconds.

Infrastructure Team Lead

Path: /terraform:test → /devcontainer:validate-registry → /metrics:update-dora

Time to Value: Governed IaC pipeline in 1 day.

Enterprise Cloud Architect

Path: /cdk:synth → /terraform:cost → /inventory:lz-cross-validate

Time to Value: Architecture validation with evidence in 1 hour.


Common Mistakes (Anti-Patterns)

| Mistake | Why It Fails | Fix |
|---|---|---|
| NARROW_SEARCH_SCOPE | Per-account search misses resources | Config Aggregator org-wide first |
| SINGLE_ACCOUNT_ASSUMPTION | Trusting task-provided account ID | Phase 0 org-wide discovery |
| BARE_METAL_TOOLS | Running terraform on host | enforce-container-first.sh |
| REBOOT_FIRST_DECOMMISSION_SECOND | Treating symptom not disease | Scream-test score before maintenance |
| EXISTENCE_WITHOUT_ACTIVITY | Finding resources without checking activity | Validate last-modified/flow-logs |
| PUSH_WITHOUT_LOCAL_CI | Pushing workflows without local validation | task ci:lint-workflows first |

Quick Reference: Command Cheat Sheet

# Discover (org-wide)
/inventory:discover

# Design + Test IaC
/terraform:test
/cdk:synth

# Validate cost + registry
/terraform:cost
/devcontainer:validate-registry

# Operate
/inventory:lz-cross-validate
/cloudops:weekly-cert-report

# Optimize
/finops:decommission-inventory
/metrics:update-dora

Last Updated: March 2026 | Status: Active | Maintenance: infrastructure-engineer