AI + Data + Cloud · Pillar 4

CloudOps & Infrastructure

Technology

127 CLI Commands (v1.3.17)

67 AWS accounts discovered in 2.67 seconds. 127 CLI commands via pip install runbooks==1.3.17. Terraform + CDK with 3-tier testing. Docker-first supply chain.

pip install runbooks==1.3.17 — org-wide inventory in <10 seconds
AI agents build governed & Humans ship trusted. 80% autonomy & 100% accountability.
Section Four (Ch.17-23)

Technology for Speed and Distributed Innovation

The objective for technology is to make it easy for your pods to constantly develop and release digital and AI innovations to customers and users. Seven broad capabilities are needed to build a technology environment that can support a digital transformation.

ADLC takes a more surgical, value-backed approach to cloud. Automating software development and deployment is fundamental to building and releasing high-quality software. ADLC delivers this through Docker-first enforcement (nnthanh101/* images only), a Local-First Hybrid-Cloud path (Docker/K3D -> AWS), and multi-account landing zones with READONLY-safe automation. To install, run uv add runbooks or pip install runbooks==1.3.17, then run any of the 127 CLI commands against READONLY profiles.

Platform Evolution

IaC generation improves with each Claude release — more accurate Terraform modules, better CDK constructs. NemoClaw adds kernel-level security validation for agent-generated infrastructure.

CloudOps & Infrastructure Golden Path

Each phase answers: Who does it, Why it matters, What if you skip it

1. Discover

Org-wide resource discovery via Config Aggregator — all accounts, one query

/inventory:discover
Who: infrastructure-engineer queries org-wide, HITL reviews inventory
Why: Org-wide before per-account prevents NARROW_SEARCH_SCOPE anti-pattern. Config Aggregator is P1; per-account search is P2 fallback only.
Skip? SINGLE_ACCOUNT_ASSUMPTION — missed resources in backup accounts; incomplete inventory for compliance audits
Asset register across 67+ accounts in <10 seconds
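
The org-wide query in the Discover phase can be sketched in Python. This is a minimal illustration under stated assumptions, not the runbooks implementation: it expects a client that behaves like boto3's AWS Config client (its select_aggregate_resource_config call), and the aggregator name and the query expression are hypothetical placeholders.

```python
import json
from collections import Counter
from typing import Any, Iterable, Iterator


def iter_aggregate_resources(
    config_client: Any, aggregator_name: str, expression: str
) -> Iterator[dict]:
    """Yield every row of an advanced query run against a Config aggregator.

    `config_client` is assumed to behave like boto3.client("config").
    """
    token = None
    while True:
        kwargs = {
            "Expression": expression,
            "ConfigurationAggregatorName": aggregator_name,
        }
        if token:
            kwargs["NextToken"] = token
        page = config_client.select_aggregate_resource_config(**kwargs)
        # Each result is a JSON-encoded string, one per matching resource.
        for raw in page.get("Results", []):
            yield json.loads(raw)
        token = page.get("NextToken")
        if not token:
            break


def count_by_account(rows: Iterable[dict]) -> Counter:
    """Count discovered resources per AWS account ID."""
    return Counter(row["accountId"] for row in rows)


# Hypothetical query: EC2 instances across all aggregated accounts.
EC2_QUERY = (
    "SELECT accountId, awsRegion, resourceId "
    "WHERE resourceType = 'AWS::EC2::Instance'"
)
```

With boto3 installed, the client would typically be created from a session that uses a READONLY profile, then passed in as config_client. Keeping the client as a parameter also makes the pagination logic easy to test with a stub.
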
2. Design

IaC architecture with CDK or Terraform — Docker-first supply chain

/terraform:test + /cdk:synth
Who: cloud-architect designs, infrastructure-engineer implements IaC modules
Why: IaC makes infrastructure reproducible. Docker-first enforcement (nnthanh101/* only) ensures supply chain integrity and air-gap readiness.
Skip? Snowflake servers, non-reproducible builds, supply chain vulnerabilities, BARE_METAL_TOOLS violations
Validated IaC modules with 3-tier testing (functional/integration/E2E)
3. Validate

3-tier testing with pre-deploy cost estimation

/terraform:cost + /devcontainer:validate-registry
Who: qa-engineer validates, infrastructure-engineer fixes findings
Why: Testing catches security issues before terraform plan. Cost estimation prevents surprise bills. Registry validation ensures supply chain compliance.
Skip? Unchecked deployments, surprise bills, compliance failures at audit time
Security scan clean + cost estimate within budget + registry compliant
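
The "cost estimate within budget" check can be illustrated with a small gate. This is a sketch only: it assumes the cost tool emits an Infracost-style JSON report with a top-level totalMonthlyCost field holding a decimal string, and the budget figure is a hypothetical input supplied by the reviewer.

```python
import json
from decimal import Decimal


def monthly_cost(report_json: str) -> Decimal:
    """Extract the total monthly cost from an Infracost-style JSON report.

    Assumption: the report has a top-level "totalMonthlyCost" field that is
    a decimal string (or null when nothing could be priced).
    """
    report = json.loads(report_json)
    return Decimal(report.get("totalMonthlyCost") or "0")


def within_budget(report_json: str, budget: Decimal) -> bool:
    """Return True when the estimated monthly cost does not exceed the budget."""
    return monthly_cost(report_json) <= budget
```

Decimal is used instead of float so that currency comparisons are not affected by binary rounding.
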
4. Deploy

GitOps deployment via ArgoCD or ECS with health checks

/kubernetes:deploy (HITL approves terraform apply)
Who: infrastructure-engineer + kubernetes-engineer prepare, HITL approves and commits
Why: Agents prepare, humans decide, humans commit. Principle I: no agent executes terraform apply or git push.
Skip? Manual deployments, no rollback plan, environment drift between dev/staging/prod
Zero-downtime deployment with automated rollback capability
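
Principle I (no agent executes terraform apply or git push) could be enforced with a simple command guard. The sketch below is hypothetical and is not the project's actual hook: the HUMAN_ONLY set and the requires_human helper are names introduced here for illustration.

```python
import shlex

# Commands reserved for a human, per Principle I in the text above.
HUMAN_ONLY = {
    ("terraform", "apply"),
    ("git", "push"),
}


def requires_human(command: str) -> bool:
    """Return True when a command must be run by a human, not an agent.

    Limitation of this sketch: global options that take a separate value
    (for example `git -C path push`) are not parsed.
    """
    tokens = shlex.split(command)
    if not tokens:
        return False
    # Skip global flags such as `terraform -chdir=infra apply`.
    words = [t for t in tokens[1:] if not t.startswith("-")]
    subcommand = words[0] if words else ""
    return (tokens[0], subcommand) in HUMAN_ONLY
```

An agent runtime could call requires_human before executing any shell command and hand the command to the HITL reviewer when it returns True.
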
5. Operate

Inventory cross-validation, health event triage, certificate monitoring

/inventory:lz-cross-validate + /cloudops:weekly-cert-report
Who: sre-engineer runs READONLY operations, HITL triages findings
Why: Continuous validation prevents drift. Health events triaged proactively. 176 ACM certs across 31 accounts monitored.
Skip? Invisible configuration drift, expired certificates causing outages, unmanaged resources
Validated asset register + cert expiry dashboard + health event triage
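
Certificate monitoring comes down to filtering certificates by expiry date. The following is a minimal sketch, assuming each certificate is a dict with DomainName and a timezone-aware NotAfter datetime (the shape of ACM certificate summaries); the 30-day default window is a hypothetical threshold, not a value from the source.

```python
from datetime import datetime, timedelta, timezone


def expiring_certs(certs, now=None, within_days=30):
    """Return certificates expiring within `within_days`, soonest first.

    Certificates that have already expired are included, since they are
    the most urgent findings.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    cutoff = now + timedelta(days=within_days)
    due = [c for c in certs if c["NotAfter"] <= cutoff]
    return sorted(due, key=lambda c: c["NotAfter"])
```

Passing now explicitly keeps the function deterministic, which helps when the same report is regenerated for an audit.
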
6. Optimize

Decommission unused resources, rightsize, track infrastructure DORA

/finops:decommission-inventory + /metrics:update-dora
Who: finops-engineer analyzes with READONLY profiles, HITL decides decommission actions
Why: Evidence-based decommission with scream-test scoring. REBOOT_FIRST_DECOMMISSION_SECOND anti-pattern eliminated.
Skip? Zombie resources persist, cloud spend grows unchecked, no infrastructure hygiene
Decommission candidates scored + DORA metrics updated + cost savings identified
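
Evidence-based decommission scoring can be pictured as a weighted sum of idle signals. The sketch below is illustrative only: the signal names, the weights, and the threshold of 70 are all hypothetical and are not taken from runbooks. The final decision stays with the HITL reviewer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdleSignals:
    scream_test_silent: bool  # nobody objected while the resource was stopped
    cloudwatch_idle: bool     # negligible CPU/network over the lookback window
    ssm_unreachable: bool     # no recent SSM agent heartbeat
    ebs_idle: bool            # no meaningful volume I/O


# Hypothetical weights that sum to 100.
WEIGHTS = {
    "scream_test_silent": 40,
    "cloudwatch_idle": 30,
    "ssm_unreachable": 15,
    "ebs_idle": 15,
}


def decommission_score(signals: IdleSignals) -> int:
    """Sum the weights of every signal that indicates the resource is unused."""
    return sum(w for name, w in WEIGHTS.items() if getattr(signals, name))


def recommend(signals: IdleSignals, threshold: int = 70) -> str:
    """Flag a resource for human review; never trigger removal directly."""
    if decommission_score(signals) >= threshold:
        return "candidate: HITL review"
    return "keep"
```

The function only labels candidates. No removal action is taken by the code, which matches the READONLY posture described for this phase.
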

Start Here

Spec-Driven workflow and product skills — copy/paste to start

Solo CloudOps Engineer
You manage a landing zone alone. Org-wide visibility in seconds, not hours.
1. /inventory:discover
2. /security:cert-inventory
3. /finops:decommission-inventory
Org-wide inventory in <10 seconds
Infrastructure Team Lead
Your team builds IaC for multi-account AWS. Need governed pipelines.
1. /terraform:test
2. /devcontainer:validate-registry
3. /metrics:update-dora
Governed IaC pipeline in 1 day
Enterprise Cloud Architect
You design landing zones for regulated industries. Need evidence for auditors.
1. /cdk:synth
2. /terraform:cost
3. /inventory:lz-cross-validate
Architecture validation with evidence in 1 hour

Component Map

12 components implementing this pillar

Type | Name | Why | Business Value
Agent | infrastructure-engineer (sonnet) | CDK + Terraform IaC for multi-account AWS landing zones | GitOps reproducibility: every resource declaratively defined
Agent | kubernetes-engineer (sonnet) | K3s + ArgoCD + Helm for containerised workloads | Platform engineering that scales from laptop to prod
Command | /terraform:test | 3-tier testing: functional, integration, E2E | Security issues caught before terraform plan
Command | /terraform:cost | Infracost pre-deploy cost estimation with FOCUS compliance | FinOps integrated into IaC review: no surprise bills
Command | /cdk:synth | CDK synthesis with cdk-nag security checks | APRA CPS 234 alignment verified at synth time
Command | /devcontainer:validate-registry | Scan all FROM and image: references for registry compliance | Supply chain integrity: blocked registries caught in PR
Command | /kubernetes:deploy | ArgoCD application sync with health checks | Zero-downtime deployments with automated rollback
Command | /inventory:lz-cross-validate | READONLY multi-account inventory cross-validation | Config Aggregator org-wide: 67 accounts in <10 seconds
Skill | aws-health-event-triage | EC2 health event investigation workflow | REBOOT_FIRST_DECOMMISSION_SECOND anti-pattern eliminated
Skill | terraform/deploy-lifecycle | Terraform module publish and deploy lifecycle patterns | Registry-to-production pipeline with 3-tier testing
Hook | enforce-container-first.sh | Block bare-metal tflint/checkov/terraform on host | Reproducible validation: same result on every machine
Hook | enforce-docker-registry.sh | Block non-compliant container registry references | SLSA Level 2+ provenance: only signed enterprise images
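
The registry check described for /devcontainer:validate-registry (scan all FROM and image: references) can be sketched as a line scanner. This is an assumption-laden illustration, not the actual hook: the regular expressions, the find_violations helper, and the treatment of nnthanh101/ as a plain string prefix are choices made here for the example.

```python
import re

ALLOWED_PREFIX = "nnthanh101/"

# Dockerfile: FROM [--platform=...] image [AS alias]
FROM_RE = re.compile(
    r"^\s*FROM\s+(?:--platform=\S+\s+)?(?P<image>\S+)(?:\s+AS\s+(?P<alias>\S+))?",
    re.IGNORECASE,
)
# Compose/Kubernetes YAML: image: name (optionally quoted or a list item)
IMAGE_RE = re.compile(r"^\s*(?:-\s*)?image:\s*[\"']?(?P<image>[^\s\"'#]+)")


def find_violations(text: str, allowed_prefix: str = ALLOWED_PREFIX):
    """Return (line_number, image) pairs that do not use the allowed prefix."""
    violations = []
    stage_aliases = set()
    for lineno, line in enumerate(text.splitlines(), start=1):
        from_match = FROM_RE.match(line)
        match = from_match or IMAGE_RE.match(line)
        if match is None:
            continue
        image = match.group("image")
        # `FROM build` refers to an earlier stage, not to a registry image.
        is_stage_ref = from_match is not None and image in stage_aliases
        if not is_stage_ref and not image.startswith(allowed_prefix):
            violations.append((lineno, image))
        if from_match is not None and from_match.group("alias"):
            stage_aliases.add(from_match.group("alias"))
    return violations
```

Running this in a pre-commit or PR check and failing when the returned list is non-empty would give the "blocked registries caught in PR" behaviour described in the table. Build arguments in FROM lines (for example FROM ${BASE}) are not resolved by this sketch and would be reported as violations.
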

Risk & Scalability

What happens without this pillar, and why ADLC scales from 1 person to enterprise

What if you skip?

Industry research identifies seven capabilities needed for technology environments: decoupled architecture, surgical cloud, engineering practices, developer productivity, production-grade solutions, security from the start, and MLOps. Without this pillar, infrastructure becomes the bottleneck that prevents all other pods from innovating.

Scalability

Docker-first enforcement and 3-tier IaC testing work identically on a laptop and in CI/CD. Config Aggregator discovers resources org-wide regardless of account count, so the infrastructure tooling scales with the cloud footprint.

The CloudOps Vizro Dashboard (available at /component-usage) surfaces three operational KPIs: tickets resolved per quarter (248, TBD pending audit), MTTR with agent assistance (7 seconds, source: runbooks v1.3.17 smoke tests), and manual toil reduction (~60%, TBD pending audit). Five Jupyter notebooks cover the most common CloudOps workflows: install runbooks, then open the relevant notebook.

Industry Relevance

ANZ enterprise verticals where this pillar is most critical

FSI
Multi-account landing zones with SCPs for regulatory data residency
Energy
SCADA/OT network isolation via Terraform VPC modules
Telecom
Edge computing K3s clusters for 5G MEC workloads
Aviation
Air-gapped environments with supply-chain-verified container images

Continuous Improvement Flywheel

Each pillar feeds the next — creating a self-reinforcing cycle of capability building

Pillar 4 feeds Pillar 5
CloudOps & Infrastructure → FinOps & Analytics

Infrastructure generates cost and usage data. FinOps transforms raw cloud spend into business intelligence.

Digital Products

Real products built and governed by this pillar


CloudOps Notebooks

Five Jupyter notebooks cover the most common CloudOps workflows. Run pip install runbooks==1.3.17, then open the relevant notebook.

💰 AWS Cost Anomaly Investigation
Correlate Cost Explorer alerts with service-level spend across all accounts
🗂️ Cross-account Resource Inventory
Config Aggregator org-wide query: 67 accounts, <10 seconds, CSV export
🔧 ITSM Incident Triage (8-step)
EC2 health event → CloudTrail → scream-test → decommission decision tree
🗑️ Decommissioning Workflow
Scream-test + CloudWatch + SSM + EBS signals scored before any removal action
🔍 VPC Flow Log Analysis (SOC2)
Inbound/outbound traffic baseline, anomaly export, auditor-ready CSV

CloudOps Vizro Dashboard

Python low-code operational dashboard built with Vizro (McKinsey open-source). Surfaces runbooks data as live KPIs — tickets resolved, MTTR, and toil reduction — without custom D3 or BI tooling.

Tickets Resolved / Quarter: 248 (TBD pending audit)
MTTR with Agent Assistance: 7s (source: runbooks v1.3.17 smoke tests)
Manual Toil Reduction: ~60% (TBD pending audit)