Enterprise-Grade Testing & Quality Roadmap
Test early, test always, test automatically — fixing defects at test time is ~10x cheaper than in production.
This page covers the testing philosophy, toolchain, and maturity roadmap that underpins every ADLC project. It answers two questions for enterprise buyers: which standards we are targeting, and why they matter commercially.
Enterprise Standards Checklist (2026-2030)
| Domain | Standard | Current State | Target State | When |
|---|---|---|---|---|
| Testing | Coverage ≥80%, 3-tier progressive | 59% (baseline) | ≥80% | 2026 |
| Testing | TDD discipline — red/green/refactor | Introduced S1 | Sprint default | 2026 |
| Testing | BDD Given/When/Then docstrings | Introduced S1 | All public APIs | 2026 |
| Security | SAST — bandit static analysis | CI gate live | 0 HIGH/CRITICAL | 2026 |
| Security | SCA — pip-audit dependency scan | CI gate live | Weekly scheduled | 2026 |
| Security | SBOM generation (cyclonedx) | Planned | Every release | 2026-Q3 |
| Security | Secret scanning (truffleHog) | Planned | Pre-commit + CI | 2026-Q3 |
| CI/CD | Docker-first act local validation | Adopted | All workflows | 2026 |
| CI/CD | GitOps — branch protection + signed tags | Partial | Main branch gated | 2026 |
| CI/CD | Pinned SHA digests for all actions | Planned | All workflows | 2026-Q3 |
| Observability | MELT telemetry (Metrics/Events/Logs/Traces) | Planned | Per SLO | 2026-Q4 |
| Observability | SLO definitions with error budgets | Planned | Per service | 2027 |
| Observability | DORA metrics — all 4 tracked | DORA captured | Automated | 2026 |
| Compliance | APRA CPS 234 (FSI) | Documented | Auditable | 2027 |
| Compliance | SOC 2 Type II controls | Documented | Certified | 2027-Q3 |
| Supply Chain | Pinned dependencies with hashes | Partial (uv.lock) | All projects | 2026 |
| Supply Chain | Wolfi/distroless base containers | Adopted for E2E | All containers | 2026-Q3 |
| Supply Chain | SLSA Level 2+ (provenance attestation) | Planned | Release pipeline | 2027 |
| Supply Chain | Reproducible builds | Planned | SLSA L3 | 2028 |
pytest: 33/33 PASS, 59.27% coverage (as of 2026-03-17). CI gates: 16 active quality gates. Hook tests: 190+ cases, 9 suites PASS.
TDD Business Value
Test-Driven Development is not a testing technique — it is a design technique that produces testable software as a by-product.
TDD Value Table
| Dimension | Value | Evidence |
|---|---|---|
| Defect Prevention | Catch bugs before release | 5 source-code bugs found during TDD red phase (S1) |
| Regression Safety | Tests as a safety net for refactoring | 653 tests → 5,900+ tests this sprint |
| Refactoring Confidence | BDD tests as living specification | Given/When/Then docstrings verified against behaviour |
| Release Velocity | CI gates block broken code from merging | 16 quality gates — zero broken merges to main |
| Cost Reduction | 10x cheaper to fix at test time | vs production incidents (IBM Systems Sciences Institute) |
| Design Clarity | Forces modular, injectable interfaces | Click CLI commands testable in isolation |
TDD Cycle Applied to ADLC
1. RED — Write a failing test that specifies the behaviour
2. GREEN — Write the minimum code to make the test pass
3. REFACTOR — Clean up while keeping tests green
4. REPEAT
```python
# Example: TDD for a CloudOps CLI command

# Step 1: RED — test first
def test_list_ec2_instances_returns_table(runner, mock_boto3):
    """Given valid AWS credentials, when listing EC2, then render a Rich table."""
    result = runner.invoke(cli, ["ec2", "list"])
    assert result.exit_code == 0
    assert "Instance ID" in result.output

# Step 2: GREEN — minimal implementation to pass
# Step 3: REFACTOR — extract table rendering to rich_utils.py
```
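A hypothetical GREEN-step implementation might look like the sketch below. Command and module names are illustrative, and `click.echo` stands in for the Rich table until the REFACTOR step extracts a real renderer:

```python
import click

@click.group()
def cli():
    """Hypothetical CloudOps CLI root (names are illustrative)."""

@cli.group()
def ec2():
    """EC2 subcommands."""

@ec2.command("list")
def list_instances():
    # GREEN: the minimum output that makes the RED test pass.
    # Rendering a real Rich table is deferred to the REFACTOR step.
    click.echo("Instance ID")
```

The RED test passes against this stub; REFACTOR then swaps `click.echo` for a proper `rich.table.Table` without touching the test.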
BDD Business Value
Behaviour-Driven Development elevates tests from implementation checks to business-readable contracts.
BDD Value Table
| Dimension | Value | Application |
|---|---|---|
| Living Documentation | Tests ARE the API specification | Test function names and docstrings are the spec |
| Stakeholder Alignment | Business users can read and verify behaviour | Given/When/Then readable without Python knowledge |
| API Contract Testing | Locks public API shape against regression | Click CLI --help output and parameter names are contract |
| Rich CLI Docs | pdoc auto-generates HTML from docstrings | task docs:api → browseable API reference |
| Acceptance Criteria Traceability | User story ACs map to test functions | Each AC reference tagged in test docstring |
BDD Pattern in ADLC Projects
```python
def test_cost_explorer_returns_monthly_summary(runner, mock_mcp):
    """
    Given: A valid AWS profile with Cost Explorer access
    When: The user runs `runbooks cost monthly --profile dev`
    Then: A table showing service-level costs is rendered
    And: Total cost is displayed in the summary row
    And: Exit code is 0
    """
    result = runner.invoke(cli, ["cost", "monthly", "--profile", "dev"])
    assert result.exit_code == 0
    assert "Amazon EC2" in result.output
    assert "Total" in result.output
```
When a BDD test fails, the failure message names the broken behaviour in business terms — not just a line number. This makes triage faster for both engineers and HITL managers.
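The `runner` fixture assumed by these examples is typically wired in `conftest.py`. A minimal sketch (the `mock_mcp` / `mock_boto3` fixtures are project-specific and omitted here):

```python
# Hypothetical conftest.py fragment; only the CLI runner fixture is shown.
import pytest
from click.testing import CliRunner

@pytest.fixture
def runner():
    """Fresh, isolated Click runner per test, so output never leaks between cases."""
    return CliRunner()
```

Click's `CliRunner` captures stdout, stderr, and exit codes in-process, which is part of what keeps Tier 1 runs in the seconds range.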
ADLC Component Coverage Map
Every framework component is validated, not just source code. The coverage map tracks what has tests and what does not.
| Component | Count | Test Type | Status |
|---|---|---|---|
| Core Agents | 10 | Frontmatter schema + behavioural spec | Validated |
| Commands | 74 | Frontmatter schema validation | Validated |
| Hooks | 22 | Functional tests — 22/22 tested (300+ cases) | PASS |
| Skills | 20 | Referenced file existence validated | Validated |
| Settings / MANIFEST | 2 | Cross-validated (settings ↔ hooks ↔ agents) | PASS |
| MCPs | 58 | JSON schema + connectivity shape | Valid JSON |
| CLI source code | 22 modules | pytest unit + integration | 59% coverage (baseline) |
| E2E / Playwright | 22 tests | Browser automation against Docusaurus | PASS |
59% source coverage is below the 80% enterprise target. The active remediation track is CloudOps-Runbooks S5: moto test unskipping + quality gate fixes. Target: ≥80% by end of S5.
4-Level Enterprise Standards Pyramid
Quality matures in layers. Each level depends on the one below it being stable.
```
┌─────────────────────────────────┐
│  Level 4: AI Governance         │  2027-2030
│  Agent audit trails             │
│  LLM-as-Judge evaluation        │
│  Constitutional AI enforcement  │
└─────────────────────────────────┘
┌───────────────────────────────────┐
│  Level 3: Supply Chain Security   │  2026-2027
│  SLSA Level 3                     │
│  Signed releases (Sigstore)       │
│  Reproducible builds              │
│  Wolfi base images                │
└───────────────────────────────────┘
┌─────────────────────────────────────┐
│  Level 2: Security Hardening        │  2025-2026
│  SAST (bandit)                      │
│  SCA (pip-audit)                    │
│  SBOM (cyclonedx)                   │
│  Secret scanning (truffleHog)       │
│  Pinned action SHA digests          │
└─────────────────────────────────────┘
┌───────────────────────────────────────┐
│  Level 1: Code Quality Foundation     │  2024-2026
│  Linting — ruff (0 errors)            │
│  Type checking — pyright/mypy         │
│  Coverage ≥80% (pytest --cov)         │
│  TDD red/green/refactor discipline    │
│  BDD Given/When/Then docstrings       │
└───────────────────────────────────────┘
```
Level Detail
Level 1 — Code Quality (Foundation, 2024-2026)
The baseline that enables everything else. Without passing linting and working tests, security scans produce false positives and supply chain tooling cannot attest builds.
| Tool | Rule | Enforcement |
|---|---|---|
| ruff check | 0 errors (warnings allowed) | CI gate — blocks merge |
| pyright / mypy | No untyped public functions | CI advisory (becoming gate) |
| pytest --cov | ≥80% line coverage | CI gate (currently advisory at 59%) |
| ruff format | Consistent formatting | Pre-commit hook |
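One plausible way to wire these thresholds into project configuration, sketched below. Section names follow ruff and coverage.py conventions; the values are assumptions, not the project's actual settings:

```toml
# pyproject.toml (illustrative fragment)
[tool.ruff]
line-length = 100          # assumed project default

[tool.coverage.report]
fail_under = 80            # mirrors the ≥80% CI coverage gate
```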
Level 2 — Security Hardening (2025-2026)
Shift security left: every commit is scanned, not just releases.
| Tool | Purpose | Scope |
|---|---|---|
| bandit | SAST — Python code patterns | 0 HIGH/CRITICAL to merge |
| pip-audit | SCA — known CVEs in deps | Weekly scheduled + PR trigger |
| cyclonedx-python | SBOM generation | Every tagged release |
| truffleHog | Secret scanning | Pre-commit + CI on push |
| Pinned SHA digests | Actions supply chain | All .github/workflows/ |
Level 3 — Supply Chain Security (2026-2027)
Prove the build artifact matches the source. Required for APRA CPS 234 attestation.
| Standard | Mechanism | Target |
|---|---|---|
| SLSA Level 2 | GitHub Actions provenance | 2026-Q4 |
| SLSA Level 3 | Isolated build environment | 2027 |
| Sigstore / cosign | Container + release signing | 2027 |
| Wolfi base images | CVE-minimal containers | 2026-Q3 |
| Reproducible builds | Deterministic wheel builds | 2028 |
Level 4 — AI Governance (2027-2030)
ADLC-specific: governing AI agents is a distinct discipline from governing software.
| Capability | Mechanism | Why It Matters |
|---|---|---|
| Agent audit trails | JSON evidence in tmp/ → promoted to git | Regulators require agent decision traceability |
| LLM-as-Judge evaluation | Automated scoring of agent output quality | Replaces manual HITL review for routine quality checks |
| Constitutional AI enforcement | 58 checkpoints, 33 anti-patterns, hook guards | Governance embedded in process, not policed after the fact |
| Drift detection | Scheduled re-evaluation of deployed agents | Agents degrade without retraining signals |
Progressive Testing Pipeline
Three tiers ensure fast feedback during development and deep validation before release. The pipeline gates are additive: Tier 2 only runs if Tier 1 passes.
| | Tier 1: Unit Tests | Tier 2: Integration | Tier 3: E2E |
|---|---|---|---|
| Duration | ~2 seconds | ~30 seconds | ~15 minutes |
| Cost | $0 | $0 | ~$5 (CI mins) |
| Scope | Functions/modules | CLI commands + mock AWS (moto) | Browser + AWS sandbox |
| Tools | pytest + fixtures | pytest + moto + LocalStack | Playwright + LocalStack/AWS |
| Gate | All tests PASS | All tests PASS + ≥80% coverage | All E2E PASS + Lighthouse ≥90 |
Running the Pipeline Locally
```shell
# Tier 1 — fast unit feedback (run on every save)
task test:unit

# Tier 2 — integration with mocked AWS
task test:integration

# Tier 3 — full E2E (run before PR)
task test:e2e

# Full progressive pipeline (Tier 1 → Tier 2 → Tier 3, gate-chained)
task test:progressive
```
The same `task` commands run in CI, so there are no "it works on my machine" discrepancies; `act` validates workflow YAML locally before push.
Quality Gate Reference
All 16 gates run on every pull request targeting main. Gates marked BLOCK fail the PR; gates marked ADVISORY post a warning comment.
| Gate | Tool | Threshold | Enforcement |
|---|---|---|---|
| Lint | ruff check | 0 errors | BLOCK |
| Format | ruff format --check | 0 diff | BLOCK |
| Unit tests | pytest tests/unit | 100% PASS | BLOCK |
| Integration tests | pytest tests/integration | 100% PASS | BLOCK |
| Coverage | pytest --cov | ≥80% (advisory until S5) | ADVISORY |
| Type checking | pyright | 0 errors on public API | ADVISORY |
| SAST | bandit -r src | 0 HIGH/CRITICAL | BLOCK |
| SCA | pip-audit | 0 CRITICAL CVEs | BLOCK |
| IaC security | checkov | 0 FAILED policies | BLOCK |
| Container scan | trivy | 0 HIGH/CRITICAL | BLOCK |
| Secrets | truffleHog | 0 detected | BLOCK |
| Infracost | infracost diff | ≤+5% cost delta | ADVISORY |
| Hook tests | bash tests/hooks/run-all-tests.sh | 190+ cases PASS | BLOCK |
| E2E smoke | Playwright @smoke | All PASS | BLOCK |
| Lighthouse | lhci autorun | ≥90 performance | ADVISORY |
| SBOM | cyclonedx-python | Generated | ADVISORY |
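The shape of a single blocking gate, sketched as a GitHub Actions job. This is illustrative only; the real workflows live in `.github/workflows/`, and the Level 2 target is to pin each action to a SHA digest rather than a tag:

```yaml
# Illustrative gate job, not the project's actual workflow
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4        # target state: pinned SHA digest
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pipx run ruff check .       # BLOCK gate: 0 errors to merge
```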
Compliance Mapping
Enterprise FSI buyers require demonstrable mapping from testing practice to regulatory obligation.
| Regulation | Requirement | ADLC Control | Evidence Artifact |
|---|---|---|---|
| APRA CPS 234 | Information security capability | SAST + SCA + secret scanning gates | CI scan reports |
| APRA CPS 234 | Incident response testing | Runbook tests + chaos scenarios | Test suite results |
| SOC 2 Type II | Change management | PR gates + HITL approval | Git history + PR log |
| SOC 2 Type II | Availability monitoring | MELT telemetry + SLOs | Dashboard exports |
| ISO 27001 | Vulnerability management | pip-audit weekly + SCA | Scan reports in CI |
| GDPR / Privacy Act | Data handling controls | bandit SQL injection rules | SAST report |
xOps Page Reference
The xOps voice-enabled RAG chatbot has its own landing page at /xops and a PR/FAQ at /docs/business-cases/xops-prfaq. Testing methodology for xOps specifically (including the 8/10 ops question accuracy target at ≤$180/month) follows the progressive pipeline above, with the addition of:
- RAG accuracy testing: LLM-as-Judge scoring against a golden question set
- Voice round-trip testing: speech → ASR → RAG → TTS → latency ≤3s P95
- Multi-account AWS fixture isolation via `moto` account-level patching
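The ≤3 s P95 latency budget can be checked with a simple nearest-rank percentile, sketched here (the sample latency values are invented):

```python
def p95(samples):
    """Nearest-rank 95th percentile: smallest value covering 95% of samples."""
    ordered = sorted(samples)
    rank = -(-95 * len(ordered) // 100)  # ceil(0.95 * n) without math.ceil
    return ordered[rank - 1]

# Invented voice round-trip latencies (seconds) for ten queries
latencies = [1.2, 1.4, 0.9, 2.1, 1.8, 2.6, 1.1, 1.3, 1.9, 2.4]
assert p95(latencies) <= 3.0  # within the ≤3 s P95 budget
```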
Source files: docs/src/pages/xops.jsx (landing page), docs/docs/business-cases/xops-prfaq.mdx (PR/FAQ).
Evidence
- Coverage baseline: `pytest --cov` output 2026-03-17 — 33/33 PASS, 59.27%
- Hook test suite: `tests/hooks/run-all-tests.sh` — 9 suites, 190+ cases PASS
- CLI tests: `framework/cli/` — 337 tests, 12 suites PASS
- Quality gates config: `.github/workflows/` — 16 active gates
- Anti-patterns list: `.claude/rules/adlc-governance.md` — 33 patterns tracked
- Constitution checkpoints: `.specify/memory/constitution.md` — 58 checkpoints