Enterprise-Grade Testing & Quality Roadmap
Test early, test always, test automatically — fixing defects at test time is ~10x cheaper than in production.
This page covers the testing philosophy, toolchain, and maturity roadmap that underpins every ADLC project. It answers two questions for enterprise buyers: which standards we are targeting, and why they matter commercially.
Enterprise Standards Checklist (2026-2030)
| Domain | Standard | Current State | Target State | When |
|---|---|---|---|---|
| Testing | Coverage ≥80%, 3-tier progressive | 59% (baseline) | ≥80% | 2026 |
| Testing | TDD discipline — red/green/refactor | Introduced S1 | Sprint default | 2026 |
| Testing | BDD Given/When/Then docstrings | Introduced S1 | All public APIs | 2026 |
| Security | SAST — bandit static analysis | CI gate live | 0 HIGH/CRITICAL | 2026 |
| Security | SCA — pip-audit dependency scan | CI gate live | Weekly scheduled | 2026 |
| Security | SBOM generation (cyclonedx) | Planned | Every release | 2026-Q3 |
| Security | Secret scanning (truffleHog) | Planned | Pre-commit + CI | 2026-Q3 |
| CI/CD | Docker-first act local validation | Adopted | All workflows | 2026 |
| CI/CD | GitOps — branch protection + signed tags | Partial | Main branch gated | 2026 |
| CI/CD | Pinned SHA digests for all actions | Planned | All workflows | 2026-Q3 |
| Observability | MELT telemetry (Metrics/Events/Logs/Traces) | Planned | Per SLO | 2026-Q4 |
| Observability | SLO definitions with error budgets | Planned | Per service | 2027 |
| Observability | DORA metrics — all 4 tracked | DORA captured | Automated | 2026 |
| Compliance | APRA CPS 234 (FSI) | Documented | Auditable | 2027 |
| Compliance | SOC 2 Type II controls | Documented | Certified | 2027-Q3 |
| Supply Chain | Pinned dependencies with hashes | Partial (uv.lock) | All projects | 2026 |
| Supply Chain | Wolfi/distroless base containers | Adopted for E2E | All containers | 2026-Q3 |
| Supply Chain | SLSA Level 2+ (provenance attestation) | Planned | Release pipeline | 2027 |
| Supply Chain | Reproducible builds | Planned | SLSA L3 | 2028 |
pytest: 33/33 PASS, 59.27% coverage (as of 2026-03-17). CI gates: 16 active quality gates. Hook tests: 190+ cases, 9 suites PASS.
TDD Business Value
Test-Driven Development is not a testing technique — it is a design technique that produces testable software as a by-product.
TDD Value Table
| Dimension | Value | Evidence |
|---|---|---|
| Defect Prevention | Catch bugs before release | 5 source-code bugs found during TDD red phase (S1) |
| Regression Safety | Tests as a safety net for refactoring | 653 tests → 5,900+ tests this sprint |
| Refactoring Confidence | BDD tests as living specification | Given/When/Then docstrings verified against behaviour |
| Release Velocity | CI gates block broken code from merging | 16 quality gates — zero broken merges to main |
| Cost Reduction | 10x cheaper to fix at test time | vs production incidents (IBM Systems Sciences Institute) |
| Design Clarity | Forces modular, injectable interfaces | Click CLI commands testable in isolation |
TDD Cycle Applied to ADLC
1. RED — Write a failing test that specifies the behaviour
2. GREEN — Write the minimum code to make the test pass
3. REFACTOR — Clean up while keeping tests green
4. REPEAT
```python
# Example: TDD for a CloudOps CLI command

# Step 1: RED — test first
def test_list_ec2_instances_returns_table(runner, mock_boto3):
    """Given valid AWS credentials, when listing EC2, then render a Rich table."""
    result = runner.invoke(cli, ["ec2", "list"])
    assert result.exit_code == 0
    assert "Instance ID" in result.output

# Step 2: GREEN — minimal implementation to pass
# Step 3: REFACTOR — extract table rendering to rich_utils.py
```
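A hypothetical GREEN-step implementation might look like the sketch below. Command and module names are illustrative, and `click.echo` stands in for the Rich table until the REFACTOR step extracts a real renderer:

```python
import click

@click.group()
def cli():
    """Hypothetical CloudOps CLI root (names are illustrative)."""

@cli.group()
def ec2():
    """EC2 subcommands."""

@ec2.command("list")
def list_instances():
    # GREEN: the minimum output that makes the RED test pass.
    # Rendering a real Rich table is deferred to the REFACTOR step.
    click.echo("Instance ID")
```

The RED test passes against this stub; REFACTOR then swaps `click.echo` for a proper `rich.table.Table` without touching the test.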
BDD Business Value
Behaviour-Driven Development elevates tests from implementation checks to business-readable contracts.
BDD Value Table
| Dimension | Value | Application |
|---|---|---|
| Living Documentation | Tests ARE the API specification | Test function names and docstrings are the spec |
| Stakeholder Alignment | Business users can read and verify behaviour | Given/When/Then readable without Python knowledge |
| API Contract Testing | Locks public API shape against regression | Click CLI --help output and parameter names are contract |
| Rich CLI Docs | pdoc auto-generates HTML from docstrings | task docs:api → browseable API reference |
| Acceptance Criteria Traceability | User story ACs map to test functions | Each AC reference tagged in test docstring |
BDD Pattern in ADLC Projects
```python
def test_cost_explorer_returns_monthly_summary(runner, mock_mcp):
    """
    Given: A valid AWS profile with Cost Explorer access
    When: The user runs `runbooks cost monthly --profile dev`
    Then: A table showing service-level costs is rendered
    And: Total cost is displayed in the summary row
    And: Exit code is 0
    """
    result = runner.invoke(cli, ["cost", "monthly", "--profile", "dev"])
    assert result.exit_code == 0
    assert "Amazon EC2" in result.output
    assert "Total" in result.output
```
When a BDD test fails, the failure message names the broken behaviour in business terms — not just a line number. This makes triage faster for both engineers and HITL managers.
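The `runner` fixture assumed by these examples is typically wired in `conftest.py`. A minimal sketch (the `mock_mcp` / `mock_boto3` fixtures are project-specific and omitted here):

```python
# Hypothetical conftest.py fragment; only the CLI runner fixture is shown.
import pytest
from click.testing import CliRunner

@pytest.fixture
def runner():
    """Fresh, isolated Click runner per test, so output never leaks between cases."""
    return CliRunner()
```

Click's `CliRunner` captures stdout, stderr, and exit codes in-process, which is part of what keeps Tier 1 runs in the seconds range.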
ADLC Component Coverage Map
Every framework component is validated, not just source code. The coverage map tracks what has tests and what does not.
| Component | Count | Test Type | Status |
|---|---|---|---|
| Core Agents | 10 | Frontmatter schema + behavioural spec | Validated |
| Commands | 74 | Frontmatter schema validation | Validated |
| Hooks | 22 | Functional tests — 22/22 tested (300+ cases) | PASS |
| Skills | 20 | Referenced file existence validated | Validated |
| Settings / MANIFEST | 2 | Cross-validated (settings ↔ hooks ↔ agents) | PASS |
| MCPs | 58 | JSON schema + connectivity shape | Valid JSON |
| CLI source code | 22 modules | pytest unit + integration | 59% coverage (baseline) |
| E2E / Playwright | 22 tests | Browser automation against Docusaurus | PASS |
59% source coverage is below the 80% enterprise target. The active remediation track is CloudOps-Runbooks S5: moto test unskipping + quality gate fixes. Target: ≥80% by end of S5.
4-Level Enterprise Standards Pyramid
Quality matures in layers. Each level depends on the one below it being stable.
```
┌─────────────────────────────────┐
│  Level 4: AI Governance         │  2027-2030
│  Agent audit trails             │
│  LLM-as-Judge evaluation        │
│  Constitutional AI enforcement  │
└─────────────────────────────────┘
┌───────────────────────────────────┐
│  Level 3: Supply Chain Security   │  2026-2027
│  SLSA Level 3                     │
│  Signed releases (Sigstore)       │
│  Reproducible builds              │
│  Wolfi base images                │
└───────────────────────────────────┘
┌─────────────────────────────────────┐
│  Level 2: Security Hardening        │  2025-2026
│  SAST (bandit)                      │
│  SCA (pip-audit)                    │
│  SBOM (cyclonedx)                   │
│  Secret scanning (truffleHog)       │
│  Pinned action SHA digests          │
└─────────────────────────────────────┘
┌───────────────────────────────────────┐
│  Level 1: Code Quality Foundation     │  2024-2026
│  Linting — ruff (0 errors)            │
│  Type checking — pyright/mypy         │
│  Coverage ≥80% (pytest --cov)         │
│  TDD red/green/refactor discipline    │
│  BDD Given/When/Then docstrings       │
└───────────────────────────────────────┘
```
Level Detail
Level 1 — Code Quality (Foundation, 2024-2026)
The baseline that enables everything else. Without passing linting and working tests, security scans produce false positives and supply chain tooling cannot attest builds.
| Tool | Rule | Enforcement |
|---|---|---|
| ruff check | 0 errors (warnings allowed) | CI gate — blocks merge |
| pyright / mypy | No untyped public functions | CI advisory (becoming gate) |
| pytest --cov | ≥80% line coverage | CI gate (currently advisory at 59%) |
| ruff format | Consistent formatting | Pre-commit hook |
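One plausible way to wire these thresholds into project configuration, sketched below. Section names follow ruff and coverage.py conventions; the values are assumptions, not the project's actual settings:

```toml
# pyproject.toml (illustrative fragment)
[tool.ruff]
line-length = 100          # assumed project default

[tool.coverage.report]
fail_under = 80            # mirrors the ≥80% CI coverage gate
```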
Level 2 — Security Hardening (2025-2026)
Shift security left: every commit is scanned, not just releases.
| Tool | Purpose | Scope |
|---|---|---|
| bandit | SAST — Python code patterns | 0 HIGH/CRITICAL to merge |
| pip-audit | SCA — known CVEs in deps | Weekly scheduled + PR trigger |
| cyclonedx-python | SBOM generation | Every tagged release |
| truffleHog | Secret scanning | Pre-commit + CI on push |
| Pinned SHA digests | Actions supply chain | All .github/workflows/ |
Level 3 — Supply Chain Security (2026-2027)
Prove the build artifact matches the source. Required for APRA CPS 234 attestation.
| Standard | Mechanism | Target |
|---|---|---|
| SLSA Level 2 | GitHub Actions provenance | 2026-Q4 |
| SLSA Level 3 | Isolated build environment | 2027 |
| Sigstore / cosign | Container + release signing | 2027 |
| Wolfi base images | CVE-minimal containers | 2026-Q3 |
| Reproducible builds | Deterministic wheel builds | 2028 |
Level 4 — AI Governance (2027-2030)
ADLC-specific: governing AI agents is a distinct discipline from governing software.
| Capability | Mechanism | Why It Matters |
|---|---|---|
| Agent audit trails | JSON evidence in tmp/ → promoted to git | Regulators require agent decision traceability |
| LLM-as-Judge evaluation | Automated scoring of agent output quality | Replaces manual HITL review for routine quality checks |
| Constitutional AI enforcement | 58 checkpoints, 33 anti-patterns, hook guards | Governance embedded in process, not policed after the fact |
| Drift detection | Scheduled re-evaluation of deployed agents | Agents degrade without retraining signals |
Progressive Testing Pipeline
Three tiers ensure fast feedback during development and deep validation before release. The pipeline gates are additive: Tier 2 only runs if Tier 1 passes.
| | Tier 1: Unit Tests | Tier 2: Integration | Tier 3: E2E |
|---|---|---|---|
| Duration | ~2 seconds | ~30 seconds | ~15 minutes |
| Cost | $0 | $0 | ~$5 (CI mins) |
| Scope | Functions/modules | CLI commands + mock AWS (moto) | Browser + AWS sandbox |
| Tools | pytest + fixtures | pytest + moto + LocalStack | Playwright + LocalStack/AWS |
| Gate | All tests PASS | All tests PASS + ≥80% coverage | All E2E PASS + Lighthouse ≥90 |
Running the Pipeline Locally
```shell
# Tier 1 — fast unit feedback (run on every save)
task test:unit

# Tier 2 — integration with mocked AWS
task test:integration

# Tier 3 — full E2E (run before PR)
task test:e2e

# Full progressive pipeline (Tier 1 → Tier 2 → Tier 3, gate-chained)
task test:progressive
```
The same `task` commands run in CI, so there are no "it works on my machine" discrepancies; `act` validates workflow YAML locally before push.
Quality Gate Reference
All 16 gates run on every pull request targeting main. Gates marked BLOCK fail the PR; gates marked ADVISORY post a warning comment.
| Gate | Tool | Threshold | Enforcement |
|---|---|---|---|
| Lint | ruff check | 0 errors | BLOCK |
| Format | ruff format --check | 0 diff | BLOCK |
| Unit tests | pytest tests/unit | 100% PASS | BLOCK |
| Integration tests | pytest tests/integration | 100% PASS | BLOCK |
| Coverage | pytest --cov | ≥80% (advisory until S5) | ADVISORY |
| Type checking | pyright | 0 errors on public API | ADVISORY |
| SAST | bandit -r src | 0 HIGH/CRITICAL | BLOCK |
| SCA | pip-audit | 0 CRITICAL CVEs | BLOCK |
| IaC security | checkov | 0 FAILED policies | BLOCK |
| Container scan | trivy | 0 HIGH/CRITICAL | BLOCK |
| Secrets | truffleHog | 0 detected | BLOCK |
| Infracost | infracost diff | ≤+5% cost delta | ADVISORY |
| Hook tests | bash tests/hooks/run-all-tests.sh | 190+ cases PASS | BLOCK |
| E2E smoke | Playwright @smoke | All PASS | BLOCK |
| Lighthouse | lhci autorun | ≥90 performance | ADVISORY |
| SBOM | cyclonedx-python | Generated | ADVISORY |
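The shape of a single blocking gate, sketched as a GitHub Actions job. This is illustrative only; the real workflows live in `.github/workflows/`, and the Level 2 target is to pin each action to a SHA digest rather than a tag:

```yaml
# Illustrative gate job, not the project's actual workflow
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4        # target state: pinned SHA digest
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pipx run ruff check .       # BLOCK gate: 0 errors to merge
```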
Compliance Mapping
Enterprise FSI buyers require demonstrable mapping from testing practice to regulatory obligation.
| Regulation | Requirement | ADLC Control | Evidence Artifact |
|---|---|---|---|
| APRA CPS 234 | Information security capability | SAST + SCA + secret scanning gates | CI scan reports |
| APRA CPS 234 | Incident response testing | Runbook tests + chaos scenarios | Test suite results |
| SOC 2 Type II | Change management | PR gates + HITL approval | Git history + PR log |
| SOC 2 Type II | Availability monitoring | MELT telemetry + SLOs | Dashboard exports |
| ISO 27001 | Vulnerability management | pip-audit weekly + SCA | Scan reports in CI |
| GDPR / Privacy Act | Data handling controls | bandit SQL injection rules | SAST report |
xOps Page Reference
The xOps voice-enabled RAG chatbot has its own landing page at /xops and a PR/FAQ at /docs/business-cases/xops-prfaq. Testing methodology for xOps specifically (including the 8/10 ops question accuracy target at ≤$180/month) follows the progressive pipeline above, with the addition of:
- RAG accuracy testing: LLM-as-Judge scoring against a golden question set
- Voice round-trip testing: speech → ASR → RAG → TTS → latency ≤3s P95
- Multi-account AWS fixture isolation via `moto` account-level patching
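The ≤3 s P95 latency budget can be checked with a simple nearest-rank percentile, sketched here (the sample latency values are invented):

```python
def p95(samples):
    """Nearest-rank 95th percentile: smallest value covering 95% of samples."""
    ordered = sorted(samples)
    rank = -(-95 * len(ordered) // 100)  # ceil(0.95 * n) without math.ceil
    return ordered[rank - 1]

# Invented voice round-trip latencies (seconds) for ten queries
latencies = [1.2, 1.4, 0.9, 2.1, 1.8, 2.6, 1.1, 1.3, 1.9, 2.4]
assert p95(latencies) <= 3.0  # within the ≤3 s P95 budget
```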
Source files: docs/src/pages/xops.jsx (landing page), docs/docs/business-cases/xops-prfaq.mdx (PR/FAQ).
Evidence
- Coverage baseline: `pytest --cov` output 2026-03-17 — 33/33 PASS, 59.27%
- Hook test suite: `tests/hooks/run-all-tests.sh` — 9 suites, 190+ cases PASS
- CLI tests: `framework/cli/` — 337 tests, 12 suites PASS
- Quality gates config: `.github/workflows/` — 16 active gates
- Anti-patterns list: `.claude/rules/adlc-governance.md` — 33 patterns tracked
- Constitution checkpoints: `.specify/memory/constitution.md` — 58 checkpoints