
Enterprise-Grade Testing & Quality Roadmap

Test early, test always, test automatically — 10x cheaper at test time than in production.

This page covers the testing philosophy, toolchain, and maturity roadmap that underpin every ADLC project. It answers two questions for enterprise buyers: what standard are we targeting, and why does it matter commercially?


Enterprise Standards Checklist (2026-2030)

| Domain | Standard | Current State | Target State | When |
| --- | --- | --- | --- | --- |
| Testing | Coverage ≥80%, 3-tier progressive | 59% (baseline) | ≥80% | 2026 |
| Testing | TDD discipline: red/green/refactor | Introduced S1 | Sprint default | 2026 |
| Testing | BDD Given/When/Then docstrings | Introduced S1 | All public APIs | 2026 |
| Security | SAST: bandit static analysis | CI gate live | 0 HIGH/CRITICAL | 2026 |
| Security | SCA: pip-audit dependency scan | CI gate live | Weekly scheduled | 2026 |
| Security | SBOM generation (cyclonedx) | Planned | Every release | 2026-Q3 |
| Security | Secret scanning (truffleHog) | Planned | Pre-commit + CI | 2026-Q3 |
| CI/CD | Docker-first act local validation | Adopted | All workflows | 2026 |
| CI/CD | GitOps: branch protection + signed tags | Partial | Main branch gated | 2026 |
| CI/CD | Pinned SHA digests for all actions | Planned | All workflows | 2026-Q3 |
| Observability | MELT telemetry (Metrics/Events/Logs/Traces) | Planned | Per SLO | 2026-Q4 |
| Observability | SLO definitions with error budgets | Planned | Per service | 2027 |
| Observability | DORA metrics: all 4 tracked | DORA captured | Automated | 2026 |
| Compliance | APRA CPS 234 (FSI) | Documented | Auditable | 2027 |
| Compliance | SOC 2 Type II controls | Documented | Certified | 2027-Q3 |
| Supply Chain | Pinned dependencies with hashes | Partial (uv.lock) | All projects | 2026 |
| Supply Chain | Wolfi/distroless base containers | Adopted for E2E | All containers | 2026-Q3 |
| Supply Chain | SLSA Level 2+ (provenance attestation) | Planned | Release pipeline | 2027 |
| Supply Chain | Reproducible builds | Planned | SLSA L3 | 2028 |

Current Baseline

pytest: 33/33 PASS, 59.27% coverage (as of 2026-03-17). CI gates: 16 active quality gates. Hook tests: 190+ cases, 9 suites PASS.


TDD Business Value

Test-Driven Development is not a testing technique — it is a design technique that produces testable software as a by-product.

TDD Value Table

| Dimension | Value | Evidence |
| --- | --- | --- |
| Defect Prevention | Catch bugs before release | 5 source-code bugs found during TDD red phase (S1) |
| Regression Safety | Tests as a safety net for refactoring | 653 tests → 5,900+ tests this sprint |
| Refactoring Confidence | BDD tests as living specification | Given/When/Then docstrings verified against behaviour |
| Release Velocity | CI gates block broken code from merging | 16 quality gates, zero broken merges to main |
| Cost Reduction | 10x cheaper to fix at test time | vs production incidents (IBM Systems Sciences Institute) |
| Design Clarity | Forces modular, injectable interfaces | Click CLI commands testable in isolation |

TDD Cycle Applied to ADLC

1. RED   — Write a failing test that specifies the behaviour
2. GREEN — Write the minimum code to make the test pass
3. REFACTOR — Clean up while keeping tests green
4. REPEAT

# Example: TDD for a CloudOps CLI command

# Step 1: RED (test first)
# `cli` is the project's Click entry point; `runner` and `mock_boto3` are
# pytest fixtures assumed to be provided by the project's conftest.py.
def test_list_ec2_instances_returns_table(runner, mock_boto3):
    """Given valid AWS credentials, when listing EC2, then render a Rich table."""
    result = runner.invoke(cli, ["ec2", "list"])
    assert result.exit_code == 0
    assert "Instance ID" in result.output

# Step 2: GREEN (minimal implementation to pass)
# Step 3: REFACTOR (extract table rendering to rich_utils.py)
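
The GREEN and REFACTOR steps can be sketched without AWS or Click. This is a minimal illustration rather than the project's implementation: `render_instance_table` and the sample records are hypothetical, and plain string formatting stands in for the Rich table named in the test above.

```python
from typing import Iterable, Mapping


def render_instance_table(instances: Iterable[Mapping[str, str]]) -> str:
    """Given instance records, when rendered, then return a text table with a header row."""
    headers = ("Instance ID", "State")
    rows = [(i["InstanceId"], i["State"]) for i in instances]
    # Column width = longest cell in that column, header included.
    widths = [max(len(cell) for cell in column) for column in zip(headers, *rows)]
    lines = [
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths))
        for row in (headers, *rows)
    ]
    return "\n".join(line.rstrip() for line in lines)


def test_render_instance_table_has_header_and_rows():
    """Given two instances, when rendering, then the header and both IDs appear."""
    output = render_instance_table([
        {"InstanceId": "i-0abc", "State": "running"},
        {"InstanceId": "i-0def", "State": "stopped"},
    ])
    assert output.splitlines()[0].startswith("Instance ID")
    assert "i-0abc" in output and "i-0def" in output
```

Because the rendering is a pure function, it can be tested in Tier 1 with no mocks, and the CLI test only needs to confirm that the command wires it up.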

BDD Business Value

Behaviour-Driven Development elevates tests from implementation checks to business-readable contracts.

BDD Value Table

| Dimension | Value | Application |
| --- | --- | --- |
| Living Documentation | Tests ARE the API specification | Test function names and docstrings are the spec |
| Stakeholder Alignment | Business users can read and verify behaviour | Given/When/Then readable without Python knowledge |
| API Contract Testing | Locks public API shape against regression | Click CLI --help output and parameter names are contract |
| Rich CLI Docs | pdoc auto-generates HTML from docstrings | task docs:api → browseable API reference |
| Acceptance Criteria Traceability | User story ACs map to test functions | Each AC reference tagged in test docstring |

BDD Pattern in ADLC Projects

def test_cost_explorer_returns_monthly_summary(runner, mock_mcp):
    """
    Given: A valid AWS profile with Cost Explorer access
    When: The user runs `runbooks cost monthly --profile dev`
    Then: A table showing service-level costs is rendered
    And: Total cost is displayed in the summary row
    And: Exit code is 0
    """
    result = runner.invoke(cli, ["cost", "monthly", "--profile", "dev"])
    assert result.exit_code == 0
    assert "Amazon EC2" in result.output
    assert "Total" in result.output

Living Documentation

When a BDD test fails, the failure message names the broken behaviour in business terms — not just a line number. This makes triage faster for both engineers and HITL managers.
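
The Given/When/Then docstring convention only works as living documentation if it is enforced. One possible check is sketched below, using only the standard library; `missing_bdd_keywords` and the two sample tests are hypothetical names, and the keyword match is deliberately loose (case-insensitive, whole word), which is an assumption rather than the project's rule.

```python
import inspect
import re

REQUIRED_KEYWORDS = ("Given", "When", "Then")


def missing_bdd_keywords(test_func) -> list[str]:
    """Return the Given/When/Then keywords absent from a test's docstring."""
    doc = inspect.getdoc(test_func) or ""
    return [
        keyword
        for keyword in REQUIRED_KEYWORDS
        if not re.search(rf"\b{keyword}\b", doc, re.IGNORECASE)
    ]


def test_documented():
    """
    Given: A valid AWS profile
    When: The user runs the monthly cost command
    Then: A summary table is rendered
    """


def test_undocumented():
    """Checks the cost command."""
```

A collection-time hook or a standalone lint step could call this for every test function and fail when the returned list is not empty.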


ADLC Component Coverage Map

Every framework component is validated, not just source code. The coverage map tracks what has tests and what does not.

| Component | Count | Test Type | Status |
| --- | --- | --- | --- |
| Core Agents | 10 | Frontmatter schema + behavioural spec | Validated |
| Commands | 74 | Frontmatter schema validation | Validated |
| Hooks | 22 | Functional tests, 22/22 tested (300+ cases) | PASS |
| Skills | 20 | Referenced file existence validated | Validated |
| Settings / MANIFEST | 2 | Cross-validated (settings ↔ hooks ↔ agents) | PASS |
| MCPs | 58 | JSON schema + connectivity shape | Valid JSON |
| CLI source code | 22 modules | pytest unit + integration | 59% coverage (baseline) |
| E2E / Playwright | 22 tests | Browser automation against Docusaurus | PASS |

Coverage Gap

59% source coverage is below the 80% enterprise target. The active remediation track is CloudOps-Runbooks S5: moto test unskipping + quality gate fixes. Target: ≥80% by end of S5.
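
The frontmatter schema validation listed in the coverage map for agents and commands can be sketched as follows. This is an illustration under assumptions: the required keys (`name`, `description`) are hypothetical, and the parser handles only flat `key: value` pairs rather than full YAML.

```python
def parse_frontmatter(text: str) -> dict[str, str]:
    """Parse flat `key: value` pairs between the leading `---` delimiters."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, separator, value = line.partition(":")
        if separator:
            fields[key.strip()] = value.strip()
    return {}  # no closing delimiter: treat the block as malformed


REQUIRED_KEYS = {"name", "description"}  # hypothetical schema, not the real one


def missing_keys(text: str) -> set[str]:
    """Return the required frontmatter keys that the document does not define."""
    return REQUIRED_KEYS - set(parse_frontmatter(text))
```

Running `missing_keys` over every agent and command file, and failing on any non-empty result, gives a schema gate that needs no source-code coverage to be meaningful.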


4-Level Enterprise Standards Pyramid

Quality matures in layers. Each level depends on the one below it being stable.

   ┌─────────────────────────────────┐
   │ Level 4: AI Governance          │  2027-2030
   │   Agent audit trails            │
   │   LLM-as-Judge evaluation       │
   │   Constitutional AI enforcement │
   └─────────────────────────────────┘
  ┌───────────────────────────────────┐
  │ Level 3: Supply Chain Security    │  2026-2027
  │   SLSA Level 3                    │
  │   Signed releases (Sigstore)      │
  │   Reproducible builds             │
  │   Wolfi base images               │
  └───────────────────────────────────┘
 ┌─────────────────────────────────────┐
 │ Level 2: Security Hardening         │  2025-2026
 │   SAST (bandit)                     │
 │   SCA (pip-audit)                   │
 │   SBOM (cyclonedx)                  │
 │   Secret scanning (truffleHog)      │
 │   Pinned action SHA digests         │
 └─────────────────────────────────────┘
┌───────────────────────────────────────┐
│ Level 1: Code Quality Foundation      │  2024-2026
│   Linting: ruff (0 errors)            │
│   Type checking: pyright/mypy         │
│   Coverage ≥80% (pytest --cov)        │
│   TDD red/green/refactor discipline   │
│   BDD Given/When/Then docstrings      │
└───────────────────────────────────────┘

Level Detail

Level 1 — Code Quality (Foundation, 2024-2026)

The baseline that enables everything else. Without passing linting and working tests, security scans produce false positives and supply chain tooling cannot attest builds.

| Tool | Rule | Enforcement |
| --- | --- | --- |
| ruff check | 0 errors (warnings allowed) | CI gate, blocks merge |
| pyright / mypy | No untyped public functions | CI advisory (becoming gate) |
| pytest --cov | ≥80% line coverage | CI gate (currently advisory at 59%) |
| ruff format | Consistent formatting | Pre-commit hook |

Level 2 — Security Hardening (2025-2026)

Shift security left: every commit is scanned, not just releases.

| Tool | Purpose | Scope |
| --- | --- | --- |
| bandit | SAST: Python code patterns | 0 HIGH/CRITICAL to merge |
| pip-audit | SCA: known CVEs in deps | Weekly scheduled + PR trigger |
| cyclonedx-python | SBOM generation | Every tagged release |
| truffleHog | Secret scanning | Pre-commit + CI on push |
| Pinned SHA digests | Actions supply chain | All .github/workflows/ |
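
The pinned-SHA requirement for actions lends itself to a simple automated check. The sketch below scans workflow text for `uses:` references that are not pinned to a full 40-character commit SHA. It is a line-based approximation under stated assumptions: `unpinned_actions` is a hypothetical helper, the YAML is not fully parsed, and `docker://` references are reported as unpinned.

```python
import re

USES_LINE = re.compile(r"^\s*(?:-\s*)?uses:\s*(?P<ref>\S+)")
FULL_SHA = re.compile(r"@[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return `uses:` references that are not pinned to a full commit SHA."""
    unpinned = []
    for line in workflow_text.splitlines():
        match = USES_LINE.match(line)
        if not match:
            continue
        ref = match.group("ref").strip("'\"")
        # Local actions (./path) live in the same repository and carry no version ref.
        if ref.startswith("./"):
            continue
        if not FULL_SHA.search(ref):
            unpinned.append(ref)
    return unpinned
```

Run against each file under .github/workflows/, a non-empty result would fail the gate; a trailing comment such as `# v4` after the SHA keeps the pinned version readable.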

Level 3 — Supply Chain Security (2026-2027)

Prove the build artifact matches the source. Required for APRA CPS 234 attestation.

| Standard | Mechanism | Target |
| --- | --- | --- |
| SLSA Level 2 | GitHub Actions provenance | 2026-Q4 |
| SLSA Level 3 | Isolated build environment | 2027 |
| Sigstore / cosign | Container + release signing | 2027 |
| Wolfi base images | CVE-minimal containers | 2026-Q3 |
| Reproducible builds | Deterministic wheel builds | 2028 |
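
Hash-pinned dependencies, provenance attestation, and reproducible builds all rest on the same primitive: comparing an artifact's digest with a recorded one. The sketch below shows only that primitive with the standard library. It is not a SLSA verifier, and `artifact_matches` is a hypothetical helper name.

```python
import hashlib
import hmac


def sha256_digest(data: bytes) -> str:
    """Return the hex-encoded SHA-256 digest of the artifact bytes."""
    return hashlib.sha256(data).hexdigest()


def artifact_matches(data: bytes, expected_sha256: str) -> bool:
    """Compare the artifact's digest with the recorded digest in constant time."""
    return hmac.compare_digest(sha256_digest(data), expected_sha256.lower())


# Two builds of the same source are reproducible when their digests are equal.
first_build = b"wheel contents"
second_build = b"wheel contents"
assert artifact_matches(second_build, sha256_digest(first_build))
```

A real release pipeline would delegate the signed-provenance part to tooling such as Sigstore; the digest comparison is the step that remains the same in each case.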

Level 4 — AI Governance (2027-2030)

ADLC-specific: governing AI agents is a distinct discipline from governing software.

| Capability | Mechanism | Why It Matters |
| --- | --- | --- |
| Agent audit trails | JSON evidence in tmp/ → promoted to git | Regulators require agent decision traceability |
| LLM-as-Judge evaluation | Automated scoring of agent output quality | Replaces manual HITL review for routine quality checks |
| Constitutional AI enforcement | 58 checkpoints, 33 anti-patterns, hook guards | Governance embedded in process, not policed after the fact |
| Drift detection | Scheduled re-evaluation of deployed agents | Agents degrade without retraining signals |

Progressive Testing Pipeline

Three tiers give fast feedback during development and deep validation before release. The gates are chained: each tier runs only if the previous tier passes.

Tier 1: Unit Tests          Tier 2: Integration         Tier 3: E2E
─────────────────────       ─────────────────────       ─────────────────────
Duration: ~2 seconds        Duration: ~30 seconds       Duration: ~15 minutes
Cost: $0                    Cost: $0                    Cost: ~$5 (CI mins)
Scope: Functions/           Scope: CLI commands         Scope: Browser +
       modules                     + mock AWS (moto)           AWS sandbox
Tools: pytest               Tools: pytest + moto +      Tools: Playwright +
       + fixtures                  LocalStack                  LocalStack/AWS
Gate: All tests PASS        Gate: All tests PASS +      Gate: All E2E PASS +
                                  ≥80% coverage               Lighthouse ≥90
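
The gate chaining between tiers can be sketched as a small runner that stops at the first failing tier. This is an illustration, not the project's Taskfile logic: `run_progressive` is a hypothetical name, and each tier is modelled as a callable returning True on success.

```python
from typing import Callable, Sequence


def run_progressive(
    tiers: Sequence[tuple[str, Callable[[], bool]]],
) -> list[tuple[str, str]]:
    """Run tiers in order; once one fails, mark every later tier as SKIPPED."""
    results = []
    blocked = False
    for name, check in tiers:
        if blocked:
            results.append((name, "SKIPPED"))
            continue
        passed = check()
        results.append((name, "PASS" if passed else "FAIL"))
        blocked = not passed
    return results


outcome = run_progressive([
    ("unit", lambda: True),
    ("integration", lambda: False),  # simulated failure
    ("e2e", lambda: True),
])
print(outcome)
```

Skipping later tiers after a failure keeps the expensive E2E run (about 15 minutes and CI cost) from being spent on a change that cannot merge anyway.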

Running the Pipeline Locally

# Tier 1 — fast unit feedback (run on every save)
task test:unit

# Tier 2 — integration with mocked AWS
task test:integration

# Tier 3 — full E2E (run before PR)
task test:e2e

# Full progressive pipeline (Tier1 → Tier2 → Tier3, gate-chained)
task test:progressive

Local = CI

The same task commands run in CI, which removes "it works on my machine" discrepancies. act validates workflow YAML locally before push.


Quality Gate Reference

All 16 gates run on every pull request targeting main. Gates marked BLOCK fail the PR; gates marked ADVISORY post a warning comment.

| Gate | Tool | Threshold | Enforcement |
| --- | --- | --- | --- |
| Lint | ruff check | 0 errors | BLOCK |
| Format | ruff format --check | 0 diff | BLOCK |
| Unit tests | pytest tests/unit | 100% PASS | BLOCK |
| Integration tests | pytest tests/integration | 100% PASS | BLOCK |
| Coverage | pytest --cov | ≥80% (advisory until S5) | ADVISORY |
| Type checking | pyright | 0 errors on public API | ADVISORY |
| SAST | bandit -r src | 0 HIGH/CRITICAL | BLOCK |
| SCA | pip-audit | 0 CRITICAL CVEs | BLOCK |
| IaC security | checkov | 0 FAILED policies | BLOCK |
| Container scan | trivy | 0 HIGH/CRITICAL | BLOCK |
| Secrets | truffleHog | 0 detected | BLOCK |
| Infracost | infracost diff | ≤+5% cost delta | ADVISORY |
| Hook tests | bash tests/hooks/run-all-tests.sh | 190+ cases PASS | BLOCK |
| E2E smoke | Playwright @smoke | All PASS | BLOCK |
| Lighthouse | lhci autorun | ≥90 performance | ADVISORY |
| SBOM | cyclonedx-python | Generated | ADVISORY |
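
The BLOCK and ADVISORY semantics can be expressed in a few lines. In the sketch below, `GateResult` and `evaluate_pr` are hypothetical names chosen for illustration; the real enforcement lives in the CI workflow configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    name: str
    passed: bool
    enforcement: str  # "BLOCK" or "ADVISORY"


def evaluate_pr(results: list[GateResult]) -> tuple[bool, list[str]]:
    """Return (mergeable, warnings).

    A failed BLOCK gate stops the merge; a failed ADVISORY gate only warns.
    """
    mergeable = all(r.passed for r in results if r.enforcement == "BLOCK")
    warnings = [
        r.name for r in results if r.enforcement == "ADVISORY" and not r.passed
    ]
    return mergeable, warnings


# Coverage below target is advisory, so the PR stays mergeable with a warning.
mergeable, warnings = evaluate_pr([
    GateResult("Lint", passed=True, enforcement="BLOCK"),
    GateResult("Coverage", passed=False, enforcement="ADVISORY"),
])
print(mergeable, warnings)
```

Promoting a gate from ADVISORY to BLOCK, as planned for coverage after S5, then changes only the `enforcement` value and not the evaluation logic.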

Compliance Mapping

Enterprise FSI buyers require demonstrable mapping from testing practice to regulatory obligation.

| Regulation | Requirement | ADLC Control | Evidence Artifact |
| --- | --- | --- | --- |
| APRA CPS 234 | Information security capability | SAST + SCA + secret scanning gates | CI scan reports |
| APRA CPS 234 | Incident response testing | Runbook tests + chaos scenarios | Test suite results |
| SOC 2 Type II | Change management | PR gates + HITL approval | Git history + PR log |
| SOC 2 Type II | Availability monitoring | MELT telemetry + SLOs | Dashboard exports |
| ISO 27001 | Vulnerability management | pip-audit weekly + SCA | Scan reports in CI |
| GDPR / Privacy Act | Data handling controls | bandit SQL injection rules | SAST report |

xOps Page Reference

The xOps voice-enabled RAG chatbot has its own landing page at /xops and a PR/FAQ at /docs/business-cases/xops-prfaq. Testing methodology for xOps specifically (including the 8/10 ops question accuracy target at ≤$180/month) follows the progressive pipeline above, with the addition of:

  • RAG accuracy testing: LLM-as-Judge scoring against a golden question set
  • Voice round-trip testing: speech → ASR → RAG → TTS, with end-to-end latency ≤3s at P95
  • Multi-account AWS fixture isolation via moto account-level patching

Source files: docs/src/pages/xops.jsx (landing page), docs/docs/business-cases/xops-prfaq.mdx (PR/FAQ).
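
The xOps accuracy and latency targets above (8/10 questions judged correct, round-trip latency ≤3s at P95) can be checked with a few lines of standard-library code. This is a sketch under assumptions: `meets_xops_targets` is a hypothetical helper, the judge verdicts are taken as booleans already produced by the LLM-as-Judge step, and P95 uses the nearest-rank definition.

```python
def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of the samples."""
    if not samples:
        raise ValueError("no latency samples")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def meets_xops_targets(judge_verdicts: list[bool], latencies_s: list[float]) -> bool:
    """True when at least 8 in 10 answers are correct and P95 latency is at most 3s."""
    if not judge_verdicts:
        raise ValueError("no judge verdicts")
    # Integer comparison avoids floating-point rounding at exactly 80%.
    accuracy_ok = 10 * sum(judge_verdicts) >= 8 * len(judge_verdicts)
    return accuracy_ok and p95(latencies_s) <= 3.0
```

With only a handful of latency samples, the nearest-rank P95 collapses to the maximum, so a meaningful latency gate needs a reasonably large golden question set.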


Evidence

  • Coverage baseline: pytest --cov output 2026-03-17 — 33/33 PASS, 59.27%
  • Hook test suite: tests/hooks/run-all-tests.sh — 9 suites, 190+ cases PASS
  • CLI tests: framework/cli/ — 337 tests, 12 suites PASS
  • Quality gates config: .github/workflows/ — 16 active gates
  • Anti-patterns list: .claude/rules/adlc-governance.md — 33 patterns tracked
  • Constitution checkpoints: .specify/memory/constitution.md — 58 checkpoints