Testing & Quality
Enterprise-grade AI agent governance requires enterprise-grade testing. This page summarises the ADLC testing strategy and improvement map for ANZ FSI/Energy/Telecom/Aviation targets.
Enterprise Standards Checklist (2026-2030)
| Standard | Current State | Target | Status |
|---|---|---|---|
| Test Coverage | 59.27% (2026-03-17) | ≥80% | In Progress |
| 3-Tier Progressive Pipeline | Tier 1+2 automated, Tier 3 manual | Fully automated Tier 1→2→3 | Partial |
| BDD Given/When/Then docstrings | Introduced Sprint 1 | All public APIs | In Progress |
| SAST (bandit) | CI gate live | 0 HIGH/CRITICAL blocking | In Progress |
| SCA (pip-audit) | CI gate live | Weekly scheduled + gate | In Progress |
| SBOM (cyclonedx) | Planned | Every tagged release | Planned |
| Secret scanning (truffleHog) | Planned | Pre-commit + CI | Planned |
| Docker-first CI (act validated) | Adopted | All workflows | Done |
| DORA metrics (all 4) | Collected via hooks | Automated per-sprint | Partial |
| APRA CPS 234 (FSI) | Documented | Auditable | In Progress |
| Pinned SHA digests (actions) | Partial | All workflows | In Progress |
| Wolfi/distroless containers | Adopted for E2E | All images | In Progress |
| Agent consensus quality | 96% achieved (S1) | ≥95% per change | Enforced |
pytest: 33/33 PASS, 59.27% coverage (2026-03-17). CI gates: 16 active. Hook tests: 190+ cases, 9 suites PASS.
TDD Approach
Test-Driven Development is used as the primary design technique. Passing tests are a by-product; modular, injectable interfaces are the goal.
1. RED — Write a failing test that specifies the behaviour
2. GREEN — Write the minimum code to make the test pass
3. REFACTOR — Clean up the code while the passing tests guard against regressions
4. REPEAT
# Step 1: RED — test describes desired behaviour
def test_list_ec2_instances_returns_table(runner, mock_boto3):
    """Given valid AWS credentials, when listing EC2, then render a Rich table."""
    result = runner.invoke(cli, ["ec2", "list"])
    assert result.exit_code == 0
    assert "Instance ID" in result.output

# Step 2: GREEN — minimal implementation
# Step 3: REFACTOR — extract table rendering to rich_utils.py
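The "injectable interfaces" goal can be illustrated with a small sketch. The names `Ec2Client`, `list_instance_ids`, and `FakeEc2Client` are hypothetical and do not come from the project; the only assumption about AWS is the `Reservations` / `Instances` / `InstanceId` shape of a `describe_instances` response. The point is that the function receives its client as an argument, so a test can pass a fake without patching anything.

```python
from typing import Protocol


class Ec2Client(Protocol):
    """The slice of an EC2 client this code depends on (hypothetical name)."""

    def describe_instances(self) -> dict: ...


def list_instance_ids(client: Ec2Client) -> list[str]:
    """Return instance IDs from whichever client the caller injects."""
    response = client.describe_instances()
    return [
        instance["InstanceId"]
        for reservation in response.get("Reservations", [])
        for instance in reservation.get("Instances", [])
    ]


class FakeEc2Client:
    """Test double that returns a canned describe_instances-shaped response."""

    def describe_instances(self) -> dict:
        return {
            "Reservations": [
                {"Instances": [{"InstanceId": "i-0abc"}, {"InstanceId": "i-0def"}]}
            ]
        }


def test_list_instance_ids_uses_injected_client():
    """Given a fake client, when listing IDs, then no AWS call is made."""
    assert list_instance_ids(FakeEc2Client()) == ["i-0abc", "i-0def"]
```

Because the dependency is a parameter, the same function would accept a real boto3 EC2 client in production code and the fake in Tier 1 tests.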
BDD Approach
Behaviour-Driven Development converts tests from implementation checks into business-readable contracts.
Given a multi-account org with 5 AWS accounts
When querying costs for March 2026
Then all 5 accounts are aggregated with correct totals
def test_cost_explorer_returns_monthly_summary(runner, mock_mcp):
    """
    Given: A valid AWS profile with Cost Explorer access
    When: The user runs `runbooks cost monthly --profile dev`
    Then: A table showing service-level costs is rendered
    And: Total cost is displayed in the summary row
    And: Exit code is 0
    """
    result = runner.invoke(cli, ["cost", "monthly", "--profile", "dev"])
    assert result.exit_code == 0
    assert "Amazon EC2" in result.output
    assert "Total" in result.output
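One way to track the "Given/When/Then docstrings" target is a small docstring check. The following is a rough heuristic sketch, not the project's actual tooling: `has_gwt_docstring` and `missing_gwt` are hypothetical helpers, and the check only looks for the three keywords in order, so ordinary prose that happens to contain them would also pass.

```python
import re

# Matches "Given", "When", "Then" as whole words, in that order, across lines.
_GWT_PATTERN = re.compile(
    r"\bGiven\b.*\bWhen\b.*\bThen\b", re.IGNORECASE | re.DOTALL
)


def has_gwt_docstring(func) -> bool:
    """Return True if the function's docstring contains Given/When/Then in order."""
    doc = func.__doc__ or ""
    return bool(_GWT_PATTERN.search(doc))


def missing_gwt(funcs) -> list[str]:
    """Return the names of functions that lack a Given/When/Then docstring."""
    return [f.__name__ for f in funcs if not has_gwt_docstring(f)]


# Two sample functions used only to demonstrate the check.
def sample_documented():
    """Given a profile, when listing EC2, then a table is rendered."""


def sample_undocumented():
    """Checks the EC2 listing."""
```

Calling `missing_gwt([sample_documented, sample_undocumented])` reports only `sample_undocumented`, which gives a simple list of functions still needing a BDD-style docstring.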
Progressive Testing Pipeline
Tier 1 (Unit, ~2s, $0) → Tier 2 (Integration, ~30s, $0) → Tier 3 (E2E, ~15min, ~$5)
Gates are chained: Tier 2 runs only if Tier 1 passes, and Tier 3 runs only if Tier 2 passes.
| Tier | Scope | Tools | Gate |
|---|---|---|---|
| 1 — Unit | Functions, modules | pytest + fixtures | All tests PASS |
| 2 — Integration | CLI commands, mock AWS | pytest + moto + LocalStack | All PASS + ≥80% coverage |
| 3 — E2E | Browser + AWS sandbox | Playwright + LocalStack | All E2E PASS + Lighthouse ≥90 |
task test:unit # Tier 1 — run on every save
task test:integration # Tier 2 — run before commit
task test:e2e # Tier 3 — run before PR
task test:progressive # Full gate-chained pipeline
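The gate chaining behind the progressive pipeline can be sketched in a few lines. This is an illustration of the fail-fast idea only, not how `task test:progressive` is actually implemented; `run_progressive` and `run_with_task` are hypothetical names, and `run_with_task` assumes the `task` CLI is on the PATH.

```python
import subprocess
from typing import Callable, Sequence

# Tier order taken from the task names listed above.
TIERS = ("test:unit", "test:integration", "test:e2e")


def run_progressive(
    run_tier: Callable[[str], bool],
    tiers: Sequence[str] = TIERS,
) -> list[tuple[str, bool]]:
    """Run tiers in order and stop at the first failure."""
    results = []
    for tier in tiers:
        passed = run_tier(tier)
        results.append((tier, passed))
        if not passed:
            break  # slower, costlier tiers are skipped once a gate fails
    return results


def run_with_task(tier: str) -> bool:
    """Run one tier through the `task` CLI; True when it exits with code 0."""
    return subprocess.run(["task", tier]).returncode == 0
```

Passing the runner in as a callable keeps the chaining logic testable without invoking any real test suite; `run_progressive(run_with_task)` would execute the actual tiers.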
Component Coverage Map
| Component | Count | Test Type | Status |
|---|---|---|---|
| Core Agents | 10 | Frontmatter schema + behavioural spec | Validated |
| Commands | 74 | Frontmatter schema validation | Validated |
| Hooks | 22 | Functional tests — 22/22 tested (300+ cases) | PASS |
| Skills | 20 | Referenced file existence validated | Validated |
| Settings / MANIFEST | 2 | Cross-validated (settings ↔ hooks ↔ agents) | PASS |
| MCPs | 58 | JSON schema + connectivity shape | Valid JSON |
| CLI source modules | 22 | pytest unit + integration | 59% (baseline) |
| E2E / Playwright | 22 tests | Browser automation against Docusaurus | PASS |
59% source coverage is below the 80% enterprise target. Active remediation: CloudOps-Runbooks S5 moto test unskipping + quality gate fixes. Target: ≥80% by end of S5.
Quality Gates (All 16)
| Gate | Tool | Threshold | Enforcement |
|---|---|---|---|
| Lint | ruff check | 0 errors | BLOCK |
| Format | ruff format --check | 0 diff | BLOCK |
| Unit tests | pytest tests/unit | 100% PASS | BLOCK |
| Integration tests | pytest tests/integration | 100% PASS | BLOCK |
| Coverage | pytest --cov | ≥80% (advisory until S5) | ADVISORY |
| Type checking | pyright | 0 errors on public API | ADVISORY |
| SAST | bandit -r src | 0 HIGH/CRITICAL | BLOCK |
| SCA | pip-audit | 0 CRITICAL CVEs | BLOCK |
| IaC security | checkov | 0 FAILED policies | BLOCK |
| Container scan | trivy | 0 HIGH/CRITICAL | BLOCK |
| Secrets | truffleHog | 0 detected | BLOCK |
| Infracost | infracost diff | ≤+5% cost delta | ADVISORY |
| Hook tests | bash tests/hooks/run-all-tests.sh | 190+ cases PASS | BLOCK |
| E2E smoke | Playwright @smoke | All PASS | BLOCK |
| Lighthouse | lhci autorun | ≥90 performance | ADVISORY |
| SBOM | cyclonedx-python | Generated | ADVISORY |
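The difference between BLOCK and ADVISORY enforcement can be sketched as follows. This is a minimal model under stated assumptions, not the CI implementation: `GateResult` and `evaluate` are hypothetical names, and each gate is reduced to a pass/fail flag plus its enforcement level from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    name: str
    passed: bool
    enforcement: str  # "BLOCK" or "ADVISORY", as in the Enforcement column


def evaluate(results):
    """Return (ok, blocking_failures, advisory_warnings).

    A failed BLOCK gate fails the whole run; a failed ADVISORY gate
    is only reported as a warning.
    """
    blocking = [r.name for r in results if not r.passed and r.enforcement == "BLOCK"]
    warnings = [r.name for r in results if not r.passed and r.enforcement == "ADVISORY"]
    return (not blocking, blocking, warnings)


# Illustrative results only; these are not real pipeline outcomes.
example = [
    GateResult("Lint", True, "BLOCK"),
    GateResult("Coverage", False, "ADVISORY"),
    GateResult("SAST", False, "BLOCK"),
]
```

With the illustrative `example` list, `evaluate(example)` marks the run as failed because of `SAST`, while the `Coverage` shortfall surfaces only as a warning.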
Further Reading
- Enterprise-Grade Testing & Quality Roadmap — full 4-level pyramid, compliance mapping, and 2026-2030 timeline
- DORA Metrics Target Framework — deployment frequency, lead time, MTTR, change failure rate
- Quality Baseline Assessment Template — how to run an honest quality audit
- Evaluation-First Principle — constitutional requirement for 100% test coverage