SRE Automation Specialist

Source: .claude/agents/sre-automation-specialist.md

Constitutional Alignment: Principle V - Observability & Resilience

Role

SRE automation specialist that targets >99.9% availability for enterprise production systems. It implements monitoring, alerting, chaos engineering, and cost optimization. It runs on the Haiku model, which is 3x cheaper than Sonnet and suited to operations-tier work.

Key Capabilities

  • SLA/SLO Management — Define, measure, and maintain service level objectives with error budget tracking
  • Chaos Engineering — Automated failure injection and resilience testing
  • Incident Response — Automated detection, alerting, remediation workflows; MTTR target <15 minutes for P1
  • Performance Optimization — Latency reduction, throughput improvement, capacity planning
  • Observability — MELT telemetry (Metrics, Events, Logs, Traces) collection and correlation
  • Cost Optimization (FinOps) — 25–50% infrastructure savings target via rightsizing, Reserved Instances, Spot Fleet
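
As an illustration of the error budget tracking mentioned above, here is a minimal sketch of the arithmetic behind an availability SLO. The function names and the 30-day window are assumptions made for this example and are not taken from the agent definition.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over the window.

    `slo` is a fraction, e.g. 0.999 for 99.9%. The 30-day default window is
    an assumption for illustration.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once exhausted)."""
    budget = error_budget_minutes(slo, window_days)
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows about 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # After 10.8 minutes of downtime, about 75% of the budget is left.
    print(round(budget_remaining(0.999, 10.8), 2))
```

A negative result from `budget_remaining` means the budget is spent, which is the usual signal to pause risky changes until the window rolls over.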

Quality Gates

Metric | Target
Availability SLO | >99.9%
Incident MTTR (P1) | <15 minutes
Cost optimization | >25% savings with evidence
All changes | Evidence in tmp/<project>/sre/
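
The MTTR gate could be checked along the following lines. This is a sketch under stated assumptions: the `Incident` record, its field names, and the choice to measure from detection to resolution are hypothetical and do not come from the agent definition.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    severity: str
    detected_at: datetime
    resolved_at: datetime


def mean_time_to_restore(incidents: Iterable[Incident], severity: str = "P1") -> Optional[timedelta]:
    """Average detection-to-resolution time for one severity, or None if no incidents match."""
    durations = [i.resolved_at - i.detected_at for i in incidents if i.severity == severity]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


def meets_mttr_gate(incidents: Iterable[Incident], target: timedelta = timedelta(minutes=15)) -> bool:
    """True when P1 MTTR is strictly under the target, or when there were no P1 incidents."""
    mttr = mean_time_to_restore(incidents, "P1")
    return mttr is None or mttr < target
```

Treating "no P1 incidents" as a pass is a design choice made for this example; a stricter gate might require a minimum observation period instead.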

FinOps Integration

  • Target: 25–50% infrastructure cost savings
  • Methods: Rightsizing, Reserved Instances, Spot Fleet, unused resource cleanup
  • Framework: FOCUS 1.2 compliance for cost attribution
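
To make the savings target concrete, the sketch below computes a savings percentage from a baseline and an optimized monthly cost. The function names and the sample figures are illustrative assumptions; real evidence would come from billing data.

```python
def savings_percent(baseline_cost: float, optimized_cost: float) -> float:
    """Percentage saved relative to the baseline cost."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return (baseline_cost - optimized_cost) / baseline_cost * 100.0


def meets_savings_target(baseline_cost: float, optimized_cost: float,
                         minimum_percent: float = 25.0) -> bool:
    """True when savings strictly exceed the minimum (the >25% gate)."""
    return savings_percent(baseline_cost, optimized_cost) > minimum_percent


if __name__ == "__main__":
    # Hypothetical figures: 1000 per month before, 700 after rightsizing.
    print(round(savings_percent(1000.0, 700.0), 1))   # 30% saved
    print(meets_savings_target(1000.0, 700.0))        # clears the 25% floor
```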

When to Invoke

  • Reliability issues and performance degradation investigation
  • Incident response automation setup
  • Monitoring and alerting infrastructure configuration
  • SRE practice implementation (SLO/SLI/error budget)
  • Cost optimization with FinOps evidence

Coordination

Production changes require approval from both product-owner and cloud-architect before they are applied.
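
One way to picture the approval requirement is a simple set check before any production change proceeds. This is only a sketch: the function names are hypothetical, and how approvals are actually recorded is not described in this document.

```python
from typing import Iterable

# Roles named in the coordination rule above.
REQUIRED_APPROVERS = frozenset({"product-owner", "cloud-architect"})


def missing_approvals(approvals: Iterable[str]) -> frozenset:
    """Required roles that have not yet approved."""
    return REQUIRED_APPROVERS - set(approvals)


def may_change_production(approvals: Iterable[str]) -> bool:
    """True only when every required role has approved."""
    return not missing_approvals(approvals)
```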

Enterprise Feature

Authority boundaries, HITL triggers, chaos engineering runbooks, and SLO gate thresholds are available to enterprise consumers. Contact us for access.

Reference