Key Site Reliability Engineering (SRE) Practices
Now that I lead an SRE team, I’ve been reflecting on what we should be responsible for, what skills we need, and what SRE means in our particular context. This post outlines my current working view.
We work to ensure that systems are reliable, scalable, and efficient. Our goal is to help teams meet their reliability targets (for example, maintaining 99.99% availability) while also minimising Mean Time to Recovery (MTTR) when incidents occur. We aim to improve system reliability and performance while managing cost, both in terms of infrastructure and, crucially, cognitive load.
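To make a target like 99.99% concrete, it helps to translate it into permitted downtime. The sketch below is a minimal illustration, assuming availability is measured purely as uptime over a fixed period; the function name `allowed_downtime` and the 30-day period are my own choices for the example.

```python
from datetime import timedelta


def allowed_downtime(availability_target: float, period: timedelta) -> timedelta:
    """Return how much downtime an availability target permits over a period.

    availability_target is a fraction, e.g. 0.9999 for 99.99%.
    """
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    # The unavailable fraction of the period is the downtime allowance.
    return period * (1.0 - availability_target)


# Hypothetical example: a 99.99% target over a 30-day month.
monthly = allowed_downtime(0.9999, timedelta(days=30))
print(f"{monthly.total_seconds() / 60:.2f} minutes")
```

For a 30-day month this works out to roughly 4.32 minutes, which is a useful reminder of how little room a four-nines target leaves for manual recovery.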
Some core practices include:
Service Levels and Error Budgets
Define clear Service Level Objectives (SLOs) based on measurable Service Level Indicators (SLIs) such as latency or availability. Use error budgets to balance innovation and reliability, allowing controlled risk when deploying changes.
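As a rough sketch of how an error budget can gate risk, the example below computes the remaining budget for a request-based availability SLI. The `SloWindow` class, the `can_deploy` helper, and the 10% minimum-budget threshold are hypothetical names and values chosen for illustration, not a standard policy.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts observed over one SLO window (for example, 30 days)."""

    slo_target: float      # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int
    failed_requests: int

    @property
    def sli(self) -> float:
        """Measured availability: the fraction of requests that succeeded."""
        if self.total_requests == 0:
            return 1.0
        return 1.0 - self.failed_requests / self.total_requests

    @property
    def allowed_failures(self) -> float:
        """The error budget, expressed as a number of failed requests."""
        return self.total_requests * (1.0 - self.slo_target)

    @property
    def budget_remaining(self) -> float:
        """Fraction of the error budget left; negative means overspent."""
        allowed = self.allowed_failures
        if allowed == 0:
            return 1.0 if self.failed_requests == 0 else float("-inf")
        return 1.0 - self.failed_requests / allowed


def can_deploy(window: SloWindow, min_budget: float = 0.1) -> bool:
    """Hypothetical policy: allow risky changes only while budget remains."""
    return window.budget_remaining >= min_budget


window = SloWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(f"SLI={window.sli:.4%} budget left={window.budget_remaining:.0%}")
print("deploy allowed:", can_deploy(window))
```

In practice the policy is usually agreed with the service owners rather than hard-coded; the point is only that the budget turns "how reliable is reliable enough" into a number that can inform release decisions.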
Monitoring and Observability
Use monitoring to track metrics such as traffic, latency, and errors, and use observability (logs, traces, and detailed telemetry) to understand why issues occur.
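To illustrate the monitoring half, here is a small sketch that summarises traffic, error rate, and 95th-percentile latency from a batch of request records. The `Request` record, the `summarise` function, and the rule that any status of 500 or above counts as an error are assumptions for the example; a real setup would normally compute these in a metrics system rather than in application code.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code


def summarise(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Summarise traffic, errors, and latency for one observation window."""
    if not requests:
        raise ValueError("no requests to summarise")
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")

    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 95th percentile: the smallest value with at least
    # 95% of observations at or below it.
    rank = math.ceil(95 * len(latencies) / 100)
    errors = sum(1 for r in requests if r.status >= 500)

    return {
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "p95_latency_ms": latencies[rank - 1],
    }


# Hypothetical sample: 100 requests over 50 seconds, every tenth one failing.
sample = [Request(latency_ms=float(i), status=500 if i % 10 == 0 else 200)
          for i in range(1, 101)]
print(summarise(sample, window_seconds=50))
```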
Incident Management and Postmortems
Establish structured incident response processes: alerting, escalation, and on-call rotations. Conduct blameless postmortems to learn from failures and improve resilience.
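MTTR, mentioned in the introduction, can be derived from incident records. The sketch below assumes each incident is recorded as a pair of detection and resolution timestamps and measures recovery from detection; definitions of MTTR vary between teams, and the function name and sample data are made up for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(
    incidents: list[tuple[datetime, datetime]],
) -> timedelta:
    """Average time from detection to resolution across incidents.

    Each incident is a (detected_at, resolved_at) pair.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    durations = [resolved - detected for detected, resolved in incidents]
    if any(d < timedelta(0) for d in durations):
        raise ValueError("an incident was resolved before it was detected")
    return sum(durations, timedelta(0)) / len(durations)


# Hypothetical incident log: one 30-minute and one 90-minute incident.
log = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 30)),
    (datetime(2024, 2, 14, 22, 15), datetime(2024, 2, 14, 23, 45)),
]
print(mean_time_to_recovery(log))
```

A mean hides a lot, so it is worth reading alongside the individual postmortems rather than treating it as a target on its own.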
Chaos Engineering and Resilience Testing
Introduce controlled failures to test fault tolerance and validate recovery strategies, ensuring systems can withstand unexpected disruptions.
Cost and Efficiency Optimisation
Continuously monitor resource usage and optimise infrastructure for cost without compromising performance or reliability.
Collaboration and Culture
Foster a blameless, learning-oriented culture. Encourage transparency, maintain detailed runbooks, and share operational knowledge across teams.
Capacity Planning and Scalability
Forecast resource needs through capacity planning with users of our platforms, and use auto-scaling and load balancing to handle demand fluctuations without degradation.
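A back-of-the-envelope version of this forecast might look like the sketch below. It assumes compound monthly growth in peak traffic, a known per-instance throughput, and a fixed headroom margin; the function names and all the numbers are illustrative assumptions, and real forecasts should be checked against what platform users actually expect.

```python
import math


def forecast_peak(current_peak_rps: float, monthly_growth: float, months: int) -> float:
    """Project peak requests per second assuming compound monthly growth."""
    if months < 0:
        raise ValueError("months must not be negative")
    return current_peak_rps * (1 + monthly_growth) ** months


def instances_needed(peak_rps: float, per_instance_rps: float, headroom: float = 0.3) -> int:
    """Instances required to serve the peak with spare headroom."""
    if per_instance_rps <= 0:
        raise ValueError("per_instance_rps must be positive")
    return math.ceil(peak_rps * (1 + headroom) / per_instance_rps)


# Hypothetical service: 1,000 rps peak today, 10% monthly growth,
# each instance comfortably handling 150 rps.
today = instances_needed(1000, 150)
in_six_months = instances_needed(forecast_peak(1000, 0.10, 6), 150)
print(f"today: {today} instances, in six months: {in_six_months} instances")
```

Auto-scaling handles the short-term fluctuations; a forecast like this is more useful for spotting when quotas, budgets, or architectural limits will be reached.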
Automation and Toil Reduction
Automate repetitive operational work, from deployments to recovery, to reduce toil and increase reliability. Manage environments through Infrastructure as Code (IaC) for consistency and repeatability.
Adjacent Responsibilities
In our setup, some responsibilities naturally sit with other teams, though they closely intersect with SRE.
Change and Release Management
Deploy changes safely using canary releases (or blue/green), feature flags, and gradual rollouts. These methods help limit impact and allow quick rollback if needed.
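One common way to implement a gradual rollout behind a feature flag is to hash each user into a stable bucket and enable the flag for buckets below the current percentage. The sketch below assumes that approach; `in_rollout` and the flag name are hypothetical, and a real flag service would add targeting rules and a kill switch.

```python
import hashlib


def in_rollout(user_id: str, flag_name: str, percentage: float) -> bool:
    """Decide whether a user falls inside a percentage rollout.

    The same user always lands in the same bucket for a given flag, so
    raising the percentage only ever adds users and never flips existing
    ones back and forth.
    """
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    key = f"{flag_name}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Map the first 8 bytes of the hash to one of 10,000 buckets,
    # giving 0.01% granularity.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percentage * 100


# Hypothetical flag being rolled out to 5% of users.
for user in ("alice", "bob", "carol"):
    print(user, in_rollout(user, "new-checkout", 5))
```

Including the flag name in the hash key means different flags select different subsets of users, so the same people are not always the first to receive every change. Rolling back is then a matter of setting the percentage to zero.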
Security and Compliance
Integrate strong security practices, such as access control, encryption, and auditing, while maintaining compliance with relevant standards.
Core Skills for Effective SREs
A strong SRE combines technical depth with systems thinking and clear communication. The following skills help SREs succeed in delivering reliable, scalable services:
Technical Skills
- Systems engineering: Deep understanding of distributed systems, networking, and performance tuning.
- Programming and automation: Proficiency in languages such as Python, Go, and Bash for automating workflows and tooling.
- Infrastructure as Code (IaC): Experience with tools like Terraform, Ansible, or Pulumi for repeatable, versioned infrastructure.
- Monitoring and observability: Familiarity with systems such as Prometheus, Grafana, and OpenTelemetry.
- Cloud platforms: Competence in AWS, GCP, or Azure for designing and managing scalable cloud-native systems.
- CI/CD and release automation: Understanding of build pipelines, deployment strategies, and rollback mechanisms.
Operational and Analytical Skills
- Incident response: Calm, structured problem-solving under pressure, with a focus on root cause analysis.
- Capacity and performance analysis: Ability to anticipate scaling challenges and design for cost efficiency.
- Reliability metrics: Understanding and applying SLOs, SLIs, and error budgets effectively.
Collaboration and Mindset
- Communication: Clear, concise documentation and the ability to collaborate across engineering, product, and operations teams.
- Continuous improvement: A mindset of curiosity, learning, and iterating based on feedback and post-incident insights.
- Blameless culture: Openness to discussing failures constructively and driving systemic improvement rather than assigning fault.
