99.99% availability. Four nines. It sounds like a marketing number, but it has a precise meaning: your service can be down for no more than 52 minutes per year. For a payment platform, an e-commerce site, or a logistics system, those 52 minutes can cost millions.

Most organisations say they want four-nines availability. Very few have the engineering practices to deliver it. Site Reliability Engineering (SRE), the discipline that originated at Google, is what makes it achievable.

This article walks through the core SRE practices we implement for enterprise clients, and how you can apply them regardless of your stack or your scale.

What SRE Actually Is

SRE is what happens when you ask a software engineer to solve operational problems. Instead of writing runbooks and reacting to incidents, SREs build systems, automation, and tooling that make operations reliable by design.

The SRE model, in one sentence: treat operations like a software problem, with the same engineering rigour.

The key distinction from traditional ops: a traditional ops team scales by adding people and process, while an SRE team scales by automating itself out of repetitive work.

SLIs, SLOs, and Error Budgets

This is the foundation of SRE. Without it, reliability conversations are just opinions. With it, they become engineering decisions.

Service Level Indicators (SLIs)

An SLI is a precise measurement of some aspect of your service. Examples: the proportion of HTTP requests that succeed, the proportion of requests served faster than a latency threshold, the freshness of data in a pipeline, the durability of stored objects.

Service Level Objectives (SLOs)

An SLO is a target for your SLI, for example "99.9% of requests succeed, measured over a rolling 30-day window". It's an internal reliability commitment, not a customer-facing one (that's an SLA):

"An SLO is not aspirational. It is a precise engineering target that drives real decisions about when to deploy, when to slow down feature work, and when to declare an incident."

Error Budgets

If your SLO is 99.9% availability, then 0.1% downtime is your error budget — about 43 minutes per month. This budget is the most powerful concept in SRE.

Error budgets transform "reliability vs velocity" from a political argument into a mathematical one: while budget remains, keep shipping features; once it is spent, slow releases down and put the time into reliability work. The data tells you which mode you should be in.
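To make that concrete, here is a minimal sketch of the arithmetic in Python. The SLO, request counts, and mode labels are illustrative assumptions, not figures from a real system.

```python
# A minimal error-budget check; numbers and labels are illustrative assumptions.

def error_budget_report(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Report how much of the error budget the observed failures have consumed."""
    failure_ratio = failed_requests / total_requests
    budget = 1 - slo                      # e.g. 0.001 for a 99.9% SLO
    consumed = failure_ratio / budget     # 1.0 means the whole budget is gone
    return {
        "sli": 1 - failure_ratio,
        "budget_remaining": max(0.0, 1 - consumed),
        "mode": "feature work" if consumed < 1 else "reliability work",
    }

# 99.9% SLO, 50M requests, 30k failures: 60% of the budget burned, 40% left.
print(error_budget_report(slo=0.999, total_requests=50_000_000, failed_requests=30_000))
```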

Building Your Observability Stack

You cannot achieve reliability without deep observability. "Monitoring" tells you something is wrong. "Observability" tells you why.

The three pillars of observability:

Metrics

Quantitative measurements over time. The foundation of SLI tracking.

Tools: Prometheus + Grafana, Datadog, New Relic
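As a sketch of how a service ends up with SLI-grade metrics, the snippet below instruments a handler with the open-source prometheus_client library. The route name, port, and handler are illustrative assumptions.

```python
# Minimal request instrumentation with prometheus_client (illustrative names).
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["route", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds", ["route"])

def handle_checkout(request):
    start = time.monotonic()
    status = "200"
    try:
        ...  # real handler logic goes here
    except Exception:
        status = "500"
        raise
    finally:
        # These two series are exactly what availability and latency SLIs are built from.
        REQUESTS.labels(route="/checkout", status=status).inc()
        LATENCY.labels(route="/checkout").observe(time.monotonic() - start)

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
```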

Logs

Structured, searchable records of what happened. Logs answer "what" questions.
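A sketch of what "structured" means in practice, using only the Python standard library; the field names and the extra_fields convention are assumptions for illustration.

```python
# Emit one JSON object per log line so logs can be filtered and aggregated by field.
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            **getattr(record, "extra_fields", {}),  # structured context, not string interpolation
        }
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment captured", extra={"extra_fields": {"order_id": "A-1042", "latency_ms": 187}})
```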

Traces

Distributed traces follow a request across every service it touches. Essential for microservices. Answers "where" questions — which service in a chain of 12 added 800ms of latency?

Tools: Jaeger, Tempo, AWS X-Ray, Honeycomb
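The sketch below shows manual instrumentation with the OpenTelemetry Python API. Exporter configuration (to Jaeger, Tempo, or another backend) is assumed to happen elsewhere, and the span and attribute names are illustrative.

```python
# Each nested span becomes one segment of the distributed trace,
# so a slow downstream call is visible at a glance.
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")

def charge_card(order): ...          # placeholder downstream calls
def reserve_inventory(order): ...

def place_order(order):
    with tracer.start_as_current_span("place_order") as span:
        span.set_attribute("order.id", order["id"])
        with tracer.start_as_current_span("charge_card"):
            charge_card(order)
        with tracer.start_as_current_span("reserve_inventory"):
            reserve_inventory(order)

place_order({"id": "A-1042"})
```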

Our Standard Observability Stack

Alerting That Doesn't Cry Wolf

Alert fatigue is one of the most damaging problems in operations. When every alert is noise, real incidents get missed. SRE has a clear principle here:

Every alert must be actionable. If you can't describe exactly what to do when you receive an alert, the alert should not exist.

Alert design principles we enforce: alert on symptoms (SLO burn) rather than on individual causes, page a human only when something needs action right now, route everything else to tickets or dashboards, and link every page to its runbook.
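One common way to make "alert on SLO burn" concrete is a multiwindow burn-rate check, as described in the Google SRE Workbook. The sketch below assumes a 99.9% SLO over 30 days; the 14.4 threshold is the commonly cited value for pairing a one-hour window with a five-minute window, and the error ratios would come from your metrics backend.

```python
# Burn rate = how many times faster than "exactly on budget" errors are arriving.
SLO = 0.999
BUDGET = 1 - SLO  # 0.1% of requests may fail over the 30-day window

def burn_rate(error_ratio: float) -> float:
    return error_ratio / BUDGET

def should_page(error_ratio_1h: float, error_ratio_5m: float) -> bool:
    # Long window proves the burn is sustained; short window proves it is still happening.
    return burn_rate(error_ratio_1h) > 14.4 and burn_rate(error_ratio_5m) > 14.4

# Example: 2% of requests failing in both windows is a 20x burn -> page someone.
print(should_page(error_ratio_1h=0.02, error_ratio_5m=0.02))
```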

Incident Response: From Chaos to Calm

Even with the best reliability engineering, incidents happen. The difference between mature and immature organisations is not the frequency of incidents — it's the quality of response.

The Incident Response Lifecycle

  1. Detection: Alert fires. The on-call engineer acknowledges it within 5 minutes.
  2. Triage: Determine severity (P1 all-hands vs P2 single engineer). Declare incident in Slack channel. Assign incident commander.
  3. Mitigation: Restore service first, investigate later. Rollback, traffic shifting, feature flags. MTTR over root cause during the incident.
  4. Resolution: Service restored. Document what happened in the incident timeline.
  5. Post-Incident Review (PIR): Blameless retrospective within 48 hours. Five whys analysis. Action items assigned with owners and due dates.

Runbooks and Playbooks

Every recurring alert type should have a runbook: a step-by-step guide for the on-call engineer. Good runbooks are linked directly from the alert, start with the exact commands and queries to run, spell out escalation criteria, and are updated after every incident in which they are used.

Toil Reduction: The SRE Virtuous Cycle

Toil is the repetitive, manual operational work that keeps the lights on but adds no lasting value: deploying manually, rotating credentials by hand, resizing databases on request.

SRE teams have a rule: spend no more than 50% of time on toil. The other 50% must go to engineering projects that reduce future toil. This creates a virtuous cycle where reliability improves continuously.

Common toil elimination projects: fully automated deployment pipelines, self-service environment provisioning, automated certificate and credential rotation, and auto-remediation of well-understood failure modes (one such remediation is sketched below).
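As an illustration of auto-remediation, the sketch below restarts a crash-looping deployment instead of paging someone to run kubectl by hand. It uses the open-source kubernetes Python client; the namespace, label selector, deployment name, and restart threshold are all assumptions, and a real version would log its actions and rate-limit itself.

```python
# Restart a deployment whose pods are crash-looping, the same way
# `kubectl rollout restart` does: by bumping a pod-template annotation.
from datetime import datetime, timezone
from kubernetes import client, config

RESTART_THRESHOLD = 5  # container restarts before we intervene

def remediate_crashloops(namespace: str, deployment: str, selector: str) -> None:
    config.load_kube_config()  # use load_incluster_config() when running inside the cluster
    core, apps = client.CoreV1Api(), client.AppsV1Api()

    pods = core.list_namespaced_pod(namespace, label_selector=selector)
    crashing = [
        pod.metadata.name
        for pod in pods.items
        for cs in (pod.status.container_statuses or [])
        if cs.restart_count >= RESTART_THRESHOLD
    ]
    if not crashing:
        return

    patch = {"spec": {"template": {"metadata": {"annotations": {
        "kubectl.kubernetes.io/restartedAt": datetime.now(timezone.utc).isoformat()}}}}}
    apps.patch_namespaced_deployment(deployment, namespace, patch)

remediate_crashloops("payments", "checkout", "app=checkout")
```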

Chaos Engineering: Breaking Things on Purpose

You don't know how your system behaves under failure until you inject failure deliberately, under controlled conditions. That is exactly what chaos engineering does.

Starting small: kill a single pod and confirm traffic fails over cleanly, inject latency into one downstream dependency, fill the disk on one node, block access to one external API.

Tools: LitmusChaos, Chaos Monkey, AWS Fault Injection Simulator

Start in staging. Graduate to production only when you have high confidence in your monitoring and rollback capabilities.
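A minimal pod-kill experiment, in the spirit of Chaos Monkey, is a reasonable first step. The sketch below uses the kubernetes Python client; the staging namespace and label selector are assumptions, and per the advice above it belongs in staging until your monitoring and rollback story is solid.

```python
# Terminate one randomly chosen pod and watch whether the service stays within its SLO.
import random
from kubernetes import client, config

def kill_random_pod(namespace: str, selector: str) -> str:
    config.load_kube_config()
    core = client.CoreV1Api()
    pods = core.list_namespaced_pod(namespace, label_selector=selector).items
    if not pods:
        raise RuntimeError("no matching pods to terminate")
    victim = random.choice(pods).metadata.name
    core.delete_namespaced_pod(victim, namespace)
    return victim

print(kill_random_pod("staging", "app=checkout"))
```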

What 99.99% Uptime Actually Requires

To be clear about the engineering required to hit four nines: redundancy across availability zones, automated failover, zero-downtime deployments (canary or blue-green), SLO-based alerting with a practised on-call rotation, and recovery paths that are automated rather than manual. The quick arithmetic below shows why.
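A few lines of arithmetic make the point; the figures follow directly from the availability percentages (using a 365.25-day year).

```python
# Downtime allowance per "nine" of availability.
MINUTES_PER_YEAR = 365.25 * 24 * 60

for name, target in [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]:
    per_year = (1 - target) * MINUTES_PER_YEAR
    print(f"{name} ({target:.5f}): {per_year:.1f} min/year, "
          f"{per_year / 12:.1f} min/month, {per_year / 52:.1f} min/week")

# Four nines leaves roughly 52 minutes per year, about a minute per week.
# At that scale a human-only response is already too slow; recovery has to be automated.
```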

"The goal of SRE is not to prevent all incidents. It's to ensure that when things go wrong — and they will — your system recovers faster than anyone notices."

Where to Start

If you're building SRE practices from scratch, we recommend this sequence:

  1. Define SLOs for your top 3 most critical user-facing services
  2. Instrument those services with the three observability pillars (metrics, logs, traces)
  3. Build SLO-based alerting and assign an on-call rotation
  4. Write runbooks for your 5 most common alert types
  5. Run your first game day / chaos experiment in staging
  6. Hold your first blameless post-incident review after the next P1
  7. Measure your toil and start your first toil-elimination project

SRE is a journey, not a destination. Every incident is a learning opportunity. Every piece of toil eliminated is reliability compounding over time.

If you'd like a free reliability review of your current infrastructure — including a gap analysis against SRE best practices — get in touch with our team.