99.99% availability. Four nines. It sounds like a marketing number, but it has a precise meaning: your service can be down for no more than 52 minutes per year. For a payment platform, an e-commerce site, or a logistics system, those 52 minutes can cost millions.
Most organisations say they want four-nines availability. Very few have the engineering practices to deliver it. Site Reliability Engineering (SRE), the discipline that originated at Google, is what makes it achievable.
This article walks through the core SRE practices we implement for enterprise clients, and how you can apply them regardless of your stack or scale.
What SRE Actually Is
SRE is what happens when you ask a software engineer to solve operational problems. Instead of writing runbooks and reacting to incidents, SREs build systems, automation, and tooling that make operations reliable by design.
The SRE model, in one sentence: treat operations like a software problem, with the same engineering rigour.
The key distinction from traditional ops:
- Traditional Ops: Manually monitor, manually respond, document runbooks that no one updates
- SRE: Define reliability targets mathematically, automate response, and spend the engineering time freed from toil on eliminating the root causes of problems
SLIs, SLOs, and Error Budgets
This is the foundation of SRE. Without it, reliability conversations are just opinions. With it, they become engineering decisions.
Service Level Indicators (SLIs)
An SLI is a precise measurement of some aspect of your service. Examples:
- Request latency: percentage of requests served in under 200ms
- Availability: percentage of requests that succeed (non-5xx response)
- Error rate: percentage of requests that return an error
- Freshness: percentage of data updated within the last 5 minutes
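To make the definition concrete, an SLI usually reduces to a ratio of good events to total events over a window. A minimal sketch (the counters and numbers are illustrative; in practice they come from your metrics system):

```python
# Minimal sketch: an SLI is good events / total valid events over a window.
# The inputs are illustrative; in practice they come from Prometheus counters.

def availability_sli(total_requests: int, failed_requests: int) -> float:
    """Percentage of requests that succeeded (non-5xx)."""
    if total_requests == 0:
        return 100.0
    return 100.0 * (total_requests - failed_requests) / total_requests

def latency_sli(latencies_ms: list[float], threshold_ms: float = 200) -> float:
    """Percentage of requests served under the latency threshold."""
    if not latencies_ms:
        return 100.0
    fast = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return 100.0 * fast / len(latencies_ms)

print(availability_sli(total_requests=1_000_000, failed_requests=420))  # 99.958
print(latency_sli([120, 180, 250, 90], threshold_ms=200))               # 75.0
```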
Service Level Objectives (SLOs)
An SLO is a target for your SLI. It's an internal reliability commitment, not a customer-facing one:
- "99.9% of requests will complete within 200ms over a 28-day rolling window"
- "Availability will be at least 99.95% over a 28-day rolling window"
"An SLO is not aspirational. It is a precise engineering target that drives real decisions about when to deploy, when to slow down feature work, and when to declare an incident."
Error Budgets
If your SLO is 99.9% availability, then 0.1% downtime is your error budget — about 43 minutes per month. This budget is the most powerful concept in SRE.
- If you have budget remaining: Deploy freely. Take risks. Move fast.
- If your budget is depleted: Feature work stops. All engineering focuses on reliability until the budget recovers.
Error budgets transform "reliability vs velocity" from a political argument into a mathematical one. The data tells you which mode you should be in.
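The arithmetic is simple enough to sketch in a few lines. This assumes the 28-day window used in the SLO examples above; the traffic figures are made up:

```python
# Illustrative error-budget arithmetic for an availability SLO.
WINDOW_DAYS = 28                    # rolling SLO window
SLO = 0.999                         # 99.9% availability target

budget_fraction = 1 - SLO           # 0.1% of requests may fail
budget_minutes = WINDOW_DAYS * 24 * 60 * budget_fraction  # ~40 min of downtime per window

def budget_remaining(total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    allowed_failures = total_requests * budget_fraction
    return 1 - (failed_requests / allowed_failures)

def burn_rate(error_rate: float) -> float:
    """How fast the budget is burning: 1.0 = exactly on budget; ~14 = gone in about 2 days."""
    return error_rate / budget_fraction

print(f"Budget: {budget_minutes:.0f} minutes of downtime per {WINDOW_DAYS} days")
print(f"Remaining: {budget_remaining(10_000_000, 6_000):.1%}")   # 40.0%
print(f"Burn rate at 1% errors: {burn_rate(0.01):.1f}x")          # 10.0x
```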
Building Your Observability Stack
You cannot achieve reliability without deep observability. "Monitoring" tells you something is wrong. "Observability" tells you why.
The three pillars of observability:
Metrics
Quantitative measurements over time. The foundation of SLI tracking.
- Infrastructure metrics: CPU, memory, disk, network (node_exporter → Prometheus)
- Application metrics: request rate, error rate, latency, queue depth (Prometheus + custom instrumentation)
- Business metrics: orders per minute, active users, revenue per second
Tools: Prometheus + Grafana, Datadog, New Relic
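If you run Prometheus, application metrics typically come from a client library embedded in the service. A minimal sketch using the Python prometheus_client; the metric names, labels, and port are illustrative:

```python
# Minimal application instrumentation with the Prometheus Python client.
# Metric names, labels, and the port are illustrative.
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["method", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")

@LATENCY.time()                         # records each call's duration into the histogram
def handle_request() -> int:
    time.sleep(random.uniform(0.01, 0.2))          # stand-in for real work
    return 500 if random.random() < 0.01 else 200  # ~1% simulated errors

if __name__ == "__main__":
    start_http_server(8000)             # exposes /metrics for Prometheus to scrape
    while True:
        status = handle_request()
        REQUESTS.labels(method="GET", status=str(status)).inc()
```

From counters like these, the availability and latency SLIs above fall out as simple ratios at query time.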
Logs
Structured, searchable records of what happened. Logs answer "what" questions.
- Structure your logs as JSON — not free-text strings
- Include trace IDs to correlate logs across services
- Centralise with Elasticsearch/OpenSearch + Kibana, or Loki + Grafana
- Set retention policies: 30 days hot, 90 days warm, 1 year cold
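A hedged sketch of the first two points, using only the standard library. The field names and the way the trace ID is obtained are illustrative; most teams use a library such as structlog or framework middleware for this:

```python
# Minimal structured (JSON) logging with a trace ID, stdlib only.
# Field names and trace-ID propagation are illustrative.
import json
import logging
import sys
import time
import uuid

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": time.time(),
            "level": record.levelname,
            "service": "checkout",                       # hypothetical service name
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

trace_id = uuid.uuid4().hex                              # in practice, taken from the incoming request
logger.info("payment authorised", extra={"trace_id": trace_id})
```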
Traces
Distributed traces follow a request across every service it touches. Essential for microservices. Traces answer "where" questions: which service in a chain of 12 added 800ms of latency?
Tools: Jaeger, Tempo, AWS X-Ray, Honeycomb
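A minimal tracing sketch with the OpenTelemetry Python SDK, exporting spans over OTLP to a backend such as Tempo or Jaeger. The endpoint and span names are illustrative:

```python
# Minimal OpenTelemetry tracing sketch; endpoint and span names are illustrative.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://tempo:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")

with tracer.start_as_current_span("place_order") as span:
    span.set_attribute("order.id", "12345")
    with tracer.start_as_current_span("charge_card"):
        pass  # downstream calls are usually auto-instrumented by library integrations
```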
Putting this together, a typical end-to-end stack looks like:
- Metrics: Prometheus → Grafana (with Thanos for long-term storage)
- Logs: Fluent Bit → Loki → Grafana
- Traces: OpenTelemetry → Tempo → Grafana
- Alerts: Alertmanager → PagerDuty → Slack
- Dashboards: Grafana with standardised team dashboards for every service
Alerting That Doesn't Cry Wolf
Alert fatigue is one of the most damaging problems in operations. When every alert is noise, real incidents get missed. SRE has a clear principle here:
Every alert must be actionable. If you can't describe exactly what to do when you receive an alert, the alert should not exist.
Alert design principles we enforce:
- Alert on symptoms, not causes: Alert when users are affected, not when a CPU threshold is hit
- Use SLO-based alerts: Alert when error budget burn rate is too high — not on arbitrary thresholds
- Multi-window alerts: Catch both fast-burning (1h) and slow-burning (6h) reliability problems
- Severity levels: Page immediately (P1), page within 30 min (P2), ticket next working day (P3)
- Regular alert reviews: Monthly audit of all alerts — delete anything that hasn't triggered a meaningful action in 3 months
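The SLO-based and multi-window principles above translate into burn-rate logic. A hedged sketch; the 14.4x and 6x thresholds follow the commonly cited pattern for a roughly month-long window, and should be tuned per SLO:

```python
# Sketch of SLO-based, multi-window burn-rate paging logic.
# Thresholds are the commonly cited defaults: page when the budget burns
# ~14.4x too fast over 1h (confirmed over 5m), or ~6x too fast over 6h
# (confirmed over 30m). Tune per SLO.
SLO = 0.999
BUDGET = 1 - SLO

def burn_rate(error_rate: float) -> float:
    return error_rate / BUDGET

def should_page(err_1h: float, err_5m: float, err_6h: float, err_30m: float) -> bool:
    """err_* are observed error-rate fractions over each look-back window."""
    fast_burn = burn_rate(err_1h) > 14.4 and burn_rate(err_5m) > 14.4
    slow_burn = burn_rate(err_6h) > 6 and burn_rate(err_30m) > 6
    return fast_burn or slow_burn

# Example: 2% errors over the last hour and last 5 minutes -> 20x burn -> page.
print(should_page(err_1h=0.02, err_5m=0.02, err_6h=0.004, err_30m=0.02))  # True
```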
Incident Response: From Chaos to Calm
Even with the best reliability engineering, incidents happen. The difference between mature and immature organisations is not the frequency of incidents — it's the quality of response.
The Incident Response Lifecycle
- Detection: An alert fires. The on-call engineer acknowledges within 5 minutes.
- Triage: Determine severity (P1 all-hands vs P2 single engineer). Declare incident in Slack channel. Assign incident commander.
- Mitigation: Restore service first, investigate later. Roll back, shift traffic, flip feature flags. Prioritise MTTR over root-cause analysis during the incident.
- Resolution: Service restored. Document what happened in the incident timeline.
- Post-Incident Review (PIR): Blameless retrospective within 48 hours. Five whys analysis. Action items assigned with owners and due dates.
Runbooks and Playbooks
Every recurring alert type should have a runbook: a step-by-step guide for the on-call engineer. Good runbooks:
- Start with "what does this alert mean in plain English"
- Include specific commands to run, not vague instructions
- Include links to relevant dashboards and logs
- Have a decision tree for escalation
- Are stored in version control and reviewed quarterly
Toil Reduction: The SRE Virtuous Cycle
Toil is the repetitive, manual operational work that keeps the lights on but adds no lasting value. Deploying manually, rotating credentials by hand, resizing databases on request.
SRE teams have a rule: spend no more than 50% of time on toil. The other 50% must go to engineering projects that reduce future toil. This creates a virtuous cycle where reliability improves continuously.
Common toil elimination projects:
- Automating certificate rotation (previously: engineer manually rotates 40 certs quarterly)
- Self-service database provisioning via Terraform (previously: DBA ticket with 3-day SLA)
- Automated canary deployments with auto-rollback (previously: manual deploy + watch dashboards for 30 min)
- Automated capacity scaling (previously: on-call called at 2am when traffic spikes)
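Taking the first project above as an example, even the detection half of certificate rotation (checking expiry automatically instead of tracking it by hand) removes real toil. A hedged sketch; the hostnames and 30-day threshold are illustrative:

```python
# Sketch: automated TLS certificate expiry checks, the detection half of
# automating cert rotation. Hostnames and the 30-day threshold are illustrative.
import socket
import ssl
import time

HOSTS = ["api.example.com", "www.example.com"]   # hypothetical endpoints
WARN_DAYS = 30

def days_until_expiry(host: str, port: int = 443) -> int:
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    return int((expires - time.time()) // 86400)

for host in HOSTS:
    remaining = days_until_expiry(host)
    if remaining < WARN_DAYS:
        print(f"RENEW: {host} certificate expires in {remaining} days")
```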
Chaos Engineering: Breaking Things on Purpose
You don't know how your system behaves under failure until you test it under controlled failure conditions. Chaos engineering does exactly this.
Starting small:
- Kill a random pod in Kubernetes during business hours. Does the application recover automatically?
- Introduce 200ms of artificial latency to a service. Do its callers time out gracefully, or does the latency cascade?
- Take down a database replica. Does the application fail over without user impact?
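The first experiment needs nothing more than the official Kubernetes Python client. A minimal sketch; the namespace and label selector are illustrative, and this is the same idea that tools like LitmusChaos productionise:

```python
# Sketch of the simplest chaos experiment: delete one random pod and watch recovery.
# Namespace and label selector are illustrative; run this in staging first.
import random
from kubernetes import client, config

config.load_kube_config()          # or load_incluster_config() when running inside the cluster
core = client.CoreV1Api()

pods = core.list_namespaced_pod("staging", label_selector="app=checkout").items
victim = random.choice(pods)
print(f"Deleting pod {victim.metadata.name}")
core.delete_namespaced_pod(victim.metadata.name, "staging")
# Now watch: does the Deployment replace the pod, and do the SLO dashboards stay green?
```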
Tools: LitmusChaos, Chaos Monkey, AWS Fault Injection Simulator
Start in staging. Graduate to production only when you have high confidence in your monitoring and rollback capabilities.
What 99.99% Uptime Actually Requires
To be clear about the engineering required to hit four nines:
- No single points of failure anywhere in the stack
- Multi-zone (ideally multi-region) deployment for critical services
- All deployments via canary or blue/green — never full cutover without validation
- P1 incident acknowledgement under 5 minutes, MTTR under 30 minutes
- Automated health checks and self-healing (Kubernetes liveness/readiness probes, auto-restart)
- Database failover under 30 seconds with no data loss
- CDN for all static assets and appropriate edge caching
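The health-check item, for instance, starts with endpoints the probes can call. A minimal sketch using only the standard library; the paths, port, and readiness check are illustrative:

```python
# Sketch of the health endpoints that Kubernetes liveness/readiness probes call.
# Paths, port, and the readiness check are illustrative.
from http.server import BaseHTTPRequestHandler, HTTPServer

def dependencies_ready() -> bool:
    return True   # in practice: check database, cache, and downstream dependencies

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":        # liveness: is the process alive?
            self.send_response(200)
        elif self.path == "/readyz":       # readiness: can we serve traffic right now?
            self.send_response(200 if dependencies_ready() else 503)
        else:
            self.send_response(404)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), HealthHandler).serve_forever()
```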
"The goal of SRE is not to prevent all incidents. It's to ensure that when things go wrong — and they will — your system recovers faster than anyone notices."
Where to Start
If you're building SRE practices from scratch, we recommend this sequence:
- Define SLOs for your top 3 most critical user-facing services
- Instrument those services with the three observability pillars (metrics, logs, traces)
- Build SLO-based alerting and assign an on-call rotation
- Write runbooks for your 5 most common alert types
- Run your first game day / chaos experiment in staging
- Hold your first blameless post-incident review after the next P1
- Measure your toil and start your first toil-elimination project
SRE is a journey, not a destination. Every incident is a learning opportunity. Every piece of toil eliminated is reliability compounding over time.
If you'd like a free reliability review of your current infrastructure — including a gap analysis against SRE best practices — get in touch with our team.