How Google runs production systems—and why 100% uptime is the wrong goal.
Most engineering teams treat operations as a necessary evil. Something breaks, someone fixes it, everyone moves on. This book argues that’s fundamentally wrong—and then shows you what Google does instead.
Site Reliability Engineering is less of a “how we do ops” manual and more of a philosophy book disguised as an engineering handbook. The core argument: reliability is a feature, and it deserves the same engineering rigor as any product feature you ship.
The idea that changed how I think about uptime
Here’s the concept that reframed everything for me: 100% availability is the wrong target.
That sounds counterintuitive. Shouldn’t we want our systems to never go down? The book makes a compelling case that no, we shouldn’t—because chasing 100% uptime means you can never ship anything. Every deploy is a risk. Every change could break something. If your goal is “never break,” your incentive is “never change.”
Instead, SRE introduces error budgets. It works like this:
- You agree on a Service Level Objective (SLO)—say, 99.9% availability
- That means you’re allowed 0.1% downtime—roughly 43 minutes per month
- That 0.1% is your error budget—you can “spend” it on deploys, experiments, migrations
- If you’ve burned through your budget, you slow down and focus on reliability
- If you have budget left, you ship faster
The elegance is that it turns the classic tension between “move fast” and “don’t break things” into a measurable, negotiable tradeoff. Product teams and SRE teams stop arguing about whether to ship—they look at the error budget and let the data decide.
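The budget arithmetic is simple enough to sketch. This toy helper (the function names and policy are mine, not the book's) turns an SLO into an allowed-downtime budget over a window and makes the ship/slow-down call:

```python
# Sketch of error-budget arithmetic (illustrative; names are hypothetical).

def error_budget_minutes(slo: float, window_minutes: int = 30 * 24 * 60) -> float:
    """Allowed downtime for a given SLO over a window (default: a 30-day month)."""
    return (1.0 - slo) * window_minutes

def can_ship(slo: float, downtime_so_far_minutes: float) -> bool:
    """Toy policy: keep shipping while budget remains, slow down once it's spent."""
    return downtime_so_far_minutes < error_budget_minutes(slo)

budget = error_budget_minutes(0.999)   # 99.9% SLO
print(round(budget, 1))                # ~43 minutes per 30-day month
print(can_ship(0.999, 10.0))           # budget left: keep shipping
print(can_ship(0.999, 50.0))           # budget exhausted: focus on reliability
```

In practice the budget is usually tracked as a burn rate over a rolling window rather than a single monthly counter, but the decision rule is the same.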
SLOs are the contract between your service and your users. Not “we promise 100% uptime” but “we promise this specific level of reliability, measured this specific way.” It’s honest, it’s measurable, and it forces you to define what “reliable enough” actually means for your service.
Toil: the silent killer of engineering teams
The book defines toil as manual, repetitive, automatable work that scales linearly with service growth and has no enduring value. Think: manually restarting a service, hand-editing config files, running the same diagnostic steps every time an alert fires.
The key insight isn’t that toil is bad (everyone knows that). It’s that toil is a budget problem, not a discipline problem. Google’s rule: SREs should spend no more than 50% of their time on operational work. If toil creeps above that, something is structurally wrong and needs engineering effort to fix.
This reframing matters. It’s not “we should automate more” (vague, easy to ignore). It’s “we have a hard cap on ops work, and if we exceed it, we stop doing other things until we fix it.” That’s a policy with teeth.
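A cap like that is easy to operationalize precisely because it is a number, not an aspiration. A minimal sketch, assuming a team tracks toil hours against total hours (the names here are mine, not Google's):

```python
# Sketch: checking a team's toil against a hard cap (illustrative names).

TOIL_CAP = 0.50  # the book's rule: ops work stays at or below 50% of SRE time

def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Share of working time spent on manual, repetitive operational work."""
    return toil_hours / total_hours

def over_cap(toil_hours: float, total_hours: float) -> bool:
    """True means: stop other work and invest in automation or handback."""
    return toil_fraction(toil_hours, total_hours) > TOIL_CAP

print(over_cap(15, 40))  # 37.5% toil: within budget
print(over_cap(25, 40))  # 62.5% toil: structurally wrong, needs engineering
```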
The four golden signals
When it comes to monitoring, the book cuts through the noise with four metrics that matter for any service:
- Latency — how long requests take (distinguish between successful and failed requests)
- Traffic — how much demand is hitting your system
- Errors — the rate of failed requests
- Saturation — how full your service is (CPU, memory, disk, connections)
If you’re drowning in dashboards and alerts, start here. These four signals cover the vast majority of what you need to know about a service’s health. Everything else is detail.
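To make the four signals concrete, here is a sketch that derives all four from a window of request records. The record format, sample data, and the nearest-rank percentile helper are my own illustration, not the book's:

```python
# Sketch: deriving the four golden signals from a window of request records.
# Record format and sample values are hypothetical, for illustration only.

requests = [
    # (latency_ms, succeeded)
    (12, True), (15, True), (900, False), (22, True), (18, True),
    (14, True), (1100, False), (16, True), (21, True), (13, True),
]
window_seconds = 60
cpu_utilization = 0.72  # saturation signal, sampled separately from the host

def percentile(values, p):
    """Nearest-rank percentile: crude, but good enough for a sketch."""
    s = sorted(values)
    return s[min(len(s) - 1, int(len(s) * p / 100))]

# Latency: measure successful requests separately, so slow failures
# (timeouts, errors) don't hide in the same distribution.
ok_latencies = [ms for ms, ok in requests if ok]
latency_p99 = percentile(ok_latencies, 99)

traffic_qps = len(requests) / window_seconds
error_rate = sum(1 for _, ok in requests if not ok) / len(requests)
saturation = cpu_utilization

print(f"latency p99: {latency_p99} ms")
print(f"traffic: {traffic_qps:.2f} qps")
print(f"error rate: {error_rate:.0%}")
print(f"saturation: {saturation:.0%}")
```

Note how the two failed requests (900 ms and 1100 ms) show up in the error rate but are excluded from the success-latency percentile, exactly the distinction the latency bullet above calls for.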
The book’s stance on alerting is equally sharp: every alert should be actionable, and every page should require human intelligence. If an alert fires and the response is always “restart the service,” that’s not an alert—that’s an automation opportunity.
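One way to read that stance in code: if an alert's runbook is a deterministic script, run the script instead of paging a human. A toy sketch of that routing idea (alert names and handlers are hypothetical, not from the book):

```python
# Sketch: route alerts to automation when the response is deterministic,
# and page a human only when judgment is required. Names are illustrative.

def restart_service(alert):
    return f"auto-remediated: restarted {alert['service']}"

def page_oncall(alert):
    return f"paged on-call: {alert['name']} needs human judgment"

# Alerts whose fix is always the same belong here, not in a pager.
AUTO_REMEDIATIONS = {"process_wedged": restart_service}

def route(alert):
    handler = AUTO_REMEDIATIONS.get(alert["name"], page_oncall)
    return handler(alert)

print(route({"name": "process_wedged", "service": "frontend"}))
print(route({"name": "error_budget_burn", "service": "frontend"}))
```

Real systems layer safeguards on top of this (rate limits on auto-restarts, escalation when remediation fails), but the dividing line is the one the book draws: pages are reserved for problems that require intelligence.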
The cultural shift that makes it all work
The technical concepts are valuable, but the organizational ideas are what make SRE actually sustainable.
Blameless postmortems are the foundation. When something breaks, you write up what happened, what the impact was, and what you’ll do to prevent it—without pointing fingers at individuals. The goal is to fix the system, not punish the person. This sounds obvious, but it’s remarkably hard to do in practice. The book provides concrete templates and examples of how Google structures these reviews.
The reasoning is simple: if people fear blame, they hide mistakes. If they hide mistakes, you can’t learn from them. If you can’t learn, the same failures keep happening.
The SRE role itself is a cultural statement. SREs are software engineers who happen to work on reliability—not ops people who learned to code. The distinction matters because it means SREs are expected to write code that eliminates their own operational burden. If you’re doing the same manual task repeatedly, your job is to automate yourself out of it.
The 50% rule enforces this: at least half of an SRE’s time should go toward engineering projects (building tools, improving automation, reducing toil). If ops work consistently exceeds 50%, the team pushes back—either by handing operational load back to the development team or by staffing up.
Who should read this
If you run production systems—or build software that other people run—this book will change how you think about reliability, on-call, and the relationship between development and operations.
It’s long. It’s dense in places. Not every chapter will apply to your situation (some are very Google-specific). But the core ideas—error budgets, SLOs, toil management, blameless culture—are universally applicable and will make your systems and your team more resilient.
Read it if you’ve ever been in a meeting where product wants to ship faster and ops wants to slow down. The error budget framework alone is worth the read.