# Preface
[[Site Reliability Engineering]]
# Introduction
Benjamin Treynor Sloss, originator of the term SRE
> Hope is not a strategy.
Systems do not run themselves, so how should they be run?
- [[Sysadmin Approach to Service Management]]
- [[The Google Approach to Service Management]]
## Tenets of SRE
SREs are responsible for a service’s:
- availability
- latency
- performance
- efficiency
- change management
- monitoring
- emergency response
- capacity planning
### Ensuring a Durable Focus on Engineering
Operational work is capped (at Google, 50% of an SRE’s time) and monitored, so that any excess is redirected to the product development teams
SRE shouldn’t receive more than two pager events per 8-12 hour shift, allowing the engineer to:
- handle events accurately and properly
- clean up and restore normal service
- conduct a postmortem
More events will degrade the quality of work, preventing problems from being investigated properly, adding to technical debt
Conversely, if SREs consistently get fewer than one event per shift, why are they on call?
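The pager-load rule above can be sketched as a small check. This is a minimal illustration, not a definitive policy: the function name `classify_pager_load` is hypothetical, and reading "consistently" as "the mean over recent shifts" is an assumption made for the example.

```python
from enum import Enum


class PagerLoad(Enum):
    TOO_LOW = "too low"
    SUSTAINABLE = "sustainable"
    TOO_HIGH = "too high"


# Upper bound from the notes: at most two pager events per 8-12 hour shift.
MAX_EVENTS_PER_SHIFT = 2


def classify_pager_load(events_per_shift: list[int]) -> PagerLoad:
    """Classify on-call load from the event counts of recent shifts.

    Assumption: "consistently" is approximated by the mean over the
    shifts provided; a real policy might use a different statistic.
    """
    if not events_per_shift:
        raise ValueError("need at least one shift")
    mean = sum(events_per_shift) / len(events_per_shift)
    if mean > MAX_EVENTS_PER_SHIFT:
        return PagerLoad.TOO_HIGH  # quality of work degrades, debt accumulates
    if mean < 1:
        return PagerLoad.TOO_LOW  # on-call time is arguably wasted
    return PagerLoad.SUSTAINABLE
```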
Postmortems should be written for all significant incidents, establishing in detail:
- what happened
- root causes
- corrective actions
- improvements in how to address the problem next time
Foster a blame-free postmortem culture
Goal: expose faults and fix them, rather than avoid or minimize them
### Pursuing Maximum Change Velocity Without Violating a Service’s SLO
Conflict between dev and SRE teams: pace of innovation vs. product stability
[[Error Budget]]
- resolves conflict between dev and SRE teams
- SRE doesn’t promise “zero outages”
- both teams spend the error budget to get maximum feature velocity
- outages are no longer feared, but expected and accepted
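The arithmetic behind an error budget can be sketched as follows. This assumes a request-based SLO (the fraction of requests that must succeed over some window); the function names and the example numbers are hypothetical.

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Number of requests allowed to fail without violating the SLO.

    Assumption: the SLO is expressed as the fraction of successful
    requests, e.g. 0.999 for "three nines".
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    # round() absorbs floating-point noise such as 1000.0000000000009.
    return round((1.0 - slo) * total_requests)


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> int:
    """Unspent error budget in requests (negative when overspent)."""
    return error_budget(slo, total_requests) - failed_requests


def can_release(slo: float, total_requests: int, failed_requests: int) -> bool:
    """Releases continue only while some error budget is left."""
    return budget_remaining(slo, total_requests, failed_requests) > 0
```

With a 99.9% SLO and 1,000,000 requests in the window, the budget is 1,000 failed requests. After 400 failures, 600 remain and releases can continue; after 1,200 failures the budget is overspent and releases pause.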
### Monitoring
- Alerts
    - require immediate human action to respond to a current or upcoming event to improve the situation
- Tickets
    - require eventual human action; the system cannot automatically handle the situation, but will not incur damage if action isn’t taken immediately
- Logging
    - data collection for diagnostic and forensic purposes
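The three output types above can be read as a decision on two questions: does a human need to act, and how soon? A minimal sketch, where the `Event` fields and the `route` function are hypothetical names chosen for the example:

```python
from dataclasses import dataclass
from enum import Enum


class Output(Enum):
    ALERT = "alert"
    TICKET = "ticket"
    LOG = "log"


@dataclass(frozen=True)
class Event:
    needs_human: bool  # the system cannot handle the situation by itself
    urgent: bool       # damage occurs if action is not taken immediately


def route(event: Event) -> Output:
    """Map a monitoring event to one of the three valid outputs."""
    if not event.needs_human:
        return Output.LOG  # recorded for diagnostic and forensic use only
    return Output.ALERT if event.urgent else Output.TICKET
```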
### Emergency Response
Reliability is a function of mean time to failure (MTTF) and mean time to repair (MTTR)
MTTR is the more relevant metric for emergency response: it measures how quickly a service is returned to health
Humans add latency
A system that experiences more failures but can address them itself will have higher availability than one that requires hands-on intervention
If humans are needed, following a playbook is more effective than “winging it”
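The claim that a self-repairing system can beat a more robust one that needs humans follows from the standard steady-state formula, availability = MTTF / (MTTF + MTTR). A minimal sketch; the function name and the figures in the note below are illustrative assumptions, not measurements.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the service is up."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mttf must be positive and mttr non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)
```

For example, a system that fails every 100 hours but repairs itself in one minute has availability of about 99.98%, while a system that fails only every 1,000 hours but needs two hours of human repair reaches about 99.80%.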
### Change Management
~70% of outages are due to changes in a live system
Mitigate this through:
- progressive rollouts
- quickly and accurately detecting problems
- rolling back changes safely if problems arise
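The three practices above fit together as a loop: widen the rollout in stages, check health after each stage, and roll back at the first sign of trouble. A minimal sketch under stated assumptions: `deploy`, `is_healthy`, and `rollback` are hypothetical callbacks supplied by the caller, and the stage fractions are arbitrary.

```python
from typing import Callable, Sequence


def progressive_rollout(
    stages: Sequence[float],
    deploy: Callable[[float], None],
    is_healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Roll a change out stage by stage; return True if it fully lands.

    stages holds the cumulative fraction of traffic (or machines) that
    should run the change after each step, e.g. [0.01, 0.1, 1.0].
    """
    for fraction in stages:
        deploy(fraction)
        # Detect problems before exposing more of the fleet to the change.
        if not is_healthy():
            rollback()
            return False
    return True
```

Keeping the health check and rollback as injected callbacks leaves the detection signal and the rollback mechanism up to the service in question.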
### Demand Forecasting and Capacity Planning
Ensuring there is enough capacity and redundancy to meet the forecasted demand
Account for:
- organic growth
    - natural product adoption and usage
- inorganic growth
    - feature launches
    - marketing campaigns
Mandatory steps in capacity planning:
- accurately forecasting organic and inorganic demand sources
- regularly load testing systems to correlate raw capacity (servers, disks) with service capacity
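A demand forecast that accounts for both growth sources can be sketched as below. The compounding monthly growth model, the function name `forecast_demand`, and the numbers in the note are assumptions for illustration; real forecasts would be fitted to observed traffic.

```python
from typing import Iterable


def forecast_demand(
    current_qps: float,
    monthly_growth_rate: float,
    months: int,
    inorganic_qps: Iterable[float] = (),
) -> float:
    """Forecast peak demand in queries per second.

    Organic growth is modeled as compounding monthly growth (an assumption).
    Inorganic growth is a list of expected one-off additions, such as a
    feature launch or a marketing campaign.
    """
    if months < 0:
        raise ValueError("months must be non-negative")
    organic = current_qps * (1.0 + monthly_growth_rate) ** months
    return organic + sum(inorganic_qps)
```

For instance, 1,000 QPS growing 10% per month for two months gives 1,210 QPS organically; adding a launch (+300 QPS) and a campaign (+200 QPS) brings the forecast to 1,710 QPS.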
### Provisioning
Adding capacity as a result of change management and capacity planning
Should be done quickly and only when necessary
Includes:
- spinning up a new instance or location
- making significant changes to existing systems
- validating new capacity operates as planned
### Efficiency and Performance
The SRE team controls provisioning, and therefore must be involved in any work on utilization
Utilization is a function of how a service works and how it is provisioned
Resource use is mainly a function of:
- demand (load)
- capacity
- software efficiency
SREs predict or act upon all three
Systems become slower as load is added → slowdown = loss in capacity
SREs provision to meet a capacity target at a specific response speed
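Provisioning to a capacity target at a specific response speed can be sketched in two steps: find the highest load one instance sustains within the latency target, then size the fleet for the forecasted demand. The function names, the `spare` parameter, and the load-test figures in the note are hypothetical.

```python
import math
from typing import Sequence, Tuple


def max_qps_within_target(
    load_test: Sequence[Tuple[float, float]],
    latency_target_ms: float,
) -> float:
    """Highest tested per-instance QPS whose latency meets the target.

    load_test holds (qps, latency_ms) pairs from a load test. Assumption:
    latency does not decrease as load increases, so the largest passing
    QPS is the usable capacity of one instance.
    """
    passing = [qps for qps, latency_ms in load_test if latency_ms <= latency_target_ms]
    if not passing:
        raise ValueError("no tested load level meets the latency target")
    return max(passing)


def instances_needed(demand_qps: float, per_instance_qps: float, spare: int = 0) -> int:
    """Instances required to serve demand, plus optional spare instances."""
    if per_instance_qps <= 0:
        raise ValueError("per_instance_qps must be positive")
    return math.ceil(demand_qps / per_instance_qps) + spare
```

With load-test points (100 QPS, 20 ms), (200, 35), (300, 80), and (400, 250) and a 100 ms target, one instance is good for 300 QPS; past that point the slowdown means the extra load no longer counts as usable capacity. Serving 2,296 QPS then takes 8 instances, or 10 with two spares for redundancy.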
[[The Four Golden Signals]]