Google SRE Book - Davey's Digital Den

# Preface
[[Site Reliability Engineering]]

# Introduction
Benjamin Treynor Sloss, originator of term SRE

> Hope is not a strategy.

Systems do not run themselves, so how should they be run?
- [[Sysadmin Approach to Service Management]]
- [[The Google Approach to Service Management]]

## Tenets of SRE
SREs are responsible for a service’s:
- availability
- latency
- performance
- efficiency
- change management
- monitoring
- emergency response
- capacity planning

### Ensuring a Durable Focus on Engineering
Monitoring of SRE operational work to ensure excess is redirected to product development teams

SRE shouldn’t receive more than two pager events per 8-12 hour shift, allowing the engineer to:
- handle events accurately and properly
- clean up and restore normal service
- conduct a postmortem

More events will degrade the quality of work, preventing problems from being investigated properly, adding to technical debt

Conversely, if SREs consistently get less than one event per shift, why are they on call?
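
The pager-load bounds above can be turned into a quick check. This is a minimal sketch, not something from the book: the function name `assess_pager_load`, the labels it returns, and the reading of "consistently" as "on average" are all assumptions made for illustration.

```python
from statistics import mean

# Bounds taken from the notes: at most two events per shift,
# and at least one per shift on average to justify being on call.
MAX_EVENTS_PER_SHIFT = 2
MIN_AVG_EVENTS_PER_SHIFT = 1


def assess_pager_load(events_per_shift):
    """Classify a rotation from pager-event counts, one count per shift.

    Hypothetical helper: the labels and the use of the mean to model
    "consistently" are assumptions, not definitions from the source.
    """
    if not events_per_shift:
        raise ValueError("need at least one shift")
    # Any single shift above the upper bound counts as overload.
    if any(n > MAX_EVENTS_PER_SHIFT for n in events_per_shift):
        return "overloaded"
    # A low average suggests the rotation may not need to exist.
    if mean(events_per_shift) < MIN_AVG_EVENTS_PER_SHIFT:
        return "underused"
    return "sustainable"


if __name__ == "__main__":
    print(assess_pager_load([1, 2, 2, 1]))
    print(assess_pager_load([0, 3, 1]))
    print(assess_pager_load([0, 0, 1, 0]))
```

A stricter variant could flag overload only when the bound is exceeded on several shifts in a row; which rule fits depends on how a team defines excess operational work.
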

Postmortems should be written for all significant incidents, establishing in detail:
- what happened
- root causes
- corrective actions
- improvements in how to address the problem next time

Fostering a blame-free postmortem culture

Goal of exposing faults and fixing them, rather than avoiding or minimizing them

### Pursuing Maximum Change Velocity Without Violating a Service’s SLO
Conflict between dev and SRE teams: pace of innovation and product stability

[[Error Budget]]
- resolves conflict between dev and SRE teams
- SRE doesn’t promise “zero outages”
- both teams spend error budget getting maximum feature velocity
- outages are no longer feared, but expected and accepted

### Monitoring
- Alerts
    - require immediate human action to respond to a current or upcoming event to improve the situation
- Tickets
    - require eventual human action; the system cannot automatically handle the situation, but will not incur damage if action isn’t taken immediately
- Logging
    - data collection for diagnostic and forensic purposes

### Emergency Response
Reliability is a function of mean time to failure (MTTF) and mean time to repair (MTTR)

MTTR is more relevant for returning a service back to health

Humans add latency

A system that experiences more failures but can address them itself will have higher availability than one that requires hands-on intervention

If humans are needed, using a playbook instead of “winging it” is more effective

### Change Management
~70% of outages are due to changes in a live system

Mitigate this through:
- progressive rollouts
- quickly and accurately detecting problems
- rolling back changes safely if problems arise

### Demand Forecasting and Capacity Planning
Ensuring there is enough capacity and redundancy to meet the forecasted demand

Account for:
- organic growth
    - natural product adoption and usage
- inorganic growth
    - feature launches
    - marketing campaigns

Mandatory steps in capacity planning:
- accurately forecasting organic and inorganic demand sources
- regularly load testing systems to meet forecasted capacity

### Provisioning
Adding capacity as a result of change management and capacity planning

Should be done quickly and only when necessary

Includes:
- spinning up a new instance or location
- making significant changes to existing systems
- validating new capacity operates as planned

### Efficiency and Performance
SRE team controls provisioning, and therefore must be involved in any work on utilization

Utilization is a function of how a service works and how it is provisioned

Resource use is mainly a function of:
- demand (load)
- capacity
- software efficiency

All of which SREs predict or act upon

Systems become slower as load is added → slowdown = loss in capacity

SREs provision to meet a capacity target at a specific response speed

[[The Four Golden Signals]]
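
As a worked illustration of the Emergency Response point that a system which fails more often but repairs itself can be more available than one needing hands-on intervention, the sketch below uses the standard steady-state formula availability = MTTF / (MTTF + MTTR). The notes do not spell this formula out, and the MTTF and MTTR figures are hypothetical numbers chosen only to show the effect.

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the service is up."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical figures, not taken from the source.
# Self-healing system: fails every 100 hours, recovers in 3 minutes.
self_healing = availability(mttf_hours=100, mttr_hours=0.05)
# Manually repaired system: fails every 500 hours, takes 2 hours to fix.
manual = availability(mttf_hours=500, mttr_hours=2)

if __name__ == "__main__":
    print(f"self-healing: {self_healing:.4%}")  # roughly 99.95%
    print(f"manual:       {manual:.4%}")        # roughly 99.60%
```

With these assumed numbers the self-healing system fails five times as often yet is still the more available of the two, because shrinking MTTR outweighs the shorter MTTF.
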