## The Google Approach to Service Management
Software engineers perform sysadmin functions
What is SRE?
It’s what happens when you ask a software engineer to design an operations team.
Composition:
- 50-60% Google software engineers
- 40-50% candidates close to Google software engineer qualifications with strong sysadmin skills (i.e. UNIX system internals and networking)
Want a team of people that:
1. Get bored of doing things by hand
2. Can write software to automate things by hand
Focus on constant engineering:
- without it, operations load increases, more staff needed to keep up
- ops group will grow linearly with service size, then with traffic
Google places 50% cap on ops work for all SREs (tickets, manual tasks, etc.) to ensure time is spent writing software keeping the service stable and operable
Cap is an upper bound, ops work should become self-sustainable and SREs should almost entirely work on development tasks
Google has found SRE approach works well for running large-scale systems.
Goal is to make Google’s systems run themselves, so teams invite:
- rapid innovation
- large acceptance of change
Number of SREs scales sublinearly with system size
Easy transfers between dev and ops teams due to shared skillsets
Difficulties: limited talent pool, SRE and development candidates overlap
[[DevOps or SRE?]]