What is one difference between a small company and a big one?
The big one knows that “Everything fails all the time”.
There is nothing to hide; we just need to be aware of it.
When I studied for the AWS Solutions Architect certification, I was surprised by how often the documentation repeats that infrastructure fails, applications fail, everything fails.
Google introduced the concept of an “Error Budget” to ease the conflict between stability and innovation.
In a company that provides services, the product team must establish the system’s availability target based on an analysis of its users.
What level of availability will still satisfy the users? What happens to users’ usage when an outage occurs?
An outage must not be just an occasion to assign blame; it should be an expected part of the process.
What is presented to users as an SLA (Service Level Agreement) should be seen by engineers as an “Error Budget”: the amount of unavailability that the availability target still allows.
There is actually something positive in thinking about an error budget. You are allowed to fail, and you can decide where to take risks and spend your error budget.
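To make the idea concrete, here is a minimal sketch of the arithmetic behind an error budget. It assumes availability is measured as uptime over a fixed window. The function names, the 30-day window and the 99.9% target are illustrative assumptions of mine, not values from the book.

```python
def error_budget_minutes(availability_target: float, period_days: int = 30) -> float:
    """Return the downtime, in minutes, that the target allows over the period.

    availability_target is a fraction between 0 and 1 (for example 0.999).
    The 30-day default window is an assumption made for this example.
    """
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    total_minutes = period_days * 24 * 60
    # The budget is whatever the target does not promise: 1 - target.
    return (1.0 - availability_target) * total_minutes


def remaining_budget_minutes(
    availability_target: float, downtime_minutes: float, period_days: int = 30
) -> float:
    """Return how much of the budget is left after the downtime already spent.

    A negative result means the budget for this period is exhausted.
    """
    return error_budget_minutes(availability_target, period_days) - downtime_minutes


if __name__ == "__main__":
    budget = error_budget_minutes(0.999)
    left = remaining_budget_minutes(0.999, downtime_minutes=30)
    print(f"Budget for 30 days at 99.9%: {budget:.1f} minutes")
    print(f"Left after a 30-minute outage: {left:.1f} minutes")
```

With these assumed numbers, a 99.9% target over 30 days allows roughly 43 minutes of downtime. A 30-minute outage leaves about 13 minutes for risky releases in the rest of the window.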
Post inspired by “Site Reliability Engineering: How Google Runs Production Systems” by Niall Richard Murphy, Betsy Beyer