Responding to Operator Errors
In late February 2017, Amazon Web Services (AWS) suffered a disruption in its US-EAST-1 region that was caused by an operator error. The root cause was identified as a typo in the input to one of the operational tools, entered by an authorized operator. The outage lasted about 4 hours and brought down a number of large services. Interestingly, AWS's own status dashboard, which depended on the affected storage service, was also impacted (AWS Status Update).
We regard Amazon as one of the top-notch, highly efficient cloud engineering organizations. It has published multiple case studies and best practices, refined from the way it operates the AWS cloud. For this specific outage, Amazon published a detailed report along with corrective actions.
This note touches on two aspects:
- Operator-caused outages can be reduced (eliminating operator errors entirely is a lofty, north-star goal)
- The cultural aspects of responding to operator-caused outages
Reducing Operator-Caused Outages
We provide production services for large cloud services on behalf of our enterprise and service provider customers. Critical operator tasks are always part of this work: mitigating an outage, performing routine maintenance, deploying software to production. We have had outages caused by operator errors during these tasks, and from those lessons we developed a high-fidelity risk scoring method. With this scorecard, we identify high-risk standard operating procedures (SOPs) and eliminate or automate them. The remaining high-risk SOP tasks that must be executed manually are handled by experts under peer review.
With this approach, we have been able to reduce the production outages caused by operator errors.
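The scorecard idea can be illustrated with a minimal sketch. The actual scoring method is not detailed in this note, so everything below is an assumption made for illustration: the factors (blast radius, irreversibility, number of manual steps), the 1 to 5 scales, the multiplicative formula and the threshold of 30 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sop:
    """One standard operating procedure, rated on hypothetical scales."""
    name: str
    blast_radius: int     # 1 (single host) .. 5 (whole region); assumed scale
    irreversibility: int  # 1 (trivial rollback) .. 5 (no rollback); assumed scale
    manual_steps: int     # number of hand-typed commands or inputs
    automatable: bool     # can the task be scripted or removed altogether?


HIGH_RISK_THRESHOLD = 30  # hypothetical cut-off, not taken from the note


def risk_score(sop: Sop) -> int:
    # Multiply the factors so that a wide blast radius combined with a
    # hard-to-undo action dominates the score; cap manual steps at 5.
    return sop.blast_radius * sop.irreversibility * min(sop.manual_steps, 5)


def disposition(sop: Sop) -> str:
    """Map a score to the three outcomes described in the text."""
    if risk_score(sop) < HIGH_RISK_THRESHOLD:
        return "standard-manual"
    if sop.automatable:
        return "eliminate-or-automate"
    return "manual-by-expert-under-peer-review"


sops = [
    Sop("restart one web host", 1, 1, 2, True),
    Sop("remove capacity from storage fleet", 5, 4, 3, True),
    Sop("fail over primary database", 4, 4, 4, False),
]

for sop in sops:
    print(f"{sop.name}: score={risk_score(sop)}, action={disposition(sop)}")
```

A multiplicative score is only one possible choice. A weighted sum or a lookup table agreed with the operations team would serve equally well, as long as the high-risk SOPs end up clearly separated from routine ones.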
Responding to an Operator-Caused Outage
Despite our best efforts, we still need to be prepared to respond to an operator-caused outage. The immediate focus is to mitigate the outage and restore the service. Once the service is restored, the focus shifts to the post mortem exercise. We have seen cases where severe penalties were levied on individuals or teams for their outages. This punitive approach creates a sense of fear, which results in failures being covered up.
We have adopted some of the industry best practices and implemented blameless post mortems. Instead of zooming in on the individual at fault, we shift the focus to the timestamps of the actual events and their impact, systemic vulnerabilities and process gaps. Developing solutions based on these lessons and repair items has helped us reduce operator-caused outages substantially.
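The timeline-first framing of a blameless post mortem can be sketched as a small record that stores what happened and when, and deliberately has no field for who did it. The event kinds, field names and sample timestamps below are hypothetical, chosen only to show how time to mitigate falls out of the timeline.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TimelineEvent:
    """One entry in a post mortem timeline; note there is no 'person' field."""
    at: datetime
    kind: str  # hypothetical kinds: "impact_start", "detected", "mitigated"
    note: str  # what the system or process did, not who was at fault


def first(events: list[TimelineEvent], kind: str) -> datetime:
    # Earliest timestamp of the given kind; raises ValueError if none exists.
    return min(e.at for e in events if e.kind == kind)


def time_to_detect(events: list[TimelineEvent]) -> timedelta:
    return first(events, "detected") - first(events, "impact_start")


def time_to_mitigate(events: list[TimelineEvent]) -> timedelta:
    return first(events, "mitigated") - first(events, "impact_start")


# Made-up sample timeline for illustration only.
timeline = [
    TimelineEvent(datetime(2020, 1, 15, 10, 0), "impact_start",
                  "Tool accepted an input that removed too much capacity"),
    TimelineEvent(datetime(2020, 1, 15, 10, 7), "detected",
                  "Error-rate alarm fired"),
    TimelineEvent(datetime(2020, 1, 15, 10, 52), "mitigated",
                  "Capacity restored and error rate back to baseline"),
]

print("Time to detect:", time_to_detect(timeline))
print("Time to mitigate:", time_to_mitigate(timeline))
```

Keeping the record free of names nudges the review toward systemic vulnerabilities and process gaps, such as a tool that accepts a dangerous input without a safeguard.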
Below is one of our success stories. As part of our production support for a large cloud service, we are subject to an SLA penalty for outages caused by operators.
We used to have an outage almost every other month. By adopting the best practices detailed in this note, we have eliminated operator-caused outages for the last 2 years.
Exciting times lie ahead for operators as more and more enterprise and government workloads move to the cloud. As operators, we continue to prepare for outages and work towards reducing the time to mitigate (TTM).