
Author: Ananthanatarajan Muthusamy |11/22/17

Monitoring large scale distributed computing services

When was initially launched to the public on 1st October 2013, it had minimal or no monitoring set up[1]. Most of the TV networks in the US were monitoring and reporting on the website’s health live. This resonates with a conversation I had with the CEO of a large payroll company. The CEO would get calls directly from enterprise customers about their broken application. His internal team could not meet even his minimum expectation: raising an alert when the service was impacted.

Today, Twitter has become the primary public channel for real-time monitoring and reporting of the availability of large cloud services. Enterprises are challenged to detect and mitigate an outage before it becomes news.

This blog post focuses on monitoring KPIs, a framework for monitoring, tuning monitoring configurations on an ongoing basis, and the changing role of the NOC engineer.


Monitoring became a key function as traditional on-premise services transitioned to 24x7, always-available online cloud services. Failure of a service in a traditional environment is localized and limited to the users within a single enterprise. Failure of a cloud service like Salesforce or Slack, by contrast, has a global impact, affecting millions of enterprise users.

Monitoring Goals & Metrics

Effective monitoring and reporting result in proactive identification of potential degradation or failure, enabling operations to mitigate it before customers notice.

Here are the top goals for monitoring:

  1. Alert on a potential customer-impacting outage
  2. Provide sufficient data and metrics to diagnose the issue, then develop, apply and verify a successful mitigation
  3. Provide sufficient data and metrics for trending activities, analytics, building permanent fixes, reconfiguring and monitoring. Mature organizations use this data to train their monitoring system for predictive alerting
  4. Report regularly that the services are running healthy and that metrics are within thresholds

Here are the top monitoring metrics:

  1. Detection Time (time to detect, TTD) - Time, in minutes, taken to detect and report an anomaly
  2. Monitoring Misses - Customer-impacting outages that monitoring did not detect and that customers reported instead
  3. Noise - Monitoring events that were qualified as an incident and later closed as noise

The last two KPIs pull in opposite directions, but together they help drive towards an appropriate alert volume. If you have more monitoring misses, your service will suffer adverse market perception, resulting in customer churn. If you loosen your monitoring thresholds so that more events fire, you will be flooded with a large volume of events, making it hard for operations to qualify them and to work on the high-severity events that could result in customer impact.
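As a rough illustration, the three metrics above could be computed from a list of incident records. This is a minimal sketch, not a prescribed implementation: the `Incident` record, its field names and the `monitoring_kpis` function are assumptions made for the example, and detection time is averaged only over incidents that monitoring caught and that were not closed as noise.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    # Hypothetical record layout, assumed for this sketch.
    started_at: datetime             # when the anomaly began
    detected_at: Optional[datetime]  # when monitoring raised it; None if missed
    customer_reported: bool          # True if customers reported the outage
    closed_as_noise: bool            # True if later closed as noise


def monitoring_kpis(incidents):
    """Return mean detection time (minutes), monitoring misses and noise."""
    real = [i for i in incidents
            if i.detected_at is not None and not i.closed_as_noise]
    minutes = [(i.detected_at - i.started_at).total_seconds() / 60
               for i in real]
    return {
        "mean_detection_minutes": sum(minutes) / len(minutes) if minutes else None,
        "monitoring_misses": sum(1 for i in incidents
                                 if i.detected_at is None and i.customer_reported),
        "noise": sum(1 for i in incidents if i.closed_as_noise),
    }


# Example data, invented for illustration only.
start = datetime(2017, 11, 22, 10, 0)
sample = [
    Incident(start, start + timedelta(minutes=5), False, False),
    Incident(start, start + timedelta(minutes=15), False, False),
    Incident(start, None, True, False),                          # a monitoring miss
    Incident(start, start + timedelta(minutes=1), False, True),  # noise
]
print(monitoring_kpis(sample))
```

Tracking misses and noise side by side in one report makes the trade-off between them visible each time a threshold changes.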

Monitoring Framework

Monitoring has evolved from traditional “internal monitoring” to include “external” customer signals as well. A sample representation of a monitoring framework is presented below.

[Figure: Monitoring Framework]

Configuring Monitoring Parameters & Thresholds Dynamically

Traditionally, monitoring parameters for infrastructure (such as CPU, memory and disk space) and their thresholds had simple, standard definitions. With new services, engineered primarily as software running on software-driven infrastructure, monitoring parameters and thresholds are customized.

Primary responsibility for defining and fine-tuning the monitoring parameters and thresholds has now moved to engineering teams. Based on alert volume and noise, Operations should aggregate actionable input for Engineering so that monitoring parameters and thresholds can be modified in a granular way on an ongoing basis. Operations can influence effective monitoring by continuously tracking, reporting and improving the top KPIs (Detection Time, Monitoring Misses and Noise) defined earlier.
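One common way to make a threshold adapt to a service, rather than hard-coding it, is to derive it from recent history. The sketch below assumes a simple "mean plus k standard deviations" rule. The function names and the sensitivity factor `k` are illustrative assumptions, not something this post prescribes, and real services may need seasonality-aware baselines instead.

```python
from statistics import mean, pstdev


def dynamic_threshold(history, k=3.0):
    """Threshold derived from recent samples: mean + k * standard deviation.

    `k` is a hypothetical tuning knob: raising it reduces noise but risks
    more monitoring misses; lowering it does the opposite.
    """
    return mean(history) + k * pstdev(history)


def is_anomalous(history, value, k=3.0):
    """True if `value` exceeds the threshold computed from `history`."""
    return value > dynamic_threshold(history, k)


# Example: recent request latencies in milliseconds (invented numbers).
latencies = [100, 102, 98, 101, 99]
print(dynamic_threshold(latencies))
print(is_anomalous(latencies, 103), is_anomalous(latencies, 110))
```

In this framing, the feedback Operations sends to Engineering amounts to evidence for moving `k` (or a similar knob) up when noise dominates and down when misses dominate.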

Automation & Changing Role of a Monitoring Engineer

The role of an NOC engineer used to be limited to watching a couple of monitoring consoles (“eyes on glass”), looking for red and yellow events on the screens. Now, the role has transformed into managing and mitigating production outages in real time. Automation is leveraged to perform the functions that the NOC engineer previously delivered manually; these functions include:

  • Detection and creation of an alert in a ticketing system. The NOC engineer need not watch multiple monitoring consoles, and can focus only on the ticketing system
  • Triaging an incoming alert and qualifying its severity
  • Gathering investigation logs and initial troubleshooting
  • Notifying affected users and stakeholders
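The automated steps listed above can be sketched as a small pipeline that turns an alert into a ticket. Everything here is hypothetical and for illustration only: the `Alert` and `Ticket` shapes, the severity rules in `qualify_severity`, and the `fetch_logs` callable that stands in for whatever log source an organization actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    service: str
    metric: str
    value: float
    threshold: float
    regions_affected: int


@dataclass
class Ticket:
    alert: Alert
    severity: str
    logs: list = field(default_factory=list)
    notified: list = field(default_factory=list)


def qualify_severity(alert):
    """Assumed triage rules; a real policy would be organization-specific."""
    if alert.regions_affected >= 3:
        return "SEV1"
    if alert.value >= 2 * alert.threshold:
        return "SEV2"
    return "SEV3"


def handle_alert(alert, fetch_logs, stakeholders):
    """Create a ticket, qualify severity, attach logs, record who to notify."""
    ticket = Ticket(alert=alert, severity=qualify_severity(alert))
    ticket.logs = fetch_logs(alert.service)      # gather investigation logs
    if ticket.severity in ("SEV1", "SEV2"):      # notify only on high severity
        ticket.notified = list(stakeholders.get(alert.service, []))
    return ticket


# Usage with stand-in data; the address and log line are placeholders.
alert = Alert("checkout", "error_rate", 0.30, 0.10, regions_affected=3)
ticket = handle_alert(
    alert,
    fetch_logs=lambda service: [f"{service}: log line"],
    stakeholders={"checkout": ["oncall@example.com"]},
)
print(ticket.severity, ticket.notified)
```

With these steps automated, the ticket arrives already qualified and populated, leaving the engineer to validate the result.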

An NOC Engineer is now focused on the following tasks:

  • Validating the alert triage and qualification
  • Qualifying the impact – scale (local, regional, global), services and customers
  • Correlating events with signals from customers, outside-in and inside-out sources
  • Enriching events with situational awareness on change deployments
  • Mitigating the services or engaging with teams who can mitigate services
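Enriching an event with situational awareness on change deployments can be as simple as listing what was deployed to the affected service shortly before the alert fired. The sketch below assumes deployments are available as dictionaries with `service` and `deployed_at` keys and uses a two-hour look-back window; both are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def recent_changes(alert_time, service, deployments, window=timedelta(hours=2)):
    """Return deployments to `service` within `window` before `alert_time`."""
    return [
        d for d in deployments
        if d["service"] == service
        and timedelta(0) <= alert_time - d["deployed_at"] <= window
    ]


# Invented deployment history for the example.
alert_time = datetime(2017, 11, 22, 12, 0)
deployments = [
    {"service": "checkout", "deployed_at": datetime(2017, 11, 22, 11, 30)},
    {"service": "checkout", "deployed_at": datetime(2017, 11, 22, 8, 0)},
    {"service": "search", "deployed_at": datetime(2017, 11, 22, 11, 45)},
    {"service": "checkout", "deployed_at": datetime(2017, 11, 22, 12, 10)},
]
print(recent_changes(alert_time, "checkout", deployments))
```

A recent change to the same service is not proof of cause, but it is often the first thing worth checking during mitigation.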

Extreme Monitoring

Enterprises have moved from “reacting to systemic failures” to “injecting failures in production”. Netflix has revolutionized this approach: the company engineered a service called “Chaos Monkey” that has randomly killed its AWS production servers since 2009. The stated objective was to ensure that Netflix services recover automatically, without manual intervention. With “Chaos Monkey”, Netflix ensured that its services could deliver zero downtime, successfully mitigating the impact of two large AWS outages in 2011 and 2014[2].
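The idea of failure injection can be shown with a toy simulation. This is not Netflix's Chaos Monkey and makes no AWS calls: the `Fleet` class, its `reconcile` method and `inject_failure` are hypothetical stand-ins for an auto-recovering service and a failure injector.

```python
import random


class Fleet:
    """In-memory stand-in for a group of servers with a desired size."""

    def __init__(self, desired_size):
        self.desired_size = desired_size
        self.instances = {f"i-{n}" for n in range(desired_size)}
        self._next_id = desired_size

    def terminate(self, instance_id):
        self.instances.discard(instance_id)

    def reconcile(self):
        """Automatic recovery: replace missing instances, no human involved."""
        while len(self.instances) < self.desired_size:
            self.instances.add(f"i-{self._next_id}")
            self._next_id += 1


def inject_failure(fleet, rng):
    """Kill one randomly chosen instance and return its id."""
    victim = rng.choice(sorted(fleet.instances))
    fleet.terminate(victim)
    return victim


fleet = Fleet(desired_size=4)
killed = inject_failure(fleet, random.Random(0))
fleet.reconcile()
print(killed in fleet.instances, len(fleet.instances))
```

The check that matters after each injected failure is that capacity returns to the desired size without anyone intervening.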


Digital monitoring and evaluation solutions help deliver nearly 100% service availability and retain customers in this subscription economy, as validated by the 2015 State of DevOps Report[3]. Monitoring has helped high-performing enterprises mitigate production outages 168 times faster than their peers, with Mean Time to Recover (MTTR) measured in minutes.

Interesting times are ahead, as monitoring now focuses on predicting potential outages by leveraging machine learning. What are your experiences in detecting a potential outage before your customers could tweet about it on social media? Please share your experiences in the comments section or write to us at to schedule a discussion with our connoisseurs.

[2] “The DevOps Handbook”, Ch.19, pg.281


