Self-healing IT systems: Less of a choice, more a mandatory requirement
Imagine, for a moment, that you had to run, maybe to catch a bus. Your heart rate increases to enable you to run faster. This is normal. But what if your heart rate doesn’t come down even after you’re seated comfortably inside the bus? That could prove to be fatal. Thankfully, the body’s self-healing mechanism comes to the rescue here. It detects any abnormality in the body’s functioning and restores it to normal.
Here, self-healing is a built-in mechanism that keeps the system in order. Modern-day IT systems are somewhat similar to the human body in terms of complexity and the need for uninterrupted uptime. They comprise multiple systems and interconnections, all of which need to be available 24x7 for seamless operations. This, as any IT expert will tell you, is simply not possible. IT systems cannot be perfect all the time and do deviate from normal functionality every now and then. But what does this mean for your business?
The need for self-healing systems: The business cost of system downtime
In the aforementioned example, if your body’s self-healing response doesn’t kick in, the results can be disastrous. The same is the case with IT systems. Without a self-healing mechanism in place, organizations are bound to face system downtime. Mission-critical applications might malfunction or fail to run entirely. This leads to hampered productivity, heavy data losses and lost business opportunity. Recovering from a major system failure can also cost a lot of money, sometimes exceeding the cost of the system itself!
And that’s not all. System unavailability also causes indirect losses: previously loyal customers may move to direct competitors, and the market reputation of your business can suffer. It can even end up eroding shareholder value. A 2014 report by EMC estimated that global enterprises lost more than USD 1.7 trillion every year due to data loss and downtime. Given how much more reliant the world is on technology now, the costs of an outage today would be much higher. It therefore becomes critical to ensure that IT systems have self-healing mechanisms in place to correct any aberrations in real time. Just like your body did when you had to catch the bus.
Self-healing in IT systems: Understanding the basics
In the IT world, self-healing systems are described as “any device or system that has the ability to perceive that it is not operating correctly and, without external assistance, make the necessary adjustments to restore itself to normal operation”. What this essentially means is that a self-healing system can proactively monitor and identify a potential variance from its standard parameters, validate it with a degree of confidence and resume normal operations without human intervention.
A self-healing system, at the very least, has the following three components:
A system, which is always expected to be up and running. Ideally, the system doesn’t break down and behaves normally without any external assistance. It can be anything within the IT infrastructure, from an application, a third-party API or hardware to the network itself.
A monitoring & discovery mechanism, which continuously monitors the system to ensure that it is working normally and reports any deviation from its expected behaviour. It knows the ideal or permissible range of each metric for the IT system it is monitoring. It includes server monitoring, network monitoring, database monitoring, log monitoring and application performance monitoring, amongst other tools.
A restore protocol, which takes the necessary steps to bring the system back to normal functionality without external assistance. It can vary from simple scripts to sophisticated bots. Basically, it is any software that has the ability to restore or repair the malfunctioning system.
The following diagram gives a visual representation of the three components and how they interact with each other:
To better understand how these three components can work together to bring about self-healing in an IT system, let us take a practical example.
Consider an AEM (Adobe Experience Manager) system which has been deployed for content management. During the first week of every month, content authors are very active, authoring and publishing their content in AEM forms. You observe that, quite often, CPU utilization becomes high due to the high volume of transactions. This often results in the AEM systems being down or inaccessible, interrupting the content authors and eventually impacting the business. This is the system and the failure that it faces.
Here, you can monitor your AEM system’s performance by deploying a PowerShell script which checks CPU utilization every 10 minutes and measures the variance from predefined healthy parameters. Let us assume that the permissible CPU utilization limit defined in the PowerShell script is 80%, and anything higher is reported as a variance. The script should also be able to validate the variance with a certain degree of confidence, so that a momentary spike is not mistaken for a failure. In this case, that can be done easily by verifying that CPU usage has stayed above the 80% threshold across consecutive 2-minute readings. Once the variance is validated, the PowerShell script should also know what needs to be done to trigger the restore service. This is the monitoring & discovery mechanism of the self-healing system.
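The validation step described above can be sketched in a few lines. This is only an illustration in Python rather than the PowerShell script the example assumes; the class name `VarianceDetector` and the choice of two consecutive breaches are assumptions, and the readings are passed in by hand instead of being read from a real server.

```python
from dataclasses import dataclass, field


@dataclass
class VarianceDetector:
    """Confirms a CPU variance only after several consecutive breaches."""

    threshold_pct: float = 80.0   # permissible CPU limit from the example
    required_breaches: int = 2    # assumption: two readings in a row confirm it
    _streak: int = field(default=0, init=False, repr=False)

    def observe(self, cpu_pct: float) -> bool:
        """Record one reading; return True once the variance is validated."""
        if cpu_pct > self.threshold_pct:
            self._streak += 1
        else:
            # A healthy reading resets the count, so a lone spike is
            # treated as noise rather than as a failure.
            self._streak = 0
        return self._streak >= self.required_breaches


if __name__ == "__main__":
    detector = VarianceDetector()
    for reading in [72.0, 91.0, 65.0, 88.0, 93.0]:
        if detector.observe(reading):
            print(f"Variance validated at {reading}% CPU: trigger restore")
```

The single spike to 91% is ignored because the next reading is healthy; only the two consecutive breaches at the end would trigger the restore service.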
The restore service, here, could simply be a script which can collect the necessary log information and restart the AEM services in a controlled way within the clustered environment. This prevents system outage and large-scale service disruption, thus minimizing the loss of business and consumer goodwill.
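A controlled restart of this kind might look like the following sketch. It is not an AEM API: `collect_logs`, `restart` and `is_healthy` are hypothetical callables that you would supply for your own environment, and the node names are made up. The point it illustrates is restarting one cluster node at a time and stopping as soon as a node fails to come back.

```python
from typing import Callable, Iterable, List


def rolling_restart(
    nodes: Iterable[str],
    collect_logs: Callable[[str], None],
    restart: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Restart nodes one at a time and return those that recovered.

    Stops at the first node that is still unhealthy after its restart,
    so the untouched nodes keep serving traffic.
    """
    recovered: List[str] = []
    for node in nodes:
        collect_logs(node)  # capture diagnostics before the restart clears state
        restart(node)
        if not is_healthy(node):
            break  # leave the remaining nodes alone and escalate to a human
        recovered.append(node)
    return recovered


if __name__ == "__main__":
    events: List[str] = []
    done = rolling_restart(
        ["node-1", "node-2"],  # hypothetical cluster members
        collect_logs=lambda n: events.append(f"logs:{n}"),
        restart=lambda n: events.append(f"restart:{n}"),
        is_healthy=lambda n: True,
    )
    print(done, events)
```

Halting on the first failed restart is a design choice: it trades a slower recovery for the assurance that an automated fix never takes the whole cluster down at once.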
What kind of self-healing should you opt for: Different types of self-healing systems
Building self-healing systems has many challenges, such as the ability of the monitoring system to differentiate between real failures and noise. Restore mechanisms may also offer only partial means of recovery, which means they may not be completely reliable. Moreover, seamless recovery from failures requires sophisticated DevOps processes managed over clustered environments. Based on these factors, we can classify self-healing systems into three categories: Level 1, Level 2 and Level 3.
Level 1 systems are not very complex. Here, rule-based solutions are defined, based on which monitoring and discovery tools can trigger restore services. Most of the time, at this level, the three mechanisms are not fully connected and may require human intervention to validate and restore. Examples include simple server restarts and configuration changes.
Level 2 systems are of medium complexity. Monitoring and discovery tools help to monitor the IT system, gather information from various sources and show various metrics on a single dashboard for better visualization. However, what may be lacking is the identification of the root cause (discovering the problem with a certain degree of confidence) and the restore mechanism (the self-healing part). An independent DevOps pipeline may exist, which is often triggered manually after confirming the failure.
Choosing the kind of self-healing to opt for requires you to understand the criticality of the system in question and to know what level of intervention would be required. At all these levels, systems can be configured for preventive or reactive self-healing. Reactive self-healing, as in the AEM example above, responds once a variance has been detected. Preventive self-healing, as the name suggests, is used to predict the probable problems the IT system may face in the future and take steps to avoid them. An example of this approach is dynamically monitoring the load and traffic of the system, predicting the load based on historical data and the traffic for that day and time, and scaling the system up or down as required. This not only mitigates the risk of system failure, but also optimizes resources based on the business requirement.
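As a rough illustration of the preventive approach, the sketch below forecasts load for an hour of the day from past observations and works out how many instances would be needed. The function names, the use of a simple average as the forecast and the figures in the usage example are all assumptions; a production system would likely use a richer forecasting model.

```python
import math
from statistics import mean
from typing import Dict, List


def predict_load(history: Dict[int, List[float]], hour: int) -> float:
    """Forecast requests per second for an hour of day.

    The forecast is simply the mean of past observations for that hour.
    """
    samples = history.get(hour, [])
    return mean(samples) if samples else 0.0


def instances_needed(
    predicted_load: float,
    capacity_per_instance: float,
    min_instances: int = 1,
) -> int:
    """Number of instances required to serve the predicted load."""
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    return max(min_instances, math.ceil(predicted_load / capacity_per_instance))


if __name__ == "__main__":
    # Hypothetical history: requests per second seen at 09:00 and 03:00.
    history = {9: [400.0, 500.0, 600.0], 3: [40.0, 60.0]}
    for hour in (9, 3):
        load = predict_load(history, hour)
        print(hour, load, instances_needed(load, capacity_per_instance=200.0))
```

With these made-up numbers the system would scale up to three instances ahead of the morning peak and back down to one overnight, which is the resource optimization the paragraph describes.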
With the proliferation of IT systems, mechanisms of self-healing are no longer optional. They require major transformations in the way systems are designed and built. The advantages, however, outweigh the effort that is put in upfront. By enabling early detection and restoration of the system, self-healing systems can significantly reduce the mean time to restore (MTTR). The automation of most monitoring processes can also result in considerable cost reduction by reducing the number of IT tickets generated, as well as lowering the delay in servicing.
In addition to tangible benefits, implementing self-healing systems can positively impact many intangible aspects such as client satisfaction and employee satisfaction. The reduction in system downtime means that enterprises can focus more on their actual business than on managing IT challenges, thus improving the consistency of their service delivery. This, more than anything else, is the need of the hour in today’s technology-driven world.
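To track whether self-healing is actually paying off, MTTR can be computed from incident records as the average time between detection and restoration. The sketch below assumes each incident is stored as a pair of timestamps; the function name and the sample data are illustrative only.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mean_time_to_restore(
    incidents: List[Tuple[datetime, datetime]],
) -> timedelta:
    """Average of (restored - detected) across all incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (restored - detected for detected, restored in incidents),
        timedelta(),  # start value so that timedeltas can be summed
    )
    return total / len(incidents)


if __name__ == "__main__":
    # Hypothetical incidents: one restored in 30 minutes, one in 90 minutes.
    incidents = [
        (datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 9, 30)),
        (datetime(2024, 1, 5, 14, 0), datetime(2024, 1, 5, 15, 30)),
    ]
    print(mean_time_to_restore(incidents))
```

Comparing this figure before and after automating the restore step gives a simple, concrete measure of the improvement.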