Virtualization Technology News and Information
Critical Capabilities to Look for in AIOps Tools


By George Thangadurai, CEO, HEAL Software Inc.

The COVID-19 pandemic has brought into sharp focus the need to cut data center costs while maximizing application uptime. Artificial intelligence for IT operations (AIOps) tools have proven to be the way forward for enterprises to streamline their operations and minimize costs. The pandemic has had a huge impact on businesses and their operations: individuals are working from home using remote collaboration tools, network operations centers are functioning with skeleton staff, and online businesses are seeing rapidly increasing traffic. Consequently, downtime is unacceptable, and incident resolution should be as quick and painless as possible.

AIOps tools provide observability across silos in heterogeneous environments, proactive detection of anomalies, and event correlation that aids rapid root cause analysis and reduces mean time to resolve (MTTR). They also derive insights and business intelligence from the huge volumes of telemetry and transaction tracing data that they capture. In this article, we look at some critical capabilities that tools need in order to progress beyond the traditional role expected of IT operations teams and instead achieve the Holy Grail of a zero-downtime enterprise.

#1: Moving from "Break-and-Fix" to Preventive Healing

Many modern AIOps tools respond only after the fact. In this model, artificial intelligence and machine learning (AI/ML) techniques detect issues once they have occurred, and remediation is then automated with digital workflows through IT service management (ITSM) integrations. Preventive healing systems go a step further and predict issues before they occur, using patented techniques like workload-behavior correlation. This technique considers the effect that a certain workload signature has on underlying system resources and flags those transaction patterns that are likely to cause an issue imminently. Once a potential issue has been flagged, it can be averted in one of two ways: by dynamically optimizing, or "shaping," the workload so the underlying system behavior remains unaffected, or by dynamically provisioning additional resources in cloud environments so the system can handle workload surges. With these techniques in place, the enterprise can truly move toward zero downtime.
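To make the idea of relating workload to resource behavior concrete, here is a minimal Python sketch. It is not the patented workload-behavior correlation technique mentioned above; it simply assumes that CPU usage grows roughly linearly with transaction rate, fits that line from past observations, and flags an expected workload whose predicted CPU usage crosses a limit. The function names, the 85% limit and the sample numbers are all hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def flag_risky_workload(history, upcoming_tps, cpu_limit=85.0):
    """Predict CPU usage for an expected workload and flag it if too high.

    history: list of (transactions_per_second, cpu_percent) observations.
    cpu_limit: hypothetical threshold above which an incident is considered likely.
    """
    xs = [tps for tps, _ in history]
    ys = [cpu for _, cpu in history]
    slope, intercept = fit_line(xs, ys)
    predicted_cpu = slope * upcoming_tps + intercept
    return predicted_cpu, predicted_cpu > cpu_limit


# Made-up observations: (transactions per second, CPU %)
history = [(100, 20.0), (200, 35.0), (300, 50.0), (400, 65.0)]

predicted_cpu, at_risk = flag_risky_workload(history, upcoming_tps=600)
if at_risk:
    print(f"Predicted CPU {predicted_cpu:.0f}%: shape the workload or add capacity")
```

A real tool would model many resources and workload signatures at once; the point here is only that the flag is raised from the forecast workload, before the system is actually stressed.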

#2: Making Sense of Collected Data

Most AIOps and application performance management (APM) tools are riding the next wave of monitoring: observability. Gartner's latest report, "Innovation Insight for Observability," addresses how organizations can use telemetry data captured from various sources to support the health, agility and innovation of applications, and to help DevOps and site reliability engineering (SRE) teams reduce application downtime, minimize incident resolution effort and improve overall customer experience. Observability gives development, operations and reliability engineers a common platform for interpreting system state and behavior. It is imperative that this data is used effectively, and its most significant application is to aid and accelerate root cause analysis. The time-synchronized data that the AIOps tool captures during an incident should describe the state of the system in enough diagnostic detail to allow IT operations analysts to establish the chain of causation, accurately pinpoint the origin of the failure and take steps to address it. Such data would typically include logs, forensics, query-level statistics from the database, code-level tracing and instrumentation, and configuration change tracking. This helps teams significantly reduce MTTR on issues that cannot be predicted in advance, such as hardware glitches, network and storage outages, or the unavailability of third-party dependencies like APIs and payment gateways.
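As a rough illustration of what time-synchronized diagnostic data can look like in practice, the Python sketch below merges events from several sources into one ordered timeline around an incident. The source names, messages, timestamps and the five-minute window are invented for the example, and it assumes each source is already sorted by time with clocks that agree.

```python
from datetime import datetime, timedelta
from heapq import merge


def incident_timeline(sources, incident_at, window=timedelta(minutes=5)):
    """Merge per-source events near an incident into one time-ordered list.

    sources: dict mapping a source name to a time-sorted list of
             (timestamp, message) tuples.
    Returns a list of (timestamp, source, message) tuples.
    """
    start, end = incident_at - window, incident_at + window
    tagged = [
        [(ts, name, msg) for ts, msg in events if start <= ts <= end]
        for name, events in sources.items()
    ]
    # heapq.merge keeps the combined stream sorted by timestamp
    return list(merge(*tagged))


incident_at = datetime(2021, 1, 15, 10, 30)
sources = {
    "config": [(datetime(2021, 1, 15, 10, 26), "connection pool size changed")],
    "database": [(datetime(2021, 1, 15, 10, 28), "slow query on orders table")],
    "app_log": [
        (datetime(2021, 1, 15, 9, 0), "scheduled job finished"),
        (datetime(2021, 1, 15, 10, 29), "timeout calling payment gateway"),
    ],
}

timeline = incident_timeline(sources, incident_at)
for ts, source, message in timeline:
    print(ts.time(), source, message)
```

Read top to bottom, the merged timeline suggests a chain of causation an analyst could investigate: a configuration change, followed by a slow query, followed by application timeouts.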

#3: Integrations for Painless Onboarding

A single AIOps tool is frequently insufficient to deal with the increasing complexity of hybrid digital environments; multiple tools are needed to provide visibility across disparate silos. In such a scenario, it is paramount that a tool provides three kinds of integrations: with APMs, to capture the telemetry data required to learn application behavior and generate insights; with a range of ITSM tools, so ticketing workflows can be automated; and with visualization and notification tools like Slack or JIRA, to foster collaboration among the troubleshooting teams. Some integrations, including those provided by ITSM platforms like ServiceNow, connect the AIOps tool to business processes outside IT operations, including HR, DevOps, SecOps, risk and governance. Others focus on container management solutions like Docker and Kubernetes, so DevOps and Agile methodologies can be implemented across the enterprise for continuous deployment, maintenance and management of applications.

#4: Scaling Intelligently for Business Growth

Thanks to numerous integrations with APM, network and infrastructure monitoring tools, there is a plethora of historical data available for an AIOps tool to work with. This data lake is invaluable when it comes to generating insights and planning for future scaling. However, simply examining the growth trends on system resources is not sufficient for forecasting capacity. It is also imperative to note any significant internal or external factors that will result in a surge of workload, such as the acquisition of a bank by a financial institution, a big sale for an e-commerce provider or a marketing event that is likely to cause a sudden increase in website traffic. The planning process becomes more intelligent when workload growth trends and system hotspots are analyzed in conjunction with each other, which leads to better-informed forecasting recommendations. The objectives of such an exercise are to compute workload growth and identify capacity choke points, as well as to compute business-aligned capacity forecasts with a what-if analysis. This helps identify incorrectly provisioned resources on the cloud, so that infrastructure budgets can be cut back or resources scaled up as the need arises.
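A what-if analysis of this kind can be sketched in a few lines of Python. The example below assumes a simple model: organic workload growth compounds monthly, and a business event multiplies the resulting peak by a fixed factor. The growth rate, multiplier, capacity figure and function names are hypothetical, and a production forecast would be fitted to historical data rather than hard-coded.

```python
def forecast_peak_load(current_peak, monthly_growth, months, event_multiplier=1.0):
    """Compound organic growth, then apply a one-off surge multiplier."""
    organic = current_peak * (1 + monthly_growth) ** months
    return organic * event_multiplier


def what_if(current_peak, monthly_growth, months, capacity, scenarios):
    """Return {scenario: (forecast_load, shortfall)}.

    A positive shortfall means the scenario exceeds provisioned capacity;
    a negative one means there is spare headroom.
    """
    results = {}
    for name, multiplier in scenarios.items():
        load = forecast_peak_load(current_peak, monthly_growth, months, multiplier)
        results[name] = (load, load - capacity)
    return results


# Made-up inputs: 1,000 req/s today, 5% monthly growth, 2,000 req/s provisioned
results = what_if(
    current_peak=1000,
    monthly_growth=0.05,
    months=6,
    capacity=2000,
    scenarios={"organic growth only": 1.0, "big sale event": 2.5},
)

for name, (load, shortfall) in results.items():
    status = "scale up" if shortfall > 0 else "headroom available"
    print(f"{name}: forecast {load:.0f} req/s, {status}")
```

With these numbers, organic growth alone stays within capacity, while the sale event scenario exceeds it, which is exactly the kind of gap that a trend-only forecast would miss.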


Not all AIOps tools are the same. Evaluating your options based on these four capabilities can help ensure you are set up to move from a break-and-fix to a predict-and-prevent model. AIOps adopters and decision makers throughout the IT operations management lifecycle need to evaluate where current AIOps offerings fall short in three areas: mitigating downtime, reducing the cost and effort of issue resolution, and planning business-aligned growth to manage costs in a multi-cloud environment. Moving toward preventive healing is the only way forward in these tumultuous times to ensure your data center is up and running 24x7 and your applications are always available.



George Thangadurai 

George Thangadurai is the CEO of HEAL Software Inc., the innovator of the game-changing preventive healing software for enterprises known as HEAL, which fixes problems before they happen.

Published Monday, March 01, 2021 7:40 AM by David Marshall