By George Thangadurai, CEO, HEAL Software Inc.
The COVID-19 pandemic has brought into sharp focus the need
to cut data center costs while maximizing application uptime. Artificial
intelligence for IT operations (AIOps) tools have proven to be the way
forward for enterprises to streamline their operations and minimize costs. COVID
has had a huge impact on businesses and their operations, with individuals
working from home using remote collaboration, network operations centers functioning
with skeletal staff and online businesses seeing exponentially increasing traffic.
Consequently, downtime is a strict no-no, and incident resolution should be as
quick and painless as possible.
AIOps tools provide observability across silos in
heterogeneous environments, proactive detection of anomalies and event correlation
that aids rapid root cause analysis and reduces mean time to resolve (MTTR). They
also derive insights and business intelligence on top of the huge volumes of
telemetry and transaction tracing data that they capture. In this article, we
look at some critical capabilities that tools need to possess to progress
beyond the traditional role that IT operations teams are expected to perform
and instead achieve the Holy Grail of a zero-downtime enterprise.
#1: Moving from "Break-and-Fix" to Preventive Healing
Many modern AIOps tools respond only after the fact.
In this model, artificial intelligence and machine learning (AI/ML) techniques
detect issues, and remediation can then be automated with digital workflows
through IT service management (ITSM) integrations. However, with
preventive healing systems it is possible to go a step further and predict the
issues before they occur via patented techniques like workload-behavior
correlation. This essentially considers the effect that a certain workload
signature has on underlying system resources and flags those transaction
patterns which are likely to cause an issue imminently. By flagging a potential
issue before it even occurs and putting in place techniques to avert it - like
the dynamic optimization or "shaping" of workload so the underlying system
behavior remains unaffected, or dynamically provisioning additional resources
in cloud environments so the system can handle workload surges - the enterprise
can truly move toward zero downtime.
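To make the idea of relating workload to resource behavior concrete, here is a minimal sketch. It is not the patented workload-behavior correlation technique, whose details are not described here; it substitutes a plain least-squares fit between one workload measure and one resource metric. The names `fit_least_squares` and `flag_risky_workload`, the 85% CPU limit and the sample figures are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class LinearModel:
    """Straight-line relationship: resource usage = slope * workload + intercept."""
    slope: float
    intercept: float

    def predict(self, workload: float) -> float:
        return self.slope * workload + self.intercept


def fit_least_squares(workload: list[float], usage: list[float]) -> LinearModel:
    """Ordinary least squares with one predictor (workload) and one response (usage)."""
    n = len(workload)
    if n < 2 or n != len(usage):
        raise ValueError("need at least two paired samples")
    mean_x = sum(workload) / n
    mean_y = sum(usage) / n
    sxx = sum((x - mean_x) ** 2 for x in workload)
    if sxx == 0:
        raise ValueError("workload samples must vary")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(workload, usage))
    slope = sxy / sxx
    return LinearModel(slope, mean_y - slope * mean_x)


def flag_risky_workload(model: LinearModel, expected_workload: float,
                        limit_pct: float = 85.0) -> bool:
    """Flag a workload whose predicted resource usage reaches the (assumed) limit."""
    return model.predict(expected_workload) >= limit_pct


# Hypothetical history: transactions per minute vs. CPU utilization (%).
history_tpm = [100.0, 200.0, 300.0, 400.0]
history_cpu = [20.0, 35.0, 50.0, 65.0]
model = fit_least_squares(history_tpm, history_cpu)

calm = flag_risky_workload(model, 500.0)    # predicted usage stays under the limit
surge = flag_risky_workload(model, 600.0)   # predicted usage crosses the limit
```

A flagged workload would be the trigger for the averting actions described above, such as shaping the workload or provisioning additional resources. A real system would model many transaction types and resources at once rather than a single pair.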
#2: Making Sense of Collected Data
Most AIOps and application performance management (APM)
tools are riding the next wave of monitoring, which is observability. Gartner's
latest report "Innovation Insight for Observability" addresses how
organizations can now use telemetry data captured from various sources for
supporting the health, agility and innovation of applications, and help DevOps
and site reliability engineering (SRE) teams reduce application downtime,
minimize incident resolution effort and improve overall customer experience. Observability
provides a common platform between development, operations and reliability
engineers to interpret system state and behavior. It is imperative that this
data is used effectively, the most significant application of it being to aid
and accelerate root cause analysis. Time-synchronized data captured by the
AIOps tool when an incident occurs should contain enough diagnostic detail on
the state of the system to allow IT operations analysts to
establish the chain of causation, accurately pinpoint the origin of the failure
and take steps to address it. Such data would typically include logs,
forensics, query-level statistics from the database, code-level tracing and
instrumentation, and configuration change tracking. This helps teams
significantly reduce MTTR on issues that cannot be predicted in advance
(like hardware glitches, network and storage outages or unavailability of
third-party dependencies like APIs and payment gateways).
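As a rough illustration of what "time-synchronized" diagnostic data can look like in practice, the sketch below merges events from several sources into one ordered timeline around an incident. The `Event` type, the source labels and the window sizes are hypothetical and do not reflect any particular tool's data model.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    source: str   # illustrative labels, e.g. "log", "db_query_stats", "config_change"
    detail: str


def incident_timeline(events: list[Event], incident_at: datetime,
                      before: timedelta = timedelta(minutes=15),
                      after: timedelta = timedelta(minutes=5)) -> list[Event]:
    """Return events inside the window around the incident, oldest first."""
    start, end = incident_at - before, incident_at + after
    window = [e for e in events if start <= e.timestamp <= end]
    return sorted(window, key=lambda e: e.timestamp)


incident = datetime(2021, 1, 10, 12, 0)
events = [
    Event(datetime(2021, 1, 10, 11, 58), "db_query_stats", "slow query on orders table"),
    Event(datetime(2021, 1, 10, 9, 0), "log", "routine batch job finished"),
    Event(datetime(2021, 1, 10, 11, 50), "config_change", "connection pool size reduced"),
    Event(datetime(2021, 1, 10, 12, 1), "log", "checkout requests timing out"),
]
timeline = incident_timeline(events, incident)
```

Reading the resulting timeline in order (configuration change, then slow query, then timeouts) is the kind of chain of causation an analyst would want to establish. The unrelated morning event falls outside the window and is left out.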
#3: Integrations for Painless Onboarding
A single AIOps tool is frequently insufficient to deal with
the increasing complexity of hybrid digital environments; multiple tools are
needed to provide visibility across disparate silos. In such a scenario, it is
paramount that a tool provides integrations with APMs to capture required
telemetry data to learn application behavior and generate insights, a gamut of
ITSM tools so automation of ticketing workflows can be achieved, and
visualization and notification tools like Slack or Jira to foster collaboration
among the troubleshooting teams. Some integrations, including those provided by
ITSM platforms like ServiceNow, connect the AIOps tool to business processes
outside IT, including HR, DevOps, SecOps, risk and governance. Others focus on
providing integrations with container management solutions like Docker and
Kubernetes so DevOps and Agile methodologies can be implemented across the
enterprise for continuous deployment, maintenance and management of
applications.
#4: Scaling Intelligently for Business Growth
Thanks to numerous integrations with APM, network and
infrastructure monitoring tools, there is a plethora of historical data
available for an AIOps tool to work with. This data lake is invaluable when it
comes to generating insights and planning for future scaling. However, simply
examining the growth trends on system resources is not sufficient for
forecasting capacity. It is also imperative to note any significant internal or
external factors which are going to result in a surge of workload, e.g. the
acquisition of a bank by a financial institution, a big sale for an e-commerce
provider or a marketing event which is likely to result in a sudden increase in
traffic on a website. The planning process becomes more intelligent when
workload growth trends and system hotspots are analyzed together, which
yields better-informed forecasting
recommendations. The objectives of such an exercise are to compute workload
growth and identify capacity choke points, as well as to compute business-aligned
capacity forecasts with a what-if analysis. This helps identify incorrectly
provisioned resources on the cloud to cut back on infrastructure budgets or
scale up resources as the need may arise.
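A what-if analysis of this kind can be sketched in a few lines. The example below assumes compound monthly workload growth and models a business event (such as a big sale) as a simple multiplier on load. The function name `months_until_exhausted` and every figure in it are made up for illustration, not taken from any product.

```python
from __future__ import annotations


def months_until_exhausted(current_load: float, monthly_growth_rate: float,
                           capacity: float, surge_multiplier: float = 1.0,
                           horizon_months: int = 36) -> int | None:
    """Return the first month (0 = today) in which projected load exceeds capacity.

    Assumes compound growth; surge_multiplier scales the load for a what-if
    scenario such as a sale or marketing event. Returns None if capacity holds
    for the whole horizon.
    """
    for month in range(horizon_months + 1):
        projected = current_load * (1 + monthly_growth_rate) ** month * surge_multiplier
        if projected > capacity:
            return month
    return None


# Hypothetical system: 1,000 transactions/sec today, 5% monthly growth,
# provisioned for 1,500 transactions/sec.
baseline = months_until_exhausted(1000.0, 0.05, 1500.0)
with_sale = months_until_exhausted(1000.0, 0.05, 1500.0, surge_multiplier=1.3)
```

Under these assumptions the organic trend alone leaves about nine months of headroom, while a sale that lifts traffic by 30% would exhaust capacity in about three. That gap is why growth trends and known business events need to be considered together.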
Conclusion
Not all AIOps tools are the same. Evaluating your options
based on these four features can help ensure you are set up to move from a
break-and-fix to a predict-and-prevent model. AIOps adopters and decision
makers throughout the IT operations management lifecycle need to evaluate the
gaps that current AIOps offerings suffer from as they apply to the mitigation
of downtime, reduction of cost and effort in issue resolution and
business-aligned growth planning to manage costs in a multi-cloud environment.
Moving toward preventive healing is the only way forward in these tumultuous
times to ensure your data center is up and running 24x7 and your applications
are always available.
##
ABOUT THE AUTHOR
George Thangadurai is the CEO at HEAL Software Inc., the innovator of the
game-changing preventive healing software for enterprises known as HEAL, which
fixes problems before they happen.