Industry executives and experts share their predictions for 2024. Read them in this 16th annual VMblog.com series exclusive.
In 2024, Reliability is Not an Afterthought: As You Rush Towards Responsible AI, Don't Forget About Reliability
By
Brian Singer, Co-Founder and Chief Product Officer, Nobl9
2023 was a year of increased focus on the
reliability of software systems in an effort to optimize resources. As we
transition into 2024, this emphasis is intensifying, mainly toward bolstering
the security, privacy, and overall reliability of AI models. But as
organizations rush towards these new, exciting AI tools, it is crucial to
remember the roots of reliability in cloud infrastructure.
Responsible infrastructure management is
critical to the ongoing success of digital organizations, starting with setting
reasonable service level objectives (SLOs). Beyond the widespread adoption of
SLOs (Gartner predicts that 75% of enterprises will be using SLOs by 2025),
2024 will see successful firms reevaluating their cloud strategies and
undergoing cloud repatriation.
Ultimately, ensuring compliance and accuracy
in AI models should not be treated as a standalone effort.
Organizations can elevate their Site Reliability Engineering (SRE) tactics and
SLOs by applying them to complex algorithms and current manual measurements in
MLOps use cases. This strategic application can proactively trigger warnings or
automatically retrain models as necessary.
Along with cloud repatriation and AI model
management, businesses will turn to self-managed Kubernetes to streamline
costs. In order to properly manage these systems internally, businesses must
have granular visibility into their performance, meaning this shift will
transform how companies delegate responsibility for reliability standards.
Centralizing Responsibility for Reliability
Despite high hopes, the promised
savings from lift-and-shift style cloud migrations to public
providers never materialized. With a continued focus on optimization and bottom-line
profitability, firms will repatriate workloads that cost too much to run in the
cloud - gone are the days of huge cloud commits. As firms get smarter about
their spend, applications will only be moved to the cloud as they are
modernized (decomposed into microservices) or when there is a real benefit from
running in the cloud versus elsewhere. Further, bare-metal offerings from Equinix
and Linode are too cost-efficient to ignore.
As organizations seek cost
savings through cloud repatriation, reliability is not an afterthought but is
central to success at every step. More firms will begin centralizing
responsibility for reliability standards while "democratizing" the
work to application, development, and platform teams. Defining reliability
standards is itself a full-time job and is aligned with production engineering
and platform engineering. Thus, "Head of Reliability" will become
more of an industry standard, similar to Head of Privacy or CSO.
AI Compliance and Cost Management
Focusing first on the
reliability of cloud infrastructure informs organizations as they rush toward
AI models. As firms dip their toes into using their own data to train AI
models, they will experience sticker shock at the cost. Training is the most
expensive part of AI; therefore, teams will need to set explicit controls on
training infrastructure while achieving reasonable - but not perfect -
accuracy of their models. Once again, SLOs are the answer to balance cost
versus accuracy.
SLOs can be applied to optimize
a data model's performance and automate the retraining process. Organizations
can take an SLO approach to allocate an error budget to a model's performance,
then introduce actions such as triggering alerts or automatic retraining if the
model does not meet that budget. This gives a competitive advantage to companies eager to build reliable AI
models, as SLOs ensure the models are retrained efficiently, only when
necessary, with limited time spent monitoring model performance.
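As a rough illustration, the error-budget approach above can be sketched in a few lines of Python. This is a minimal sketch, not Nobl9's product or API; the SLO target, the accuracy window, and the function names are all illustrative assumptions:

```python
# Hypothetical sketch: spending an SLO error budget on model accuracy.
# The 0.95 target and the observation window are illustrative values.

def remaining_error_budget(accuracy_history, slo_target=0.95):
    """Fraction of the error budget still unspent over the window.

    Each observed accuracy below the SLO target burns budget; the
    budget itself is the allowed miss rate (1 - slo_target) applied
    to the number of observations in the window.
    """
    allowed_misses = (1 - slo_target) * len(accuracy_history)
    misses = sum(1 for acc in accuracy_history if acc < slo_target)
    if allowed_misses == 0:
        return 0.0 if misses else 1.0
    return max(0.0, 1.0 - misses / allowed_misses)

def should_retrain(accuracy_history, slo_target=0.95):
    """Trigger retraining only once the error budget is exhausted."""
    return remaining_error_budget(accuracy_history, slo_target) <= 0.0

# A model that consistently meets its target keeps its budget intact,
# so no retraining (or alert) fires...
healthy = [0.97, 0.96, 0.98, 0.96] * 5
print(should_retrain(healthy))    # False

# ...while a degraded model burns through the budget and triggers action.
degraded = [0.97, 0.93, 0.92, 0.94] * 5
print(should_retrain(degraded))   # True
```

The point of gating on the budget, rather than on any single bad measurement, is exactly the efficiency the article describes: retraining runs only when the model has genuinely drifted out of tolerance, not on every transient dip.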
Self-Managed Kubernetes
Alongside repatriating lift-and-shift workloads,
more firms will seek to optimize costs by managing the Kubernetes control plane
themselves. By doing so, they gain more workload control and portability,
making running hybrid or multi-cloud easier. OpenShift will remain popular in
the enterprise, but monetizing it is very hard.
Self-managed Kubernetes is great
from a cost standpoint, but in order to make the shift successfully,
organizations need granular visibility into the performance of the applications
on the platform. Teams must take responsibility for properly managing Kubernetes
by setting their own reliability goals and building out metrics to manage the
reliability of Kubernetes resources.
For Kubernetes operators, SLOs
offer the most consistent measurement of the performance of every application
in their clusters. As Kubernetes clusters increase in complexity, SLOs provide
a clear understanding of system health in a universal and context-free way
- a huge advantage to platform owners who likely do not understand the
vast number of applications they have running. By tailoring SLOs to specific
business objectives, Kubernetes operators can address problems quickly with
targeted actions, review service performance over time, and optimize resource
allocation to avoid unnecessary costs. In the context of self-managed
Kubernetes, SLOs ensure the smooth operation of services by empowering
operators with objective performance measurements.
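What that universal, context-free measurement might look like can be sketched with a simple request-based availability SLI, the same formula applied to every service regardless of what the application does. This is a minimal sketch under assumed inputs; the service names, request counts, and 99.9% target are invented for illustration:

```python
# Hypothetical sketch: one availability SLO applied uniformly across
# every service in a cluster. Names and numbers are made up.

def availability(good_requests, total_requests):
    """The availability SLI: ratio of successful requests."""
    return good_requests / total_requests if total_requests else 1.0

def slo_report(services, target=0.999):
    """Flag every service that is outside its availability SLO.

    `services` maps a service name to (good_requests, total_requests);
    the same target applies cluster-wide, so the operator needs no
    knowledge of what each application actually does.
    """
    report = {}
    for name, (good, total) in services.items():
        sli = availability(good, total)
        report[name] = {"sli": sli, "meets_slo": sli >= target}
    return report

services = {
    "checkout": (99_950, 100_000),   # 99.95% -- within the 99.9% SLO
    "payments": (99_800, 100_000),   # 99.80% -- out of SLO, needs attention
}
for name, status in slo_report(services).items():
    print(name, status)
```

Because the same SLI and target cover every workload, a platform owner can rank problems across dozens of unfamiliar applications from a single report, which is the "context-free" advantage the paragraph above describes.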
Prioritizing Reliability In 2024
Whether an organization is
bringing cloud resources back on-premises, rushing to create AI models
responsibly, or taking internal ownership of Kubernetes, reliability is key in
2024. Firms will do well to remember that despite the hype around AI, reliable
infrastructure drives business success. Organizations need to get serious about
building a culture of reliability in every aspect of their business through
SLOs and innovative SRE strategies.
##
ABOUT THE AUTHOR
Brian Singer is the Co-Founder and
Chief Product Officer at Nobl9. Singer is a
product-focused entrepreneur with a passion for enterprise software, cloud
computing, and reliability, and has in-depth knowledge of cloud and SaaS
technologies and markets. He is an experienced leader, skilled at translating
customer requirements into scalable business models and executing on ideas to
drive revenue growth. Singer holds a BS in Computer Engineering from Brown
University and an MBA from MIT.