Virtualization Technology News and Information
Nobl9 2024 Predictions: In 2024, Reliability is Not an Afterthought: As You Rush Towards Responsible AI, Don't Forget About Reliability

Industry executives and experts share their predictions for 2024.  Read them in this 16th annual VMblog.com series exclusive.

By Brian Singer, Co-Founder and Chief Product Officer, Nobl9

2023 was a year of increased focus on the reliability of software systems in an effort to optimize resources. As we transition into 2024, this emphasis is intensifying and extending to the security, privacy, and overall reliability of AI models. But as organizations rush toward these new, exciting AI tools, it is crucial to remember the roots of reliability in cloud infrastructure.

Responsible infrastructure management is critical to the ongoing success of digital organizations, starting with setting reasonable service level objectives (SLOs). Beyond the widespread adoption of SLOs (Gartner predicts that 75% of enterprises will be using SLOs by 2025), 2024 will see successful firms reevaluating their cloud strategies and undergoing cloud repatriation.

Ultimately, ensuring compliance and accuracy in AI models should not be treated as a standalone effort. Organizations can extend their Site Reliability Engineering (SRE) tactics and SLOs to cover complex algorithms and the measurements that are still performed manually in MLOps use cases. Applied this way, SLOs can proactively trigger warnings or automatically retrain models as necessary.

Along with cloud repatriation and AI model management, businesses will turn to self-managed Kubernetes to streamline costs. In order to properly manage these systems internally, businesses must have granular visibility into their performance, meaning this shift will transform how companies delegate responsibility for reliability standards.

Centralizing Responsibility for Reliability

Despite high hopes, the promised savings never materialized from lift-and-shift cloud migrations to public providers. With a continued focus on optimization and bottom-line profitability, firms will repatriate workloads that cost too much to run in the cloud - gone are the days of huge cloud commitments. As firms get smarter about their spend, applications will move to the cloud only as they are modernized (decomposed into microservices) or when there is a real benefit to running in a cloud versus elsewhere. Further, bare-metal offerings from Equinix and Linode are too cost-efficient to ignore.

As organizations seek cost savings through cloud repatriation, reliability is not an afterthought but is central to success at every step. More firms will begin centralizing responsibility for reliability standards while "democratizing" the work to application, development, and platform teams. Defining reliability standards is itself a full-time job and is aligned with production engineering and platform engineering. Thus, "Head of Reliability" will become more of an industry standard, similar to Head of Privacy or CSO.

AI Compliance and Cost Management

Focusing first on the reliability of cloud infrastructure informs organizations as they rush toward AI models. As firms dip their toes into using their own data to train AI models, they will experience sticker shock at the cost. Training is the most expensive part of AI; therefore, teams will need to set explicit controls on training infrastructure while achieving reasonable - but not perfect - accuracy of their models. Once again, SLOs are the answer to balance cost versus accuracy.

SLOs can be applied to optimize a data model's performance and automate the retraining process. Organizations can take an SLO approach to allocate an error budget to a model's performance, then introduce actions such as triggering alerts or automatic retraining if the model does not meet that budget. This gives a competitive advantage to companies eager to build reliable AI models, as SLOs ensure the models are retrained efficiently, only when necessary, with limited time spent monitoring model performance.
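The error-budget approach above can be sketched in a few lines. This is a minimal illustration, not a specific product API: the function names, the 95% target, and the event counts are all hypothetical, and a real system would pull these counts from a metrics backend over a rolling window.

```python
# Hypothetical sketch: retrain a model only when its SLO error
# budget is exhausted. All names and thresholds are illustrative.

def error_budget_remaining(successes: int, total: int, slo_target: float) -> float:
    """Fraction of the error budget left over a window.

    The budget is the allowed failure count, (1 - slo_target) * total.
    Returns 1.0 when the budget is untouched, <= 0.0 when exhausted.
    """
    if total == 0:
        return 1.0
    allowed_failures = (1.0 - slo_target) * total
    actual_failures = total - successes
    if allowed_failures == 0:
        return 1.0 if actual_failures == 0 else 0.0
    return 1.0 - actual_failures / allowed_failures


def should_retrain(successes: int, total: int, slo_target: float = 0.95) -> bool:
    """Trigger retraining only when the budget is spent."""
    return error_budget_remaining(successes, total, slo_target) <= 0.0


# 9,700 acceptable predictions out of 10,000 against a 95% target:
# 300 failures against a budget of 500, so roughly 40% of the
# budget remains and no retraining is triggered yet.
```

The point of gating on the budget rather than on any single bad prediction is exactly the efficiency the article describes: the model is retrained only when its aggregate performance has measurably drifted, not every time a metric blips.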

Self-Managed Kubernetes

Alongside the move away from lift and shift, more firms will seek to optimize costs by managing the Kubernetes control plane themselves. By doing so, they gain more workload control and portability, making it easier to run hybrid or multi-cloud. OpenShift will remain popular in the enterprise, but monetizing it is very hard.

Self-managed Kubernetes is great from a cost standpoint, but in order to make the shift successfully, organizations need granular visibility into the performance of the applications on the platform. Teams must take responsibility for properly managing Kubernetes by setting their own reliability goals and building out metrics to manage the reliability of Kubernetes resources.

For Kubernetes operators, SLOs offer the most consistent measurement of the performance of every application in their clusters. As Kubernetes clusters increase in complexity, SLOs provide a clear understanding of system health in a universal and context-free way - a huge advantage to platform owners who likely do not understand the vast number of applications they have running. By tailoring SLOs to specific business objectives, Kubernetes operators can address problems quickly with targeted actions, review service performance over time, and optimize resource allocation to avoid unnecessary costs. In the context of self-managed Kubernetes, SLOs ensure the smooth operation of services by empowering operators with objective performance measurements.
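The "universal and context-free" quality of SLOs can be made concrete with a small sketch. Assuming each service simply reports good and total event counts for the current window (the service names, counts, and 99% target below are invented for illustration), one uniform check covers every application in the cluster:

```python
# Hypothetical sketch: a single context-free SLO check applied to
# every service in a cluster. Inputs and targets are illustrative.

from typing import Dict, List, Tuple


def failing_services(
    counts: Dict[str, Tuple[int, int]],  # service -> (good_events, total_events)
    slo_target: float = 0.99,
) -> List[str]:
    """Return services whose success ratio falls below the SLO target.

    The same calculation applies to every application, so a platform
    team can triage without per-service domain knowledge.
    """
    failing = []
    for service, (good, total) in counts.items():
        sli = good / total if total else 1.0
        if sli < slo_target:
            failing.append(service)
    return sorted(failing)


# Example window: "checkout" at 98.5% misses a 99% target,
# while "search" at 99.5% meets it.
window = {"checkout": (985, 1000), "search": (995, 1000)}
```

Because the check is identical for every workload, it scales with cluster complexity in a way that bespoke per-application dashboards do not, which is the advantage the article attributes to platform owners.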

Prioritizing Reliability In 2024

Whether an organization is bringing cloud resources back on-premises, rushing to create AI models responsibly, or taking internal ownership of Kubernetes, reliability is key in 2024. Firms will do well to remember that despite the hype around AI, reliable infrastructure drives business success. Organizations need to get serious about building a culture of reliability in every aspect of their business through SLOs and innovative SRE strategies.

##

ABOUT THE AUTHOR

Brian Singer 

Brian Singer is the Co-Founder and Chief Product Officer at Nobl9. Singer is a product-focused entrepreneur with a passion for enterprise software, cloud computing, and reliability, and has in-depth knowledge of cloud and SaaS technologies and markets. He is an experienced leader, skilled at translating customer requirements into scalable business models and executing on ideas to drive revenue growth. Singer holds a BS in Computer Engineering from Brown University and an MBA from MIT. 

 

Published Friday, January 26, 2024 7:32 AM by David Marshall