The Cost of Cloud Doesn't Need to Go Up

In 2011 I left the Drupal world to try out something new. I joined HP Cloud, a part of Hewlett-Packard that genuinely felt like a startup within a megacorp. On paper, our mission was something suitably visionary, but in our vernacular we expressed it this way: "There is no way we are gonna let an upstart bookstore win the cloud market."

Of course, Amazon did just that. And they did it so successfully that when the name "Amazon" comes up today, we think of AWS before we think of books.

Services like AWS, Microsoft Azure, and Google Cloud rose to prominence in the years following the economic downturn of 2007. And for more than a decade, cloud enjoyed the booming conditions of a good economy. So it was not at all hard for tech-forward organizations to justify moving existing workloads to the cloud, testing myriad cloud services, and ultimately adopting a robust set of offerings far more sophisticated than most of us could have operated on-prem. And during that time there was almost no economic pressure to keep our eyes on our pocketbooks.

Things changed in 2023. Macroeconomic changes (inflation, interest rates, stock market contractions) translated into the direct mandate to cut down on our cloud spending. And at the same time, generative AI jumped to the top of every CTO's wishlist. The cost of AI-grade GPUs ran completely counter to the need to reduce spending.

We find ourselves in a conundrum between wanting (and perhaps needing) to take advantage of AI technology while keeping a tighter rein on the budget. Can it be done?

Rethinking Compute

One of the reasons our compute spend has gotten so high is that we have adopted a new set of design patterns for high availability, reliability, and fault tolerance. Systems like Kubernetes are built to take one long-running service and maintain a pool of identical replicas of it - three, five, seven, or even more instances of the same server. Those instances consume resources whether they are active or idle. And because containers can take tens of seconds to start up, Kubernetes keeps these replicas running even when traffic is light.

We are paying for compute that we are not using.

The obvious (but, in my opinion, short-sighted) way of handling a situation like this is to optimize for cost: "right-size" compute resources, choose cheaper SKUs, make use of spot instances. And sure enough, a plethora of cloud cost control tools are commercially available to help you engineer a cheaper cloud. But that does not address the deeper problem: the system still wastefully runs compute that is not being utilized. (It's just slightly cheaper compute.)

A more promising route to longer-term cost savings starts with a question: Is there a way to build applications to be inherently cheaper to operate?

And it turns out that the bookstore (yup, Amazon again) has the answer, at least in theory. Ten years ago, AWS introduced a service that enabled it to make better use of its own spare compute. It was called AWS Lambda, and it was predicated on the notion that developers would write event handlers instead of long-running servers.

With Lambda, when a new HTTP request was received, the Lambda system would instantiate a "serverless function" - a small bit of code that you (the AWS user) wrote to handle a single request and send back a single response. That function would run to completion and then shut down.

With this model, instead of writing a server that listened on a port for days, weeks, or months, developers just wrote small snippets of code focused almost entirely on accomplishing a single task. And this serverless function ran on-demand, for only a few seconds.
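
To make that shape concrete, here is a minimal sketch of such a handler in Rust using the lambda_http crate. The handler name and response text are placeholders, and the same pattern exists in every language Lambda supports: one function receives a request and returns a response, with no listening loop of its own.

    use lambda_http::{run, service_fn, Body, Error, Request, Response};

    // One request in, one response out. The platform decides when (and how
    // many copies of) this runs; there is no long-lived server process here.
    async fn handler(_event: Request) -> Result<Response<Body>, Error> {
        let resp = Response::builder()
            .status(200)
            .header("content-type", "text/plain")
            .body("Hello from a serverless function".into())?;
        Ok(resp)
    }

    #[tokio::main]
    async fn main() -> Result<(), Error> {
        // Hand the handler to the Lambda runtime, which invokes it per event.
        run(service_fn(handler)).await
    }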

This new way of doing things took a while to catch on, but these days AWS reports that it runs over 10 trillion Lambda invocations a month. And because functions consume so much less compute power, they can be much cheaper. As David Anderson reports in his book "The Value Flywheel Effect", Liberty Mutual (an insurance company) tried switching a single web application to the serverless function pattern, which "reduced maintenance cost... from $50,000 a year to $10 a year." I doubt that every migration will have that big of a cost impact, but even if the reduction is a more modest 20%, that still results in notable savings.

Still, Lambda has a problem. It takes more than 200 milliseconds to cold-start a function. And that's before your code is even executed. A web developer will immediately point out to you that a delay of that size is unacceptable for the modern web. Users (and search engines) expect to see results starting at the 100 millisecond mark.

That's where a new technology shines. WebAssembly, which was originally built for the web browser, is a perfect fit for serverless functions. The open source project Spin can cold-start a serverless function in under one millisecond, and that speed is almost entirely due to the performance characteristics of WebAssembly. And because WebAssembly is portable across operating systems, processor architectures, and cloud vendors, it can run anywhere. You are free from lock-in.
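
For comparison, here is a minimal sketch of the same idea written as a Spin HTTP component in Rust. Exact type and attribute names vary by Spin SDK version, so treat this as illustrative rather than canonical.

    use spin_sdk::http::{IntoResponse, Request, Response};
    use spin_sdk::http_component;

    // Compiled to WebAssembly; Spin instantiates the component per request,
    // so nothing stays resident (or billed) between requests.
    #[http_component]
    fn handle(_req: Request) -> anyhow::Result<impl IntoResponse> {
        Ok(Response::builder()
            .status(200)
            .header("content-type", "text/plain")
            .body("Hello from a Wasm serverless function")
            .build())
    }

Because the component holds no resources between invocations, compute is consumed only while a request is actually being handled.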

Because WebAssembly combines the low cost of Lambda-like functions - where you pay for only seconds of use instead of hours - with a performance profile that lets you write anything from websites to pubsub handlers, this technology can have a tremendous impact on cloud spend while conferring the side benefit of smaller, more portable codebases.

But what about the cost of AI?

Timeslicing GPUs

Generative AI involves two major steps. The first is model training, in which you build a model to answer questions. The second is inferencing, in which you submit a prompt to a model and it calculates a response.

Training is done rarely, and is a highly specialized discipline. Inferencing is done often, and is already in the toolbox of the average web developer.

Right now, both are expensive. But inferencing, by far the more frequently used of the two, can be made much cheaper if we simply get better at resource utilization. And once more, serverless functions provide the route to a solution.

When a long-running server process is responsible for LLM inferencing, it needs access to an expensive GPU. And because no other process can predict when this process will be using the GPU, that GPU is effectively locked for the entire time the server is running. That may be days, weeks, or months. There are some ways in which we can split up a large GPU and share it among multiple processes, but even so, the processes claim their fraction of the GPU for the duration of their runtime.

Once again, the Lambda-style serverless function has an advantage. Because each function invocation handles only one request, a function's lifetime is somewhere between a few milliseconds and a few minutes. Thus, it only needs to lock a GPU for the duration of the time it is executing. When we built the Serverless AI feature of Fermyon Cloud (a cloud host that can run WebAssembly-based serverless functions), we built a scheduling system that can share GPU resources with hyper-efficiency. One single AI-grade GPU can be shared across hundreds of applications.
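
I won't reproduce the scheduler here, but the core idea - lease the GPU per request rather than per process - can be sketched in a few lines. This is a hypothetical illustration, not Fermyon's actual code; names like GpuPool and run_inference are invented for the example.

    use std::sync::Arc;
    use tokio::sync::Semaphore;

    // Hypothetical sketch: a fixed number of GPU slots shared by many
    // short-lived inference requests, rather than pinned to one server.
    struct GpuPool {
        slots: Arc<Semaphore>,
    }

    impl GpuPool {
        fn new(slot_count: usize) -> Self {
            Self { slots: Arc::new(Semaphore::new(slot_count)) }
        }

        async fn infer(&self, prompt: &str) -> String {
            // The GPU slot is held only while this one request executes...
            let _lease = self.slots.acquire().await.expect("pool closed");
            run_inference(prompt).await
            // ...and is released here, as soon as the lease is dropped.
        }
    }

    // Stand-in for the actual model call on the leased GPU.
    async fn run_inference(prompt: &str) -> String {
        format!("response to: {prompt}")
    }

    #[tokio::main]
    async fn main() {
        let pool = GpuPool::new(1); // one physical GPU, many callers
        println!("{}", pool.infer("What is WebAssembly?").await);
    }

Because each lease lasts only as long as a single invocation, idle time between requests from one application becomes capacity for another.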

At Civo Navigate in Austin, I will be giving a talk sharing how Fermyon Cloud accomplished this with a sparse set of Civo GPUs backed by Deep Green's sustainable architecture. There, I will share how we can swap AI workloads across NVIDIA A100s in just 50 milliseconds. This technique can be used to drive down the cost of GPU per application.

WebAssembly is the Third Wave of Cloud Computing

An online bookstore became a cloud powerhouse by allowing customers to run their own virtual machine images on Amazon's hardware. That was the first wave of cloud computing. In the second wave, Docker containers and Kubernetes changed the game again when they provided a better way to encapsulate a single application in a runnable unit (a container).

In the third wave of cloud computing, we can add WebAssembly into the mix as an ultra-efficient runtime with supersonic performance. WebAssembly runtimes are the perfect vehicle for executing serverless functions far faster (and with less overhead) than AWS Lambda. Not only that, but WebAssembly's portability means that the same serverless functions can run on any cloud without modification. In fact, they're equally at home on the far edge (like CDNs) or the near edge (like IoT devices). And the best part, at least in the current economic environment, is that all of these advantages come with cost savings.

The next wave of cloud computing is shaping up to be the cheapest wave of cloud computing.

##

ABOUT THE AUTHOR

Matt Butcher 

Matt Butcher is co-founder and CEO of Fermyon, the serverless WebAssembly in the cloud company. He is one of the original creators of Helm, Brigade, CNAB, OAM, Glide and Krustlet. He has written and co-written many books, including "Learning Helm" and "Go in Practice." He is a co-creator of the "Illustrated Children's Guide to Kubernetes" series. These days, he works mostly on WebAssembly projects such as Spin, Fermyon Cloud and Bartholomew. He holds a Ph.D. in Philosophy. He lives in Colorado, where he drinks lots of coffee.

https://www.linkedin.com/in/mattbutcher/

https://twitter.com/technosophos

https://www.fermyon.com/

Published Monday, February 19, 2024 7:33 AM by David Marshall