In 2011 I left the Drupal world to try
out something new. I joined HP Cloud, a part of Hewlett-Packard that genuinely
felt like a startup within a megacorp. On paper, our mission was something
suitably visionary, but in our vernacular we expressed it this way: "There is
no way we are gonna let an upstart bookstore win the cloud market."
Of course, Amazon did just that. And they
did it so successfully that when the name "Amazon" comes up today, we think of
AWS before we think of books.
Services like AWS, Microsoft Azure, and Google
Cloud rose to prominence in the years following the economic downturn of 2007-2008. And for
more than a decade, cloud enjoyed the booming conditions of a good economy. So it was not at all
hard for tech-forward organizations to justify moving existing workloads to the
cloud, testing myriad cloud services, and ultimately adopting a robust set of
offerings far more sophisticated than most of us could operate
on-prem. And during this time there was almost no downward economic pressure to
keep our eyes on our pocketbooks.
Things changed in 2023. Macroeconomic
pressures (inflation, rising interest rates, stock market contractions) translated into
a direct mandate to cut cloud spending. At the same time,
generative AI jumped to the top of every CTO's wish list. The cost of AI-grade
GPUs ran completely counter to the need to reduce spending.
We find ourselves in a conundrum: we want
(and perhaps need) to take advantage of AI technology while keeping
a tighter rein on the budget. Can it be done?
Rethinking Compute
One of the reasons our compute spend has
gotten so high is that we have adopted a new set of design patterns for high
availability, reliability, and fault tolerance. Systems like Kubernetes are
built to take one long-running service and maintain a pool of identical
replicas of it - three, five, seven, or even more instances of the
same server. Those instances consume resources whether active or idle. Because
containers can take tens of seconds to start up, Kubernetes keeps these replicas
running even when traffic is light.
We are paying for compute that we are not
using.
The obvious (but in my opinion
short-sighted) way of handling a situation like this is to optimize for cost:
"right-size" compute resources, choose cheaper SKUs, make use of spot
instances. And sure enough, a plethora of commercial cloud cost control tools are
available to help you engineer a cheaper cloud. But none of this addresses the
deeper problem: the system still wastefully runs compute that is not being
utilized. (It's just slightly cheaper compute.)
A more promising route to longer-term
cost savings starts with a question: Is there a way to build applications to be
inherently cheaper to operate?
And it turns out that the bookstore (yup,
Amazon again) has the answer, at least in theory. Ten years ago, AWS introduced
a service that enabled them to make better use of their own spare compute. It
was called AWS Lambda, and it was predicated on the notion that developers
would write event handlers instead of long-running servers.
With Lambda, when a new HTTP request was
received, the Lambda system would instantiate a "serverless function" - a small
bit of code that you (the AWS user) write to handle a single request and send
back a single response. That function would run to completion and shut down.
With this model, instead of writing a
server that listened on a port for days, weeks, or months, developers just
wrote small snippets of code focused almost entirely on accomplishing a single
task. And this serverless function ran on-demand, for only a few seconds.
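To make that shape concrete, here is a minimal sketch of such a handler in Rust, using the community lambda_runtime crate. The handler name and payload fields are illustrative, and the crate's exact types may differ by version:

```rust
// A minimal Lambda-style handler sketch (lambda_runtime + tokio + serde_json).
use lambda_runtime::{run, service_fn, Error, LambdaEvent};
use serde_json::{json, Value};

// The handler receives one event, computes one response, and finishes.
// There is no listening socket and no long-lived server process to pay for.
async fn handle(event: LambdaEvent<Value>) -> Result<Value, Error> {
    // "name" is a hypothetical field, purely for illustration.
    let name = event.payload["name"].as_str().unwrap_or("world");
    Ok(json!({ "message": format!("Hello, {name}!") }))
}

#[tokio::main]
async fn main() -> Result<(), Error> {
    // The runtime invokes `handle` once per event; when traffic stops,
    // nothing is left running.
    run(service_fn(handle)).await
}
```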
This new way of doing things took a while
to catch on, but these days AWS reports that it runs over 10 trillion Lambda
invocations a month. And because functions consume so much less compute power,
they can be much cheaper. As David Anderson reports in his book "The Value
Flywheel Effect," Liberty Mutual (an insurance company) tried switching a
single web application to the serverless function pattern, which "reduced
maintenance cost... from $50,000 a year to $10 a year." I doubt that every
migration will have that big of a cost impact, but even if the reduction is a
more modest 20%, that still results in notable savings.
Still, Lambda has a problem. It takes
more than 200 milliseconds to cold-start a function. And that's before your
code is even executed. A web developer will immediately point out to you that a
delay of that size is unacceptable for the modern web. Users (and search
engines) expect to see results starting at the 100 millisecond mark.
That's where a new technology shines.
WebAssembly, which was originally built for the web browser, is a perfect fit
for serverless functions. The open source project Spin can cold-start a serverless function in
under one millisecond, and this is almost entirely due to the performance
characteristics of WebAssembly. And because WebAssembly is portable across
operating systems, processor architectures, and cloud vendors, it can run
anywhere. You are free from lock-in.
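For a rough illustration, here is what a Spin HTTP component looks like in Rust, adapted from the Spin SDK's starter template (exact builder methods and types can vary across SDK versions):

```rust
use spin_sdk::http::{IntoResponse, Request, Response};
use spin_sdk::http_component;

// The component compiles to WebAssembly; Spin instantiates it per request
// (cold-starting in well under a millisecond) and tears it down afterward.
#[http_component]
fn handle_request(_req: Request) -> anyhow::Result<impl IntoResponse> {
    Ok(Response::builder()
        .status(200)
        .header("content-type", "text/plain")
        .body("Hello from a WebAssembly serverless function")
        .build())
}
```

The same compiled module can be deployed to any host with a Spin-compatible runtime, regardless of operating system or processor architecture.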
Because WebAssembly combines the low cost
of Lambda-like functions - where you're paying for only seconds of use instead
of hours - with a performance profile that lets you write anything from
websites to pubsub handlers, this technology can have a tremendous impact on
cloud spend while conferring the side benefit of smaller, more
portable codebases.
But what about the cost of AI?
Timeslicing GPUs
Generative AI involves two major steps.
The first is model training, in which you build a model to answer questions.
The second is inferencing, in which you submit a prompt to a model and it
calculates a response.
Training is done rarely, and is a highly
specialized discipline. Inferencing is done often, and is already in the
toolbox of the average web developer.
Right now, both are expensive. But
inferencing, by far the more frequently used, can be made much cheaper if we
simply use our resources more effectively. And once more,
serverless functions provide the route to a solution.
When a long-running server process is
responsible for LLM inferencing, it needs access to an expensive GPU. And
because no other process can predict when this process will be using the GPU,
that GPU is effectively locked for the entire time the server is running. That
may be days, weeks, or months. There are some ways to split up a
large GPU and share it among multiple processes, but even so, the processes
claim the fractional GPU for the duration of their runtime.
Once again, the Lambda-style serverless
function has an advantage. Because each function invocation handles only one
request, a function's lifetime is somewhere between a few milliseconds and a
few minutes. Thus, it only needs to lock a GPU for the time it is actually
executing. When we built the Serverless AI feature of Fermyon Cloud (a cloud
host that runs WebAssembly-based serverless functions), we built a
scheduling system that can share GPU resources with hyper-efficiency. A
single AI-grade GPU can be shared across hundreds of applications.
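As a sketch of what this looks like from the developer's side, here is a Spin-style handler that performs a single inference per request using the Spin SDK's llm module. The model name and API details are assumptions based on the SDK at the time of writing and may differ by version:

```rust
use spin_sdk::http::{IntoResponse, Request, Response};
use spin_sdk::{http_component, llm};

// The GPU is needed only while this one invocation runs; once the function
// returns, the scheduler can hand the GPU to another application.
#[http_component]
fn handle_inference(req: Request) -> anyhow::Result<impl IntoResponse> {
    // Treat the request body as the prompt (illustrative).
    let prompt = String::from_utf8_lossy(req.body()).to_string();

    // Run one inference against a hosted Llama 2 chat model, then release
    // the GPU. (Model identifier assumed from the Spin llm API.)
    let result = llm::infer(llm::InferencingModel::Llama2Chat, &prompt)
        .map_err(|e| anyhow::anyhow!("inference failed: {e:?}"))?;

    Ok(Response::builder()
        .status(200)
        .header("content-type", "text/plain")
        .body(result.text)
        .build())
}
```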
At Civo Navigate in Austin, I will be
giving a talk on how Fermyon Cloud accomplished this with a small set of
Civo GPUs backed by Deep Green's sustainable architecture. There, I will share
how we can swap AI workloads across NVIDIA A100s in just 50 milliseconds. This
technique can be used to drive down the GPU cost per application.
WebAssembly is the Third Wave of Cloud Computing
An online bookstore became a cloud
powerhouse by allowing customers to run their own virtual machine images on
Amazon's hardware. That was the first wave of cloud computing. In the second
wave, Docker containers and Kubernetes changed the game again when they
provided a better way to encapsulate a single application in a runnable unit (a
container).
In the third wave of cloud computing, we
can add WebAssembly into the mix as an ultra-efficient runtime with supersonic
performance. WebAssembly runtimes are the perfect vehicle for executing
serverless functions far faster (and with less overhead) than AWS Lambda. Not
only that, but WebAssembly's portability means that the same serverless
functions can run on any cloud without modification. In fact, they're equally
at home on the far edge (like CDNs) or the near edge (like IoT). And the best part, at
least in the current economic environment, is that all of these advantages
come with cost savings.
The next wave of cloud computing is
shaping up to be the cheapest wave of cloud computing.
##
ABOUT THE AUTHOR
Matt Butcher is co-founder and CEO of
Fermyon, the serverless WebAssembly in the cloud company. He is one of the
original creators of Helm, Brigade, CNAB, OAM, Glide and Krustlet. He has
written and co-written many books, including "Learning Helm" and
"Go in Practice." He is a co-creator of the "Illustrated
Children's Guide to Kubernetes" series. These days, he works mostly on
WebAssembly projects such as Spin, Fermyon Cloud and Bartholomew. He holds a
Ph.D. in Philosophy. He lives in Colorado, where he drinks lots of coffee.
https://www.linkedin.com/in/mattbutcher/
https://twitter.com/technosophos
https://www.fermyon.com/