By Keith
McClellan, Director, Partner Solutions Engineering, Cockroach Labs
Could Kubernetes be the next fault domain?
A fault domain is the area of a distributed system that is impacted when a critical piece of infrastructure or network service experiences problems. These days, almost all modern applications are architected as connected microservices running in containers in a cloud environment. Large and small organizations alike now deploy thousands of containers every day - a complexity of scale almost incomprehensible to the human mind. The vast majority of these organizations depend upon Kubernetes to orchestrate, automate, and manage these workloads.
That is a critical function, of course, but a
Kubernetes cluster is limited to a single datacenter or cloud region, and many
of these applications need to be distributed as close to users as possible. Kubernetes has the potential to help us
have a common operating experience across datacenters, cloud regions, and even
clouds by becoming the fault domain we design our highly available (HA)
applications to survive.
Let's say we want to build a three-region cluster. Without Kubernetes, even in a single cloud, that means managing all those VMs and setting up a bunch of scripts on each server to self-heal. If a server gets shut down or restarts, we have to write a bunch of Terraform or Ansible (or Puppet or Chef or Pulumi) scripts to regenerate our servers. Then, if we want to be cross-cloud, we have to do all that stuff three different ways! We gotta know the AWS way of doing it. We gotta know the Azure way of doing it. We gotta know the Google way of doing it.
Using Kubernetes, though, the only thing we need to know that's specific to Azure, AWS, or Google is how to stand up a Kubernetes cluster in each, and how to configure those clusters to provision infrastructure and communicate with one another, whether that's via private networking, a VPN, or TLS over the Internet. Once that's done, the rest of our administration work is largely the same, regardless of where our infrastructure lives.
Kubernetes effectively gives us a common operating system, regardless of where we're running infrastructure. It acts as our OS, abstracting away the complexities of whatever availability zone (AZ), region, or cloud we're running on. We get a common operating language regardless of location, plus all the great self-healing capabilities of Kubernetes.
This is great, but it is also how Kubernetes becomes the fault domain: because the perimeter of our Kubernetes cluster is now equal to the perimeter of the infrastructure it sits on top of, we can treat each Kubernetes cluster as if it were a datacenter or cloud region for HA purposes. So if either the region fails or the Kubernetes cluster fails, our applications handle that failure the same way.
By making them equivalent, we reduce the number of dimensions that we have to manage from an availability perspective. This dramatically simplifies the distributed application landscape, because the Kubernetes cluster becomes the only fault domain that we have to think through.
The problem is that Kubernetes isn't really designed to be treated as a fault
domain.
The next-generation problem for K8s
This is the next-generation problem that we
now need to solve: For us to be able to easily treat Kubernetes as the fault
domain for multi-region/multi-site clusters, Kubernetes itself needs to provide
a number of additional constructs to facilitate this pattern, and the
Kubernetes ecosystem and community have been chipping away at this problem for
quite a while.
That has led to various ways to purpose-build multi-region solutions for a particular application or application stack, but there is not yet a single unified strategy or solution to this problem area.
The most significant of these bespoke solution areas are networking and security, but there are also needs in the areas of infrastructure, failure recovery, observability, and monitoring. Networking, because of the need for connectivity between clusters, and then service discovery and traffic routing between those clusters. And security, because you need to make sure you don't have access sprawl, and you need a central trust authority.
Networking: Currently there are cross-cluster communication platforms like Cilium, Project Calico, Submariner, and Skupper. Each has its pluses and minuses, but none of them seem to be the one-size-fits-all solution you'd hope would exist.
Load
Balancing and Service Discovery: Once the clusters can
talk to each other, they need to be able to discover instances of different
services running cross-site. And there's a need for global load balancing that
allows users to be routed to the closest available instances regardless of
where they enter an application.
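To make that concrete, here is a minimal sketch, using the official Kubernetes Python client, of what naive cross-site discovery looks like today. It assumes one local kubeconfig with a hypothetical context per cluster and a hypothetical service name; real multi-cluster solutions automate this lookup and layer latency-aware, health-checked routing on top.

from kubernetes import client, config

CONTEXTS = ("aws-us-east", "gcp-europe-west", "azure-asia-east")  # hypothetical

def discover(service, namespace="default"):
    # Collect the ready pod IPs backing a service in every cluster.
    endpoints = {}
    for ctx in CONTEXTS:
        v1 = client.CoreV1Api(api_client=config.new_client_from_config(context=ctx))
        eps = v1.read_namespaced_endpoints(service, namespace)
        endpoints[ctx] = [
            addr.ip
            for subset in (eps.subsets or [])
            for addr in (subset.addresses or [])
        ]
    return endpoints

print(discover("cockroachdb-public"))

A global load balancer would then be responsible for steering each user to the nearest healthy set of those endpoints.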
Security: If you're administering a K8s cluster, generally speaking, you are going to have fairly privileged, low-level access. You effectively need to have the same level of security permissions in each cluster, and if you're not careful about managing them you can end up with more or less access than you need in a particular cluster. Unfortunately, security management in K8s is often still a pretty manual process, and it becomes more difficult the more distributed your application gets.
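As an illustration of what keeping those permissions in sync means in practice, here is a hedged sketch that applies one ClusterRole definition to every cluster from a single source of truth. The role and context names are hypothetical; doing this by hand, cluster by cluster, is exactly where drift creeps in.

from kubernetes import client, config

# One role definition, applied identically everywhere.
role = client.V1ClusterRole(
    metadata=client.V1ObjectMeta(name="db-operator"),  # hypothetical role
    rules=[client.V1PolicyRule(
        api_groups=["", "apps"],
        resources=["pods", "statefulsets", "secrets"],
        verbs=["get", "list", "watch"],
    )],
)

for ctx in ("aws-us-east", "gcp-europe-west", "azure-asia-east"):  # hypothetical
    rbac = client.RbacAuthorizationV1Api(
        api_client=config.new_client_from_config(context=ctx))
    rbac.create_cluster_role(role)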
Trust
and Identity: Right now, sharing a single trust and
identity source across multiple Kubernetes clusters is a somewhat painful
exercise, which exacerbates some of the other security issues you might run
into while running an application across multiple Kubernetes clusters. This
becomes even more important when you have interaction between pods across
sites, where an administrator may need to be connected to multiple Kubernetes
clusters concurrently for troubleshooting purposes.
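The same manual pattern applies to trust material today. As a rough sketch under the same assumptions as above, without a shared trust authority a root CA ends up being copied into each cluster by hand:

import base64
from kubernetes import client, config

# ca.crt is an assumed shared root certificate; the contexts are hypothetical.
ca_b64 = base64.b64encode(open("ca.crt", "rb").read()).decode()
secret = client.V1Secret(
    metadata=client.V1ObjectMeta(name="shared-root-ca"),
    type="Opaque",
    data={"ca.crt": ca_b64},
)

for ctx in ("aws-us-east", "gcp-europe-west", "azure-asia-east"):
    v1 = client.CoreV1Api(api_client=config.new_client_from_config(context=ctx))
    v1.create_namespaced_secret(namespace="default", body=secret)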
Infrastructure
and Performance: Currently, the Kubernetes primitives
that provision pods only let you declare "how much" of something you get,
without consideration for how performant that infrastructure is - for example,
you can ask for a CPU, or a volume of a specific size, but you can't request a
particular processor or guarantee the performance characteristics of a
drive. This means each site has to be
carefully tuned and monitored to make sure you don't have performance hot or
cold spots.
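A short sketch of that limitation, again with the Kubernetes Python client: both specs below say "how much," and neither has a field for a processor generation or an IOPS target.

from kubernetes import client

# Quantities only: 2 CPUs and 8Gi of memory, but no way to pin a CPU model.
container = client.V1Container(
    name="cockroachdb",
    image="cockroachdb/cockroach:latest",
    resources=client.V1ResourceRequirements(
        requests={"cpu": "2", "memory": "8Gi"},
        limits={"cpu": "2", "memory": "8Gi"},
    ),
)

# A 100Gi volume, but no field for latency or throughput guarantees.
claim = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="datadir"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        resources=client.V1ResourceRequirements(requests={"storage": "100Gi"}),
    ),
)

Performance tuning has to happen around these primitives instead, via node pools, storage classes, node selectors, and taints and tolerations.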
Failure
Recovery: No matter how distributed a system is,
there's still the chance of a disaster that an environment wasn't designed to
survive (in CockroachDB's case, this would be losing a majority of the sites
supporting a cluster at once). A disaster recovery strategy to mitigate this possibility requires applications to reach outside the failure domain to store and retrieve backups, and this is a non-trivial activity both in Kubernetes and in the clouds in general.
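For CockroachDB specifically, that looks something like the sketch below: because the database speaks the PostgreSQL wire protocol, a client can drive a full backup to object storage that sits outside any one site's fault domain. The connection string and bucket here are hypothetical; BACKUP INTO is CockroachDB's own SQL.

import psycopg2  # CockroachDB is wire-compatible with PostgreSQL

conn = psycopg2.connect("postgresql://root@localhost:26257/defaultdb")
conn.autocommit = True  # BACKUP cannot run inside an explicit transaction
with conn.cursor() as cur:
    # Push the backup outside the cluster's own fault domain.
    cur.execute("BACKUP INTO 's3://example-backups/crdb?AUTH=implicit'")
conn.close()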
Observability
and Monitoring: When running multi-site applications,
it's important to be able to monitor the health, performance, and behavior of
the entire system in a way that allows administrators to intervene before
problems occur and do capacity planning to manage increased demands on the
system.
We are highly aware of this at Cockroach Labs, where Kubernetes is key to CockroachDB and our managed database services. Here is how we use Kubernetes as the fault domain for multi-region and multi-site clusters.
Landing the control plane
You may not think about the application you're building as global, but it is. The fact is that a deployment across two or three sites has the same challenges as a planet-spanning multi-regional deployment, so the application must be built with the same architectural primitives. Unless you have an extremely localized business model, you're going to be building this way, either now or in the near future. What's amazing about this is that the more distributed an active-everywhere workload becomes, the less expensive it is to survive any particular failure.
For example, in a traditional two-site
disaster recovery scenario, you have to have 2x of everything to be able to
continue to operate if you have a datacenter or region failure. With
CockroachDB distributed across three sites, you only need 1.5x of everything to
be able to operate without disruption. This is because even after losing a
site, you still have two more remaining.
The real cost of surviving a site failure goes down even more as you
spread an application across additional sites - for example, when spread across
five sites you only need to provision 1.25x the amount of infrastructure to be
able to continue operations undisrupted in the case of a site failure.
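The arithmetic behind those numbers is simple: if each of n sites carries 1/n of the workload, the survivors must absorb the whole thing after one site is lost, so the overprovisioning factor is n / (n - 1). A tiny sketch:

def overprovision_factor(sites: int) -> float:
    # Capacity multiple needed to survive the loss of one site.
    return sites / (sites - 1)

for n in (2, 3, 5):
    print(f"{n} sites -> {overprovision_factor(n):.2f}x capacity")
# 2 sites -> 2.00x  (traditional two-site DR)
# 3 sites -> 1.50x
# 5 sites -> 1.25x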
For multi-region or multi-site deployments, we recommend a Kubernetes cluster for each site, with CockroachDB spanning those sites. This is where, lacking the next-generation solution described above, we had to do a bunch of custom work for our managed CockroachDB service to be able to treat Kubernetes as the fault domain.
We solved this in CockroachDB Dedicated and
Serverless by building a control plane to manage this for us. We use the
networking in either Google or Amazon to allow for routing between Kubernetes
clusters in different regions, and then we use the control plane to apply all
of the security settings consistently and check to make sure they get updated.
It also provides us with the kinds of observability information we need to
support hundreds of clusters concurrently and help customers troubleshoot
issues they might be experiencing. The control plane does other things as well. We created a centralized key management store so administrator keys don't have to be shipped separately to each region. We've also spent a lot of time thinking about persistence.
Of course, what we've built is custom for CockroachDB, just like what anyone else would have to build today to manage these kinds of edge cases.
When we were initially building CockroachDB, the database itself, we talked about Kubernetes constantly, as even in a single-site configuration the marriage of CockroachDB and Kubernetes gives the database better resilience characteristics than the database alone. These days, if you look at cockroachlabs.com, we don't mention Kubernetes nearly as much. But internally it's still totally top of mind: all of CockroachDB Dedicated, all of CockroachDB Serverless, and a number of our self-hosted clusters run in Kubernetes. It's just that the control plane handles the complexity.
Conclusion
Hybrid, multi-region, and even multi-cloud
deployments are becoming not just increasingly common, but also increasingly
necessary for businesses needing to scale horizontally, guarantee availability,
and minimize latency. Kubernetes has the
potential to help us have a common operating experience across datacenters,
cloud regions, and even clouds by becoming the fault domain we design our
highly available (HA) applications to survive.
We believe that the best way to do that is to have a Kubernetes cluster in each location, and then have some sort of shared mechanism to wire them together effectively: to share security configuration, set up network routing, and put in place all of the other pieces needed to solve this next-generation problem. Every SaaS company on the planet, and every multinational company as well, has this exact problem. At Cockroach we see this every day, because we are the system-of-record database of choice for a lot of those companies and platforms.
This is the challenge now before the
Kubernetes ecosystem and community: inventing the mechanism that allows
Kubernetes to be distributed across multiple regions.
To learn more about containerized infrastructure and cloud native technologies, consider joining us at KubeCon + CloudNativeCon Europe 2022, May 16-20.
ABOUT THE AUTHOR
Keith McClellan, Director, Partner Solutions Engineering, Cockroach Labs
Keith McClellan is a dedicated advocate for distributed SQL and data on Kubernetes.