Virtualization Technology News and Information
Is Kubernetes the Next Fault Domain?

By Keith McClellan, Director, Partner Solutions Engineering, Cockroach Labs

Could Kubernetes be the next fault domain?

A fault domain is the area of a distributed system that is impacted when a critical piece of infrastructure or network service experiences problems. These days, almost all modern applications are architected as connected micro-services running in containers in a cloud environment. Large and small organizations alike now deploy thousands of containers every day - a complexity of scale almost incomprehensible to the human mind. The vast majority of them depend upon Kubernetes to orchestrate, automate, and manage all these workloads.

That is a critical function, of course, but a Kubernetes cluster is limited to a single datacenter or cloud region, and many of these applications need to be distributed as close to users as possible. Kubernetes has the potential to help us have a common operating experience across datacenters, cloud regions, and even clouds by becoming the fault domain we design our highly-available (HA) applications to survive. 

Let's say we want to build a three-region cluster. Without Kubernetes, even in a single cloud, that means managing all these VMs and setting up a bunch of scripts on each server to self heal. If the server gets shut down or restarts, we have to write a bunch of Terraform or Ansible (or Puppet or Chef or Pulumi) scripts to regenerate our servers. Then, if we want to be cross-cloud, we have to do all that stuff three different ways! We gotta know the AWS way of doing it. We gotta know the Azure way of doing it. We gotta know the Google way of doing it.

Using Kubernetes, though, the only things we need to know that are specific to Azure, AWS, or Google are how to get a Kubernetes cluster in each of them, and how to configure those clusters to provision infrastructure and communicate with each other, whether that's via private networking, a VPN, or TLS over the Internet. Once that's done, the rest of our administration work is largely the same, regardless of where our infrastructure lives.

Kubernetes effectively gives us a common operating system, regardless of where we're running infrastructure. So Kubernetes is acting as our OS and it's abstracting away the complexities of whatever availability zone (AZ) or region or cloud that we're running on. We have a common operating language regardless of where we're doing it, and we get all the great self-healing capabilities of Kubernetes.

This is great, but it is also how Kubernetes becomes the fault domain: because the perimeter of our Kubernetes cluster is now equal to the perimeter of the infrastructure that we sit on top of, we can treat each Kubernetes cluster as if it were a datacenter or cloud region for HA purposes. So if either the region fails or the Kubernetes cluster fails, our applications handle that failure the same way.

By making the two equivalent, we reduce the number of dimensions that we have to manage from an availability perspective. This dramatically simplifies the distributed application landscape because the Kubernetes cluster becomes the only fault domain that we have to think through.

The problem is that Kubernetes isn't really designed to be treated as a fault domain.

The next-generation problem for K8s

This is the next-generation problem that we now need to solve: For us to be able to easily treat Kubernetes as the fault domain for multi-region/multi-site clusters, Kubernetes itself needs to provide a number of additional constructs to facilitate this pattern, and the Kubernetes ecosystem and community have been chipping away at this problem for quite a while.

That has led to various different ways to purpose build multi-region solutions for a particular application or application stack, but there is not yet a single unified strategy or solution to this problem area.

The most significant of these bespoke solution areas are networking and security, but there are also needs in the area of infrastructure, failure recovery, observability, and monitoring. Networking, because of the need for connectivity between clusters and then service discovery and traffic routing between those clusters. And security, because you need to make sure you don't have access sprawl, and you need a central trust authority.

Networking: Currently there are cross-cluster communication platforms like Cilium, Project Calico, Submariner and Skupper. Each has its pluses and minuses, but none of them seems to be the one-size-fits-all solution that you'd hope would exist.

Load Balancing and Service Discovery: Once the clusters can talk to each other, they need to be able to discover instances of different services running cross-site. And there's a need for global load balancing that allows users to be routed to the closest available instances regardless of where they enter an application.
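
As a rough illustration of the routing decision described here, the sketch below picks the closest healthy site for a request. The Site type, the latency_ms field, and the site names are hypothetical and chosen only for this example; a real global load balancer would gather health and latency from its own probes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Site:
    """Hypothetical view of one Kubernetes cluster as seen by a global router."""
    name: str
    latency_ms: float  # assumed round-trip time from the user's entry point
    healthy: bool      # assumed result of a health probe


def pick_site(sites: list[Site]) -> Optional[Site]:
    """Return the healthy site with the lowest latency, or None if every site is down."""
    candidates = [s for s in sites if s.healthy]
    return min(candidates, key=lambda s: s.latency_ms, default=None)


sites = [
    Site("us-east", latency_ms=12.0, healthy=False),  # closest, but unavailable
    Site("us-west", latency_ms=70.0, healthy=True),
    Site("eu-west", latency_ms=95.0, healthy=True),
]
chosen = pick_site(sites)
print(chosen.name if chosen else "no site available")
```

Because an unhealthy site is simply filtered out, losing a whole cluster and losing a whole region look identical to the router, which is the property the fault-domain pattern relies on.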

Security: If you're administering a K8s cluster, generally speaking, you are going to have pretty low-level security permissions. You effectively need to have the same level of security permissions in each cluster, and if you're not really careful about managing it you can end up having more or less security than you need in a particular cluster. Unfortunately, security management in K8s is often still a pretty manual process and this becomes more difficult the more distributed your application gets.
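
One way to picture the "more or less security than you need" problem is as drift from a shared baseline. The sketch below is a simplified model, not the Kubernetes RBAC API: permissions are represented as plain strings, and the cluster names and the permission_drift helper are hypothetical.

```python
def permission_drift(
    expected: set[str],
    clusters: dict[str, set[str]],
) -> dict[str, dict[str, set[str]]]:
    """Report, per cluster, grants that are missing from or extra to the baseline.

    Clusters that match the baseline exactly are left out of the report.
    """
    report: dict[str, dict[str, set[str]]] = {}
    for name, actual in clusters.items():
        missing = expected - actual  # less access than intended
        extra = actual - expected    # more access than intended (sprawl)
        if missing or extra:
            report[name] = {"missing": missing, "extra": extra}
    return report


# Hypothetical baseline and per-cluster grants, written as "subject:verb:resource".
baseline = {"ops:get:pods", "ops:delete:pods"}
observed = {
    "us-east": {"ops:get:pods", "ops:delete:pods"},
    "eu-west": {"ops:get:pods"},
    "ap-south": {"ops:get:pods", "ops:delete:pods", "ops:get:secrets"},
}
drift = permission_drift(baseline, observed)
print(sorted(drift))
```

Running a check like this on a schedule turns a manual review into something that can be alerted on, at the cost of having to keep the baseline itself accurate.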

Trust and Identity: Right now, sharing a single trust and identity source across multiple Kubernetes clusters is a somewhat painful exercise, which exacerbates some of the other security issues you might run into while running an application across multiple Kubernetes clusters. This becomes even more important when you have interaction between pods across sites, where an administrator may need to be connected to multiple Kubernetes clusters concurrently for troubleshooting purposes.

Infrastructure and Performance: Currently, the Kubernetes primitives that provision pods only let you declare "how much" of something you get, without consideration for how performant that infrastructure is - for example, you can ask for a CPU, or a volume of a specific size, but you can't request a particular processor or guarantee the performance characteristics of a drive.  This means each site has to be carefully tuned and monitored to make sure you don't have performance hot or cold spots.

Failure Recovery: No matter how distributed a system is, there's still the chance of a disaster that an environment wasn't designed to survive (in CockroachDB's case, this would be losing a majority of the sites supporting a cluster at once). A disaster recovery strategy to mitigate this possibility requires applications to reach outside the failure domain to store and retrieve backups and this is a non-trivial activity both in Kubernetes and for the clouds in general.
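
The "losing a majority of the sites" condition can be sketched as a simple majority check. This is a simplified model that assumes one equally weighted voting site per location; the function names are illustrative and do not come from CockroachDB.

```python
def majority(sites: int) -> int:
    """Smallest number of sites that forms a strict majority."""
    if sites < 1:
        raise ValueError("need at least one site")
    return sites // 2 + 1


def survives(sites: int, failed: int) -> bool:
    """True if the sites that remain after `failed` losses still form a majority."""
    if not 0 <= failed <= sites:
        raise ValueError("failed must be between 0 and sites")
    return sites - failed >= majority(sites)


for total, lost in [(3, 1), (3, 2), (5, 2), (5, 3)]:
    print(total, lost, survives(total, lost))
```

Under this model a three-site deployment rides out one site failure but not two, which is why backups have to live outside the failure domain for the cases the quorum cannot absorb.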

Observability and Monitoring: When running multi-site applications, it's important to be able to monitor the health, performance, and behavior of the entire system in a way that allows administrators to intervene before problems occur and do capacity planning to manage increased demands on the system.

We are highly aware of this at Cockroach Labs, where Kubernetes is key to CockroachDB and our managed database services. Here is how we use Kubernetes as the fault domain for multi-region clusters or multi-site clusters.

Landing the control plane

You may not think about the application you're building as global, but it is. The fact is that a deployment across two or three sites has the same challenges as a planet-spanning multi-regional deployment, so the application must be built with the same architectural primitives. Unless you have an extremely localized business model you're going to be building this way, either now or in the near future.  What's amazing about this is the more distributed an active-everywhere workload becomes, the less expensive it is to survive any particular failure.  

For example, in a traditional two-site disaster recovery scenario, you have to have 2x of everything to be able to continue to operate if you have a datacenter or region failure. With CockroachDB distributed across three sites, you only need 1.5x of everything to be able to operate without disruption. This is because even after losing a site, you still have two more remaining.  The real cost of surviving a site failure goes down even more as you spread an application across additional sites - for example, when spread across five sites you only need to provision 1.25x the amount of infrastructure to be able to continue operations undisrupted in the case of a site failure.
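
The arithmetic behind these figures can be sketched as follows, assuming load is spread evenly across identical sites and the survivors must absorb the full load after a failure. The function name and the even-spread assumption are illustrative simplifications.

```python
from fractions import Fraction


def overprovision_factor(sites: int, failures: int = 1) -> Fraction:
    """Capacity to provision, as a multiple of steady-state load, so that the
    surviving sites can carry everything after `failures` sites are lost.

    With N sites and F tolerated failures this is N / (N - F).
    """
    if failures < 0 or sites <= failures:
        raise ValueError("need more sites than tolerated failures")
    return Fraction(sites, sites - failures)


for n in (2, 3, 5):
    print(n, float(overprovision_factor(n)))
```

For two, three, and five sites this gives 2x, 1.5x, and 1.25x respectively, matching the figures quoted above.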

When you're multi-region/multi-site, we recommend a Kubernetes cluster for each site and then we span CockroachDB across those sites. Which is where, lacking this next-generation solution, we had to do a bunch of custom stuff for our managed CockroachDB service to be able to treat Kubernetes as the fault domain.

We solved this in CockroachDB Dedicated and Serverless by building a control plane to manage this for us. We use the networking in either Google or Amazon to allow for routing between Kubernetes clusters in different regions, and then we use the control plane to apply all of the security settings consistently and check to make sure they get updated. It also provides us with the kinds of observability information we need to support hundreds of clusters concurrently and help customers troubleshoot issues they might be experiencing. The control plane does other things, as well. We created a centralized key management store so administrator keys don't have to be shipped individually to each region. We've also spent a lot of time thinking about persistence.

Of course, what we've built is custom for CockroachDB, just like what anyone else would have to build today to manage these kinds of edges.

As we were initially building CockroachDB, the database itself, we talked about Kubernetes constantly, because even in a single-site configuration the marriage of CockroachDB and Kubernetes gives the database better resilience characteristics than the database alone. These days, we don't mention Kubernetes nearly as much. But internally it's still top of mind: all of CockroachDB Dedicated, all of CockroachDB Serverless, and a number of our self-hosted clusters run in Kubernetes. It's just that the control plane handles the complexity.


Hybrid, multi-region, and even multi-cloud deployments are becoming not just increasingly common, but also increasingly necessary for businesses needing to scale horizontally, guarantee availability, and minimize latency.  Kubernetes has the potential to help us have a common operating experience across datacenters, cloud regions, and even clouds by becoming the fault domain we design our highly-available (HA) applications to survive.

We believe that the best way to do that is to have a Kubernetes cluster in each location, and then have some sort of shared mechanism to wire them together effectively. That mechanism needs to share security configuration information, set up network routing, and put in place all of the other pieces needed to solve this next-generation problem. Every SaaS company on the planet, and every multinational company as well, has this exact problem. At Cockroach Labs we see this every day, because we are the system-of-record database of choice for a lot of those companies and platforms.

This is the challenge now before the Kubernetes ecosystem and community: inventing the mechanism that allows Kubernetes to be distributed across multiple regions.




Keith McClellan is a dedicated advocate for distributed SQL and data on Kubernetes.

Published Wednesday, May 04, 2022 7:31 AM by David Marshall