Virtualization Technology News and Information
Going beyond StatefulSets with Kubernetes Operators

By Sebastien Guilloux, Senior Software Engineer, Elastic 

The rapid rise in popularity of Kubernetes has led us to deploy not only stateless workloads but also more and more stateful ones, even though Kubernetes was initially designed for stateless applications. Because Pods can be created or removed on demand, meeting the production constraints of stateful applications such as databases is a challenge. Availability, consistency, and resiliency in particular require careful thought.

StatefulSets and PersistentVolumes for stateful workloads

Bringing Pods up and down is not that simple with stateful systems: each node has an identity and some data attached to it. Suddenly removing a Pod likely means losing its data, and disrupting the system.

StatefulSets were introduced to solve that particular problem. They associate a set of Pods (nodes of a distributed system) with PersistentVolumes that persist the application data. Each volume is bound to a Pod, whose identity is preserved throughout the entire lifecycle of the StatefulSet. For example, volume-1 may be bound to pod-1. Due to that strict relationship, if pod-1 is recreated, it will automatically be bound to volume-1 again. From the database's perspective, recreating a Pod is therefore analogous to restarting the database node.
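The stable binding rests on a naming convention: Pods are named after the StatefulSet plus an ordinal, and each PersistentVolumeClaim is named after its volume claim template plus the Pod name. Here is a minimal sketch of that convention; the StatefulSet name "es-data" and the template name "data" are made up for illustration.

```python
def pod_name(statefulset: str, ordinal: int) -> str:
    """Pods in a StatefulSet are named <statefulset>-<ordinal>."""
    return f"{statefulset}-{ordinal}"


def claim_name(template: str, statefulset: str, ordinal: int) -> str:
    """Each PersistentVolumeClaim is named <template>-<pod name>."""
    return f"{template}-{pod_name(statefulset, ordinal)}"


# Hypothetical names: a StatefulSet "es-data" with a volume claim template "data".
for ordinal in range(3):
    print(pod_name("es-data", ordinal), "->", claim_name("data", "es-data", ordinal))
```

Because both names are derived from the ordinal alone, a recreated Pod computes the same claim name and finds its previous data.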

To ensure smooth upgrades, upscales, and downscales, StatefulSets can be configured so that Pods get processed sequentially, in a rolling fashion: pod-0 is created first, then once it is ready, pod-1 is created, and so on. The same applies to upgrades, in reverse order: on any change in the Pod specification, pod-2 is recreated first, then once it is ready, pod-1 is recreated, and so on.
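The two orderings can be sketched in a few lines. This is only an illustration of the sequence described above, not the StatefulSet controller's actual code, and the function names are hypothetical.

```python
from typing import List


def creation_order(name: str, replicas: int) -> List[str]:
    """Pods are created from the lowest ordinal to the highest."""
    return [f"{name}-{i}" for i in range(replicas)]


def rolling_update_order(name: str, replicas: int) -> List[str]:
    """Pods are recreated from the highest ordinal to the lowest."""
    return [f"{name}-{i}" for i in reversed(range(replicas))]


print(creation_order("pod", 3))
print(rolling_update_order("pod", 3))
```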

Day-2 operations: scaling, config changes, and upgrades

Coming up with the right StatefulSet specification to run your stateful distributed workload is one thing. Ensuring it runs for years in a production environment, scales up and down, and allows configuration changes and version upgrades, is a different story. We want to be able to make changes to the system while preserving consistency and availability. While this is possible with most distributed systems, they still need to be operated the right way.

Let's take Elasticsearch as an example. Say we want to downscale a cluster from 10 nodes to 3 nodes. Suddenly removing 7 nodes at once is a potentially destructive operation; it may lead to the loss of all copies of some data, or to the disruption of the quorum of master nodes. A better way to handle node removal is to first migrate data from the nodes to be removed to other nodes in the system, and only then delete the Pods. Data migration can easily be achieved through a call to the Elasticsearch API, but the StatefulSet controller is necessarily generic and cannot know about every application's way of managing data migration and replication.

Moreover, a single Elasticsearch cluster may require multiple StatefulSets: master nodes, hot data nodes, warm data nodes, nodes with different zone awareness affinity settings, and so on. A single StatefulSet has a single Pod template shared by all of its Pods; hence it cannot cover multiple Elasticsearch node topologies. In that case, multiple StatefulSets need to be operated at once.


Operators know better

Of course, it is possible to operate stateful workloads using StatefulSets only. In practice, however, they are rarely enough to handle complex distributed systems in production environments. This is where Kubernetes Operators come into play.

Operators are processes that interact with the Kubernetes API to manage a set of resources, similar to Kubernetes built-in controllers. They encode product-specific knowledge, and can implement the same logic a human operator would use to operate the application. Unsurprisingly, the most popular Operators focus on distributed databases.

Let's reuse the Elasticsearch downscale example above. Before decreasing the number of replicas in a StatefulSet, the Operator calls the Elasticsearch API to migrate all shards away from the nodes to remove. It checks the progress of that migration, and eventually decreases the number of replicas in the StatefulSet so the Pod gets deleted. In case the node to remove is a master node, we want to properly remove it from the existing quorum of master nodes and ensure no disruption. The Operator excludes that node from voting using another Elasticsearch API, so the quorum of master nodes can prepare for the modified topology.
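The downscale sequence above can be sketched with an in-memory stand-in for Elasticsearch. Everything here is hypothetical (the FakeElasticsearch class, its methods, the one-node-per-pass policy); a real Operator would call the cluster settings API for allocation filtering and the voting configuration exclusions API instead.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class FakeElasticsearch:
    """In-memory stand-in for the Elasticsearch API (illustration only)."""
    shards: Dict[str, int]                 # node name -> number of shards held
    masters: Set[str] = field(default_factory=set)
    no_allocation: Set[str] = field(default_factory=set)
    no_voting: Set[str] = field(default_factory=set)

    def exclude_from_allocation(self, nodes: Iterable[str]) -> None:
        self.no_allocation |= set(nodes)

    def exclude_from_voting(self, nodes: Iterable[str]) -> None:
        self.no_voting |= set(nodes) & self.masters

    def tick(self) -> None:
        """Simulate relocation: each excluded node hands over one shard."""
        targets = [n for n in self.shards if n not in self.no_allocation]
        for node in self.no_allocation:
            if self.shards.get(node, 0) > 0 and targets:
                self.shards[node] -= 1
                self.shards[min(targets, key=self.shards.get)] += 1


def reconcile_downscale(es: FakeElasticsearch, sts: str,
                        current: int, desired: int) -> int:
    """Return the replica count to set on the StatefulSet after this pass."""
    if current <= desired:
        return current
    leaving = [f"{sts}-{i}" for i in range(desired, current)]
    es.exclude_from_allocation(leaving)    # start migrating shards away
    last = leaving[-1]                     # highest ordinal is removed first
    es.exclude_from_voting([last])         # let the master quorum adapt
    if es.shards.get(last, 0) > 0:
        return current                     # migration in progress: retry later
    return current - 1                     # safe to delete one Pod


es = FakeElasticsearch(shards={"es-0": 2, "es-1": 2, "es-2": 2},
                       masters={"es-0", "es-1", "es-2"})
replicas = 3
while replicas > 1:
    replicas = reconcile_downscale(es, "es", replicas, 1)
    es.tick()
print(replicas, es.shards)
```

The replica count only decreases once the node about to be removed holds no shards, so no pass ever deletes a Pod that still carries data.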

When dealing with rolling upgrades and version upgrades, the Operator knows best which nodes should be rotated first across the various StatefulSets forming the application. The StatefulSet update strategy can be configured. For example, with the "OnDelete" strategy, the StatefulSet controller does not recreate Pods on its own when the specification changes, so the Operator decides which Pod to upgrade first, regardless of the StatefulSet default ordering constraints.
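As a rough sketch of that decision, the function below picks the next Pod to delete across several StatefulSets. The rotation priority (warm data nodes, then hot data nodes, then masters) and the flat Pod dictionaries are assumptions made for this example; in a real cluster the revisions would typically come from the Pods' controller-revision-hash label and the StatefulSet's status.updateRevision field.

```python
from typing import Dict, List, Optional

# Assumed priority for this sketch: data tiers first, master nodes last.
ROTATION_ORDER = ["warm", "hot", "master"]


def next_pod_to_delete(pods: List[dict],
                       target_revisions: Dict[str, str]) -> Optional[str]:
    """Pick one outdated Pod to delete, or None if we should wait."""
    # Rotate a single Pod at a time: wait until every Pod is Ready again.
    if not all(p["ready"] for p in pods):
        return None
    outdated = [p for p in pods
                if p["revision"] != target_revisions[p["statefulset"]]]
    if not outdated:
        return None
    outdated.sort(key=lambda p: (ROTATION_ORDER.index(p["statefulset"]),
                                 p["name"]))
    return outdated[0]["name"]


pods = [
    {"name": "master-0", "statefulset": "master", "revision": "v1", "ready": True},
    {"name": "hot-0", "statefulset": "hot", "revision": "v1", "ready": True},
    {"name": "hot-1", "statefulset": "hot", "revision": "v2", "ready": True},
    {"name": "warm-0", "statefulset": "warm", "revision": "v1", "ready": True},
]
targets = {"master": "v2", "hot": "v2", "warm": "v2"}
print(next_pod_to_delete(pods, targets))
```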

The reconciliation loop

Similar to the Kubernetes built-in controllers that heavily inspired them, Operators implement the reconciliation loop pattern. They watch for changes to the Kubernetes resources they are interested in (for example, an Elasticsearch resource being modified, or a Pod getting deleted). Those events trigger a reconciliation process, during which the Operator ensures everything is set up correctly for the application to run.

During that reconciliation, the Operator computes the expected Kubernetes resources (for example, a StatefulSet with 3 replicas for Elasticsearch master nodes), and compares them against the actual resources that currently exist in the cluster (a StatefulSet with 1 replica for Elasticsearch master nodes). Based on that difference, it decides to update the actual resources so they eventually converge towards the expected ones (upscale from 1 to 3 master nodes). In this example, the Operator first updates the StatefulSet to have 2 replicas. Once the Elasticsearch cluster has fully acknowledged the new master node and is ready to accept the third master node, the Operator updates the StatefulSet again to have 3 replicas.
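The comparison between expected and actual resources boils down to a diff. The sketch below works on plain dictionaries keyed by a made-up "Kind/name" string; the resource names and fields are hypothetical, and a real Operator would compare full Kubernetes objects.

```python
from typing import Dict, List, Tuple


def diff_resources(expected: Dict[str, dict],
                   actual: Dict[str, dict]) -> Tuple[List[str], List[str], List[str]]:
    """Return the resources to create, to update, and to delete."""
    to_create = sorted(expected.keys() - actual.keys())
    to_delete = sorted(actual.keys() - expected.keys())
    to_update = sorted(name for name in expected.keys() & actual.keys()
                       if expected[name] != actual[name])
    return to_create, to_update, to_delete


expected = {
    "StatefulSet/es-masters": {"replicas": 3},
    "Service/es-http": {"port": 9200},
    "ConfigMap/es-config": {"cluster.name": "demo"},
}
actual = {
    "StatefulSet/es-masters": {"replicas": 1},
    "ConfigMap/es-config": {"cluster.name": "demo"},
    "Secret/old-certs": {},
}
print(diff_resources(expected, actual))
```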

Reconciliations are designed to be short lived, since they are executed frequently based on Kubernetes resource updates. It is fine to exit the reconciliation early if a condition hasn't been met yet and make further progress at the next reconciliation. To continue with the Elasticsearch upscale example:

  • A first reconciliation happens when the user requests 3 master nodes. That reconciliation leads to updating the StatefulSet with 2 replicas.
  • A second reconciliation happens when the Pod matching the second replica is actually created. At this point, it is safer to wait for that Pod to be fully ready before adding a third master node; we don't want to disrupt the quorum by adding more than half the master nodes at once.
  • A third reconciliation happens when the Pod matching the second replica reaches a Ready status. The Operator can now update the StatefulSet to have 3 replicas.
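Those three reconciliations can be sketched as a single short-lived function that exits early whenever a condition is not met yet. The Observed type and its fields are hypothetical simplifications of what an Operator would read from the StatefulSet and Pod statuses.

```python
from dataclasses import dataclass


@dataclass
class Observed:
    replicas: int       # replicas currently set on the StatefulSet
    pods_created: int   # Pods that exist
    pods_ready: int     # Pods reporting a Ready status


def reconcile(desired: int, obs: Observed) -> int:
    """Return the replica count to set on the StatefulSet after one pass."""
    if obs.replicas >= desired:
        return obs.replicas        # already converged, nothing to do
    if obs.pods_created < obs.replicas or obs.pods_ready < obs.replicas:
        return obs.replicas        # exit early, the next event retriggers us
    return obs.replicas + 1        # add a single master node at a time


# First reconciliation: the user requests 3 master nodes.
print(reconcile(3, Observed(replicas=1, pods_created=1, pods_ready=1)))
# Second reconciliation: the second Pod exists but is not Ready yet.
print(reconcile(3, Observed(replicas=2, pods_created=2, pods_ready=1)))
# Third reconciliation: the second Pod is Ready.
print(reconcile(3, Observed(replicas=2, pods_created=2, pods_ready=2)))
```

The function holds no state between calls: each pass decides from what it observes, which is what makes exiting early safe.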

The reconciliation algorithm is not only about updating StatefulSets. The Operator also cares about Services, ConfigMaps, Secrets, and all of the other resources required for the application to run correctly.


Simplifying with Custom Resource Definitions

On top of managing a whole bunch of Kubernetes resources, Operators generally provide their own Custom Resource Definitions (CRDs). They extend the existing set of resources in the Kubernetes API (Pods, StatefulSets, Services, etc.) with additional resources to simplify the application deployment. Thanks to those additional resources, the user can specify the topology and configuration of a database in a few lines of YAML. Easier to understand, easier to maintain.
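To give a feel for the simplification, here is a sketch of a compact custom resource, written as a Python dictionary rather than YAML, expanded into one StatefulSet per node set. The resource shape is only loosely modeled on what an Elasticsearch Operator might accept; the apiVersion, field names, and image registry are placeholders, not a real API.

```python
from typing import List

IMAGE = "registry.example/elasticsearch"  # placeholder registry


def expand(resource: dict) -> List[dict]:
    """Build one StatefulSet manifest per node set of the custom resource."""
    name = resource["metadata"]["name"]
    version = resource["spec"]["version"]
    statefulsets = []
    for node_set in resource["spec"]["nodeSets"]:
        sts_name = f"{name}-es-{node_set['name']}"
        labels = {"cluster": name, "node-set": node_set["name"]}
        statefulsets.append({
            "apiVersion": "apps/v1",
            "kind": "StatefulSet",
            "metadata": {"name": sts_name, "labels": labels},
            "spec": {
                "replicas": node_set["count"],
                "serviceName": sts_name,
                "selector": {"matchLabels": labels},
                "updateStrategy": {"type": "OnDelete"},
                "template": {
                    "metadata": {"labels": labels},
                    "spec": {"containers": [{
                        "name": "elasticsearch",
                        "image": f"{IMAGE}:{version}",
                    }]},
                },
            },
        })
    return statefulsets


# What the user writes: a handful of lines (hypothetical resource).
resource = {
    "apiVersion": "example.org/v1",
    "kind": "Elasticsearch",
    "metadata": {"name": "quickstart"},
    "spec": {
        "version": "7.9.3",
        "nodeSets": [
            {"name": "masters", "count": 3},
            {"name": "hot", "count": 5},
        ],
    },
}
for sts in expand(resource):
    print(sts["metadata"]["name"], sts["spec"]["replicas"])
```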

Of course, the simplification achieved from a lightweight CRD design comes with a cost. Advanced Kubernetes users may want to tweak settings they know are available in the underlying Kubernetes resources: Pod labels, affinity rules, init containers, etc. A good CRD design aims at keeping things simple by providing good defaults, but also needs to empower advanced users by allowing them to override the settings they need.
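One common way to combine good defaults with user overrides is a recursive merge of the user-provided Pod template on top of the Operator's defaults. The sketch below is one possible approach, not a description of how any particular Operator does it, and the default values are invented.

```python
import copy


def merge(defaults: dict, overrides: dict) -> dict:
    """Recursively merge overrides on top of defaults, without mutating either."""
    result = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            # Scalars and lists from the user replace the default wholesale.
            result[key] = copy.deepcopy(value)
    return result


# Invented defaults an Operator might apply to every Pod.
default_template = {
    "metadata": {"labels": {"app": "elasticsearch"}},
    "spec": {"terminationGracePeriodSeconds": 180},
}
# Settings an advanced user wants to control.
user_template = {
    "metadata": {"labels": {"team": "search"}},
    "spec": {"terminationGracePeriodSeconds": 300,
             "initContainers": [{"name": "sysctl"}]},
}
print(merge(default_template, user_template))
```

Users who set nothing get the defaults untouched, while advanced users only spell out the fields they care about.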

StatefulSets + Operators

In summary, StatefulSets are great building blocks for running stateful workloads on Kubernetes. It makes sense for Operators to rely on them and benefit from the Pod-to-volume mapping handled by Kubernetes. However, Operators can go much further than StatefulSets alone can offer. They improve the user experience through simplified custom resources, and they implement their own orchestration logic, which involves interacting with the running application.

Each new version of Kubernetes brings a lot of new features, and Operators evolve along the way. For example, version 1.11 enabled PersistentVolume expansion by default (as a beta feature), which is especially interesting in the context of running stateful workloads. Unfortunately, this is not available yet through StatefulSets, whose volume claim templates are immutable. Since Operators can access the same underlying resources the StatefulSet controller manipulates, they can work around this limitation by interacting with PersistentVolumeClaims directly. A few Operators (ECK, TiDB) already handle it, going a bit beyond what's currently possible with StatefulSets alone.
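As a sketch of that workaround, the function below computes the patch to send for each PersistentVolumeClaim whose requested storage is smaller than the desired size. The claim names are made up, the quantity parser only understands the Gi suffix, and actually applying the patches (as well as checking that the storage class allows expansion) is left out.

```python
from typing import Dict


def gib(quantity: str) -> int:
    """Parse a quantity such as "10Gi" (only the Gi suffix is handled here)."""
    if not quantity.endswith("Gi"):
        raise ValueError(f"unsupported quantity: {quantity}")
    return int(quantity[:-2])


def resize_patches(claims: Dict[str, str], desired: str) -> Dict[str, dict]:
    """Return one patch body per claim that needs to grow to the desired size."""
    patches = {}
    for name, current in claims.items():
        # Volumes can only grow, so smaller or equal requests are skipped.
        if gib(current) < gib(desired):
            patches[name] = {
                "spec": {"resources": {"requests": {"storage": desired}}}
            }
    return patches


# Hypothetical claims: claim name -> currently requested storage.
claims = {"data-es-data-0": "10Gi", "data-es-data-1": "20Gi"}
print(resize_patches(claims, "20Gi"))
```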


About the Author

Sebastien Guilloux is a senior software engineer at Elastic. He has spent most of his career working with distributed systems, building resilient applications, and orchestrating Apache Kafka and Elasticsearch nodes around the world. He currently works on writing a Kubernetes Operator for the Elastic Stack, Elastic Cloud on Kubernetes (ECK).

Published Thursday, October 29, 2020 7:34 AM by David Marshall