By Sebastien Guilloux, Senior Software
Engineer, Elastic
The rapid rise in popularity of Kubernetes has led us all to
deploy not only stateless workloads, but more and more stateful ones as well,
despite its initial design for stateless applications. With Pods that can be
created or removed on demand, meeting the production constraints of stateful
applications like databases is a challenge. Availability, consistency, and
resiliency in particular require careful thought.
StatefulSets and PersistentVolumes for stateful workloads
Bringing Pods up and down is not that simple
with stateful systems: each node has an identity and some data attached to it.
Suddenly removing a Pod likely means losing its data and disrupting the
system.
StatefulSets were introduced to solve that
particular problem. They associate a set of Pods (nodes of a distributed
system) with PersistentVolumes that persist the application data. Each volume
is bound to a Pod, whose identity is preserved throughout the entire lifecycle
of the StatefulSet. For example, volume-1
may be bound to pod-1. Due to that
strict relationship, if pod-1 is
recreated, it will automatically be bound to volume-1 again. From a database perspective,
recreating a Pod is analogous to restarting the database node.
In order to ensure smooth upgrades, upscales,
and downscales, StatefulSets can be configured to ensure Pods get processed
sequentially, in a rolling fashion. pod-0
is created first, then once ready, pod-1
is created, etc. This also applies to upgrades, in reverse order: on any
change in the Pod specification, pod-2
is recreated first, then once ready, pod-1
is recreated, etc.
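A minimal StatefulSet sketch illustrating both behaviors might look like the following; the names, image version, and sizes are hypothetical, not taken from any real deployment:

```yaml
# Hypothetical sketch: 3 Pods with stable identities (es-data-0, es-data-1,
# es-data-2), each bound to its own PersistentVolumeClaim.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: es-data
spec:
  serviceName: es-data
  replicas: 3
  podManagementPolicy: OrderedReady   # default: bring Pods up one at a time, in order
  updateStrategy:
    type: RollingUpdate               # default: recreate Pods in reverse ordinal order
  selector:
    matchLabels:
      app: es-data
  template:
    metadata:
      labels:
        app: es-data
    spec:
      containers:
      - name: elasticsearch
        image: docker.elastic.co/elasticsearch/elasticsearch:7.9.2
  volumeClaimTemplates:               # data-es-data-0 stays bound to es-data-0, and so on
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
```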
Day-2 operations: scaling, config
changes, and upgrades
Coming up with the right StatefulSet specification to run
your stateful distributed workload is one thing. Ensuring it runs for years in
a production environment, scales up and down, and allows configuration changes
and version upgrades, is a different story. We want to be able to make changes
to the system while preserving consistency and availability. While this is
possible with most distributed systems, they still need to be operated the
right way.
Let's take Elasticsearch as an example. Say we want to
downscale a cluster from 10 nodes to 3 nodes. Suddenly removing 7 nodes at once
is a potentially destructive operation; it may lead to the loss of all copies
of the data, or disrupt the master node quorum. A better way to
handle node removal would be to first migrate data from the node to be removed
to other nodes in the system, before performing the actual Pod deletion. Data
migration can easily be achieved through a call to the Elasticsearch API, but the
StatefulSet controller is necessarily generic and cannot know about
every application-specific way to manage data migration and replication.
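Elasticsearch exposes this through its cluster settings API: an allocation exclusion tells the cluster to move all shards away from the named nodes. The node name below is a made-up example:

```
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._name": "es-data-9"
  }
}
```

The migration can then be monitored until no shards remain on that node, at which point the Pod can safely be deleted.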
Moreover, a single Elasticsearch cluster may require
multiple StatefulSets: master nodes, hot data nodes, warm data nodes, zone
awareness affinity settings, and so on. The specification of a single
StatefulSet is largely immutable and holds a single Pod template; hence it
cannot cover multiple Elasticsearch node topologies. In that case, multiple
StatefulSets need to be operated at once.
Operators know better
Of course, it is possible to operate stateful workloads
using StatefulSets only. In practice, however, they are rarely enough to handle
complex distributed systems in production environments. This is where Kubernetes Operators come into play.
Operators are processes that interact with the Kubernetes
API to manage a set of resources, similar to Kubernetes built-in controllers.
They encode product-specific knowledge, and can implement the same logic a
human operator would use to operate the application. Unsurprisingly, the most
popular Operators focus on distributed databases.
Let's reuse the Elasticsearch downscale example above.
Before decreasing the number of replicas in a StatefulSet, the Operator calls
the Elasticsearch API to migrate all
shards away from the nodes to remove. It checks the progress of that migration,
and eventually decreases the number of replicas in the StatefulSet so the Pod
gets deleted. If the node to be removed is a master node, we also want to properly
remove it from the existing quorum of master nodes and ensure no disruption.
The Operator excludes that node from voting using another Elasticsearch API, so the quorum of
master nodes can prepare for the modified topology.
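That voting exclusion is exposed as a dedicated Elasticsearch API; in recent 7.x versions it takes the node names as a query parameter (the node name here is again hypothetical):

```
POST _cluster/voting_config_exclusions?node_names=es-master-2
```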
When dealing with rolling upgrades and version upgrades,
the Operator knows best which nodes should be rotated first across the various
StatefulSets forming the application. The StatefulSet update strategy can be
configured accordingly: with the "OnDelete" strategy, the Operator decides
which Pod to upgrade first, regardless of the StatefulSet's default ordering
constraints.
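In the StatefulSet specification, that is a one-line fragment:

```yaml
# Fragment: opt out of the controller's ordered rolling upgrades, so the
# Operator can delete (and thereby upgrade) Pods in the order it chooses.
spec:
  updateStrategy:
    type: OnDelete
```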
The reconciliation loop
Similar to the Kubernetes built-in controllers from which
they are highly inspired, Operators implement the reconciliation loop pattern.
They watch changes on Kubernetes resources they are interested in (for example,
an Elasticsearch resource being modified, or a Pod getting deleted). Those
events trigger a reconciliation process, during which the Operator ensures
everything is set up correctly for the application to run.
During that reconciliation, the Operator computes the
expected Kubernetes resources (for example, a StatefulSet with 3 replicas for
Elasticsearch master nodes), and compares them against the actual resources that
currently exist in the cluster (a StatefulSet with 1 replica for Elasticsearch
master nodes). Based on that difference, it decides to update the actual
resources so they eventually converge towards the expected ones (upscale from 1
to 3 master nodes). In this example, the Operator first updates the StatefulSet
to have 2 replicas. Once the Elasticsearch cluster has fully acknowledged the
new master node and is ready to accept the third master node, the Operator
updates the StatefulSet again to have 3 replicas.
Reconciliations are designed to be short-lived, since they
are executed frequently based on Kubernetes resource updates. It is fine to
exit the reconciliation early if a condition hasn't been met yet and make
further progress at the next reconciliation. To continue with the Elasticsearch
upscale example:
- A first reconciliation happens when the user
requests 3 master nodes. That reconciliation leads to updating the StatefulSet
with 2 replicas.
- A second reconciliation happens when the Pod
matching the second replica is actually created. At this point, it is safer to
wait for that Pod to be fully ready before adding a third master node; we don't
want to disrupt the quorum by adding more than half the master nodes at once.
- A third reconciliation happens when the Pod matching
the second replica reaches a Ready status.
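The upscale steps above can be sketched as a small, idempotent decision function. This is a simplified illustration of the pattern, not ECK's actual code; nextReplicas is a hypothetical helper:

```go
package main

import "fmt"

// nextReplicas decides the StatefulSet replica count for one reconciliation
// of an upscale: add at most one master node per pass, and only once all
// Pods created so far are ready. Otherwise it exits early, leaving further
// progress to a later reconciliation.
func nextReplicas(actual, desired int32, allPodsReady bool) int32 {
	if actual >= desired {
		return desired // nothing to do (downscales are handled elsewhere)
	}
	if !allPodsReady {
		return actual // early exit: wait for the next reconciliation
	}
	return actual + 1 // one node at a time, to preserve the master quorum
}

func main() {
	// The user asks for 3 master nodes while 1 exists and is ready.
	fmt.Println(nextReplicas(1, 3, true))  // first reconciliation: go to 2
	fmt.Println(nextReplicas(2, 3, false)) // second: new Pod not ready, stay at 2
	fmt.Println(nextReplicas(2, 3, true))  // third: Pod ready, go to 3
}
```

Because each call is a pure function of the observed state, re-running the reconciliation after any event is always safe.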
The reconciliation algorithm is not only about updating
StatefulSets. The Operator also cares about Services, ConfigMaps, Secrets, and
all of the other resources required for the application to run correctly.
Simplifying with Custom Resource Definitions
On top of managing a whole bunch of Kubernetes resources,
Operators generally provide their own Custom Resource Definition (CRD). They
extend the existing set of resources in the Kubernetes API (Pods, StatefulSets,
Services, etc.) with additional resources to simplify the application
deployment. Thanks to those additional resources, the user can specify the
topology and configuration of a database in a few lines of YAML. Easier to
understand, easier to maintain.
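With ECK's Elasticsearch custom resource, for instance, a whole cluster topology fits in a short manifest; the cluster name, version, and counts below are illustrative:

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.9.2
  nodeSets:
  - name: masters
    count: 3
    config:
      node.roles: ["master"]
  - name: data
    count: 10
    config:
      node.roles: ["data"]
```

Behind this resource, the Operator creates and reconciles the StatefulSets, Services, ConfigMaps, and Secrets described above.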
Of course, the simplification achieved from a lightweight
CRD design comes at a cost. Advanced Kubernetes users may want to tweak
settings they know are available in the underlying Kubernetes resources: Pod
labels, affinity rules, init containers, etc. A good CRD design aims at keeping
things simple by providing good defaults, but also needs to empower advanced
users by allowing them to override the settings they need.
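ECK takes this approach: each nodeSet accepts an optional podTemplate in which advanced users can override the generated Pod spec. The extra label and init container below are made-up examples of such overrides:

```yaml
nodeSets:
- name: data
  count: 10
  podTemplate:
    metadata:
      labels:
        team: search           # hypothetical extra Pod label
    spec:
      initContainers:
      - name: sysctl           # hypothetical host tuning before Elasticsearch starts
        securityContext:
          privileged: true
        command: ["sh", "-c", "sysctl -w vm.max_map_count=262144"]
```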
StatefulSets + Operators
In summary, StatefulSets are great building blocks for
running stateful workloads on Kubernetes. It makes sense for Operators to rely
on them and benefit from the Pod-to-volume mapping handled by Kubernetes. However,
Operators can go much further than what StatefulSets could possibly offer. They
improve the user experience through simplified custom resources and implement
their own additional orchestration logic which involves interacting with the
running application.
Each new version of Kubernetes brings a lot of new
features, and Operators evolve along the way. For example, version 1.11 introduced
PersistentVolume expansion, which is especially interesting in the context of
running stateful workloads. Unfortunately, this is not available yet through
StatefulSets, whose volume claim templates are immutable. Since Operators can
access the same underlying resources the StatefulSet controller manipulates,
they can work around this limitation by interacting with PersistentVolumeClaims
directly. A few Operators (ECK, TiDB) already handle it,
going a bit further beyond what's currently possible with StatefulSets alone.
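Concretely, since the volume claim template in a StatefulSet cannot be changed, an Operator can instead grow the storage request on each claim the StatefulSet created. The claim name below is hypothetical, and expansion only works when the StorageClass sets allowVolumeExpansion: true:

```yaml
# Hypothetical PVC created by the StatefulSet for Pod 0; an Operator can
# update spec.resources directly to trigger volume expansion, even though
# the StatefulSet's own volumeClaimTemplates remain immutable.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: elasticsearch-data-quickstart-es-data-0
spec:
  resources:
    requests:
      storage: 2Gi   # grown from 1Gi; the template still says 1Gi
```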
To learn more about containerized infrastructure and cloud native technologies, consider joining us at KubeCon + CloudNativeCon NA Virtual, November 17-20.
About the Author
Sebastien Guilloux, Senior Software
Engineer, Elastic
Sebastien Guilloux is a senior software
engineer at Elastic. He has spent most of his career working with distributed
systems, building resilient applications, and orchestrating Apache Kafka and
Elasticsearch nodes around the world. He currently works on writing a
Kubernetes Operator for the Elastic Stack, Elastic Cloud on Kubernetes (ECK).