By Prabh Simran Singh, Lead Software Engineer, Salesforce
Salesforce took a very early bet on Kubernetes
(K8s) in 2015 to help us begin the journey from monolith to microservices, and
we're happily using it today across product lines and business
units. Over the last five years, we gave teams the freedom to adopt K8s
as they saw fit. So, teams across the company spun up clusters and created
customized configurations, which became costly and difficult to manage over time. Teams
also had varying levels of K8s knowledge and expertise, and they weren't all
able to dedicate staff time to the operational overhead required to run a
cluster. We have many stories we could share about things that we learned the
hard way through long debugging processes. Imagine spending hours digging into an intermittent connectivity issue
only to discover the problem had been caused by a sysctl flag that had
been set to 0 in a naive attempt at optimization, when it should have been set
to 1!
This incident and others helped us realize we needed uniform practices,
tooling, and investments. From automation to visibility to security and network
monitoring, we needed solutions that applied across all of the
large-scale, multi-tenant clusters running across the many regions within
Salesforce. Enter the central Salesforce Kubernetes Platform team.
Our centralized team manages a substrate-agnostic K8s install. Our goal is to
empower service owners to focus on the unique value of their services without
having to worry about infrastructure. We handle concerns for the entire runtime
stack - from the Terraform provisioning pipeline to automated upgrades,
security configuration, and integration validations. Our measure of success is
increased developer agility with decreased operational costs and complexity. As
a side effect, our efforts aim to unlock economies of scale from a staffing
perspective, since teams don't have to dedicate an entire position to K8s
management.
By streamlining runtime concerns, we're able to offer a 24x7 availability
guarantee. At the same time, our team drives resiliency by continuously
implementing improvements that decrease the mean time to recover (MTTR) for all
services using our centralized platform.
Because we believe you may need to make this journey, too, we'd like to share
some of the challenges we faced and the solutions we identified.
Challenges
Where to start? The first set of challenges came from provisioning
and management. Our previous experience was with static infrastructure and
using Puppet to roll out K8s. We needed to get K8s up and running on public
cloud, while providing a substrate-agnostic way to provision and maintain the
lifecycle of the control/data plane. The goal was to do the work of automating
cluster upgrades, including K8s and operating system updates, just once.
The next challenge relates to cluster visibility. With clusters being
managed by individual teams, we didn't have an exact count of how many there
were, let alone insight into their health. We needed to build a monitoring system for
centralized visibility and management of the entire fleet of K8s clusters. This
should include a listing of all the clusters, the configurations of each
cluster, and cluster management recommendations. Having this single pane of
glass would offer the ability to see and act on patterns across hundreds of
clusters.
Because of Salesforce's burgeoning microservice architecture, we faced the
challenge of application deployment, as well. Scores of services doing
their own thing with K8s created a level of complexity that could hinder the
company's overall ability to scale. The centralized runtime needed to offer
health-mediated dashboards for monitoring deployments, as well as support for
advanced use cases from services with distinct rollout strategies, such as
canary or blue-green deployments.
We also observed issues with sidecar management. We needed a solution to
manage the rollout, monitoring, and visibility of mutating webhooks and
sidecars for common cluster use cases. (Spoiler alert: we solved this one by
building an open source tool.)
Networking is always a potential problem area, and it indeed proved
challenging across so many clusters. We needed the ability to securely manage
networking: between applications within a cluster, among applications outside
the cluster, and from outside the cluster to inside the cluster.
From secure communications we moved on to the challenge of security policy
enforcement, which needed to happen at two levels: at code check-in and
at runtime. We needed code scanning tooling to keep developers from checking in
code if there were security issues or if there was a failure to follow
recommended patterns. We also needed runtime detection and response scanning.
While the ability to run a cluster for each microservice across the company is
one of the upsides of K8s, all of the instances we had running created a
serious challenge of cost optimization. We needed to fit as many
services as we could onto a smaller amount of hardware in order to reap the cost
benefits. We needed visibility into cost to serve and the ability to map cost
to individual services, so that we could provide service owners recommendations
on how to optimize. We needed a place to collect and showcase all of this data.
At a high level, this visibility would allow us to better estimate our hardware
needs and more accurately work with our cloud providers to meet actual needs,
rather than overestimating (and overpaying!).
And finally, if all of these challenges weren't enough, we needed to address pod
distribution. Our unwavering commitment to system availability and performance
meant we needed to distribute pods across multiple availability zones to
increase availability, and we also wanted biased scheduling to decrease latency
across zones. This wasn't
supported natively by K8s, so we had to look at building the tooling we needed.
Solutions
Little by little, we've made progress against our list of
challenges. The team has been working on robust, generic, enterprise-scale
solutions, while maintaining our commitment to both use and contribute to open
source technologies.
For provisioning and management, we're leveraging Spinnaker, Terraform,
and Helm at the provisioning step, which work across cloud provider substrates.
For management of clusters, we developed a node recycler to use during K8s and
operating system (OS) upgrades that takes into account service update
constraints such as pod disruption budgets. It also works across substrates,
and we're hoping to open source it soon.
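We can't share the recycler itself yet, but here is a rough Go sketch, using client-go, of the core idea: cordon a node, then drain it through the Eviction API so the API server enforces each service's PodDisruptionBudget before any pod is removed. The helper name and retry behavior below are illustrative assumptions, not the recycler's actual implementation.

```go
// Package recycler: a minimal sketch of PDB-aware node draining.
package recycler

import (
	"context"
	"fmt"
	"time"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// drainNode cordons a node, then evicts its pods through the Eviction API so
// that PodDisruptionBudgets are honored before the node is rebuilt or upgraded.
func drainNode(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	// Cordon: mark the node unschedulable so no new pods land on it.
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.Unschedulable = true
	if _, err := client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		return err
	}

	// List pods on the node and request evictions; the API server rejects an
	// eviction that would violate a PodDisruptionBudget, so we back off and retry.
	pods, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		eviction := &policyv1.Eviction{
			ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		}
		if err := client.PolicyV1().Evictions(pod.Namespace).Evict(ctx, eviction); err != nil {
			// PDB would be violated (HTTP 429): defer and try again later.
			fmt.Printf("eviction of %s/%s deferred: %v\n", pod.Namespace, pod.Name, err)
			time.Sleep(30 * time.Second)
		}
	}
	return nil
}
```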
As far as cluster visibility goes, we've built a visibility pipeline,
leveraging Kafka and SQL, that aggregates cluster resources to check for trends
across clusters. It replicates K8s state into a central store, which is exposed
as a read-only pseudo-API: an HTTP server that mocks the K8s API and serves
data from a MySQL database. Its job is to take kube-apiserver-formatted REST
requests, translate them into MySQL queries, and return the query results as a
REST response to the client. This gives users the impression that they are
talking directly to production, seamlessly (a rough sketch of this translation
follows the list below). The central store is used for customer visibility and
debugging, ease of access, custom reports, metrics, and alerts. The pipeline
gives us visibility into the hundreds of clusters running across Salesforce. We
also take advantage of:
- Kube Dashboard, which offers visibility into individual clusters
- a custom Grafana metrics dashboard
- our own open source K8s history visualization tool, Sloop
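To make the request-translation idea concrete, here is a minimal Go sketch of the pseudo-API: it accepts a kube-apiserver-style GET for pods in a namespace, runs a MySQL query, and shapes the rows back into a PodList-style response. The route layout, table name, and schema here are illustrative assumptions, not the real service's design.

```go
// A minimal sketch of a read-only pseudo-API backed by MySQL.
package main

import (
	"database/sql"
	"encoding/json"
	"log"
	"net/http"
	"strings"

	_ "github.com/go-sql-driver/mysql"
)

type server struct{ db *sql.DB }

// handlePods serves GET /api/v1/namespaces/{namespace}/pods from MySQL instead
// of the live cluster, so callers can use familiar API paths for debugging.
func (s *server) handlePods(w http.ResponseWriter, r *http.Request) {
	parts := strings.Split(strings.Trim(r.URL.Path, "/"), "/")
	// Expected path: api/v1/namespaces/<namespace>/pods
	if len(parts) != 5 || parts[4] != "pods" {
		http.NotFound(w, r)
		return
	}
	namespace := parts[3]

	// In this sketch, the target cluster is passed as a query parameter and
	// each pod's manifest is stored as a JSON column.
	rows, err := s.db.Query(
		"SELECT manifest FROM pods WHERE cluster = ? AND namespace = ?",
		r.URL.Query().Get("cluster"), namespace)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	defer rows.Close()

	items := []json.RawMessage{}
	for rows.Next() {
		var manifest []byte
		if err := rows.Scan(&manifest); err == nil {
			items = append(items, manifest)
		}
	}
	// Shape the response like a PodList so existing tooling can parse it.
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(map[string]interface{}{
		"apiVersion": "v1",
		"kind":       "PodList",
		"items":      items,
	})
}

func main() {
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/k8sstate")
	if err != nil {
		log.Fatal(err)
	}
	s := &server{db: db}
	http.HandleFunc("/api/v1/namespaces/", s.handlePods)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```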
Helm and Spinnaker help us manage the lifecycle of
applications, and we provide Grafana dashboards that service owners can use to
see the health of their applications. So for application deployment, we
offer a custom versioned manifest solution for canary deployment. It deploys a
particular version of a Helm Chart and Spinnaker pipeline in a health-mediated
way across different environments (perf, dev, test, and prod). For blue-green
rollouts, we rely on Argo, which helps us deploy a new application version
alongside the old one and phase out the old version once the new one has been
validated with traffic.
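To illustrate what "health-mediated" means in practice, here is a small Go sketch of the kind of gate a pipeline stage can apply before promoting a release to the next environment: it checks that a Deployment's latest generation is fully rolled out and available. The function and the check itself are hypothetical simplifications; a real promotion gate would also consult service-level metrics.

```go
// Package promote: a simplified health gate for promotion between
// environments (e.g. dev -> test -> prod).
package promote

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deploymentHealthy reports whether a Deployment has fully rolled out:
// the observed generation is current and every desired replica is available.
func deploymentHealthy(ctx context.Context, client kubernetes.Interface, namespace, name string) (bool, error) {
	d, err := client.AppsV1().Deployments(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	desired := int32(1)
	if d.Spec.Replicas != nil {
		desired = *d.Spec.Replicas
	}
	healthy := d.Status.ObservedGeneration >= d.Generation &&
		d.Status.UpdatedReplicas == desired &&
		d.Status.AvailableReplicas == desired
	if !healthy {
		fmt.Printf("%s/%s not yet healthy: %d/%d available\n",
			namespace, name, d.Status.AvailableReplicas, desired)
	}
	return healthy, nil
}
```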
We hinted earlier that we'd built a solution for sidecar management, so let's dig into that a bit more.
At a high level, the generic sidecar injector framework divides the
configuration of the mutating webhook admission controller into two parts. The
first is what needs to be injected (sidecar configurations), and the second is
what triggers those injections (the mutation configurations). Separating out
these configurations allows teams to specify multiple sidecars and multiple
mutations, independently choosing which mutation injects which sidecars. This
loose coupling supports different team structures, such as a team that supports
multiple sidecars versus a team that supports just one. We've found this
generic mutating webhook framework extremely useful within Salesforce. Not only
are multiple teams collaborating on a single codebase, they're also helping
each other improve by reviewing new changes together and building on a common
platform. This tool is open source and available at github.com/salesforce/generic-sidecar-injector.
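To make the separation concrete, here is a simplified Go sketch of the two configuration halves and how a mutation resolves to sidecars. The field names and trigger annotation are illustrative only; the actual configuration format is documented in the repository.

```go
// Package injector: a simplified sketch of the two-part configuration idea
// behind the generic sidecar injector. Field names are illustrative; see
// github.com/salesforce/generic-sidecar-injector for the real format.
package injector

import corev1 "k8s.io/api/core/v1"

// SidecarConfig describes *what* can be injected: containers, init
// containers, and volumes for one named sidecar.
type SidecarConfig struct {
	Name           string
	InitContainers []corev1.Container
	Containers     []corev1.Container
	Volumes        []corev1.Volume
}

// MutationConfig describes *when* to inject: which annotation on an incoming
// pod triggers the mutation, and which sidecars it pulls in.
type MutationConfig struct {
	Name              string
	TriggerAnnotation string   // e.g. "inject-sidecar.example.io" (hypothetical)
	Sidecars          []string // names of SidecarConfig entries to inject
}

// sidecarsFor resolves the sidecars a pod should receive by matching its
// annotations against the mutation configurations.
func sidecarsFor(pod *corev1.Pod, mutations []MutationConfig, sidecars map[string]SidecarConfig) []SidecarConfig {
	var out []SidecarConfig
	for _, m := range mutations {
		if pod.Annotations[m.TriggerAnnotation] != "true" {
			continue
		}
		for _, name := range m.Sidecars {
			if sc, ok := sidecars[name]; ok {
				out = append(out, sc)
			}
		}
	}
	return out
}
```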
For networking, we rely on a service mesh, specifically Istio,
for service-to-service authentication. Our colleagues have been contributing
back to this open source project as we've put it through its paces. We're
currently the number three most active contributor to the Istio project and
earned a Contribution Seat on its Steering Committee, which gives us a role in
overseeing and shaping the direction of the project.
To enforce security policies at runtime, we've brought in Open Policy Agent
(OPA). It runs inside a cluster and ensures that security policies are met. At
code check-in, we validate Helm charts using OPA, which is built into our
continuous integration (CI) platform.
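As a rough illustration of the check-in side, the Go sketch below uses the OPA SDK to evaluate a Rego rule against a rendered manifest and surface any violations. The policy shown (rejecting privileged containers) and the wiring around it are assumptions for illustration, not our production policy set.

```go
// A minimal sketch of gating a rendered manifest with OPA in CI.
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"

	"github.com/open-policy-agent/opa/rego"
)

// A toy policy: deny any container that asks to run privileged.
const policy = `
package ci.deny

deny[msg] {
	c := input.spec.template.spec.containers[_]
	c.securityContext.privileged == true
	msg := sprintf("container %v must not run privileged", [c.name])
}
`

func main() {
	// A fragment of a rendered Deployment manifest, as JSON, for illustration.
	raw := `{"spec":{"template":{"spec":{"containers":[{"name":"app","securityContext":{"privileged":true}}]}}}}`
	var manifest map[string]interface{}
	if err := json.Unmarshal([]byte(raw), &manifest); err != nil {
		log.Fatal(err)
	}

	query, err := rego.New(
		rego.Query("data.ci.deny.deny"),
		rego.Module("policy.rego", policy),
	).PrepareForEval(context.Background())
	if err != nil {
		log.Fatal(err)
	}

	results, err := query.Eval(context.Background(), rego.EvalInput(manifest))
	if err != nil {
		log.Fatal(err)
	}
	// Any non-empty deny set fails the check-in.
	for _, r := range results {
		for _, violation := range r.Expressions[0].Value.([]interface{}) {
			fmt.Println("policy violation:", violation)
		}
	}
}
```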
Our performance engineering teams have gotten very efficient at bin packing,
running and simulating tests on various instance types to work toward cost
optimization. We addressed
visibility into per-tenant costs via an in-house solution built on kube-state-metrics.
From this dashboard, we're able to get visibility into underlying infrastructure
resource utilization,
per-tenant resource consumption, insight into patterns of RAM and CPU
allocation, and resource recommendation metrics from Vertical Pod Autoscaler
(VPA). We also use Horizontal Pod Autoscaler (HPA) to automatically scale the
number of pods in a deployment or stateful set based on observed CPU
utilization. We have a capacity planning team that is beginning to work on
estimating hardware needs with dashboards, but this is a big area of
opportunity for us to continue optimizing.
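As a rough example of the autoscaling piece, the sketch below creates a CPU-based HorizontalPodAutoscaler with client-go. The service name, namespace, and thresholds are hypothetical.

```go
// A minimal sketch of creating a CPU-based HPA with client-go.
package main

import (
	"context"
	"log"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	minReplicas := int32(3) // keep at least one pod per availability zone
	targetCPU := int32(70)  // scale out above 70% average CPU utilization

	hpa := &autoscalingv2.HorizontalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{Name: "example-service", Namespace: "example"},
		Spec: autoscalingv2.HorizontalPodAutoscalerSpec{
			ScaleTargetRef: autoscalingv2.CrossVersionObjectReference{
				APIVersion: "apps/v1", Kind: "Deployment", Name: "example-service",
			},
			MinReplicas: &minReplicas,
			MaxReplicas: 12,
			Metrics: []autoscalingv2.MetricSpec{{
				Type: autoscalingv2.ResourceMetricSourceType,
				Resource: &autoscalingv2.ResourceMetricSource{
					Name: corev1.ResourceCPU,
					Target: autoscalingv2.MetricTarget{
						Type:               autoscalingv2.UtilizationMetricType,
						AverageUtilization: &targetCPU,
					},
				},
			}},
		},
	}

	if _, err := client.AutoscalingV2().HorizontalPodAutoscalers("example").
		Create(context.Background(), hpa, metav1.CreateOptions{}); err != nil {
		log.Fatal(err)
	}
}
```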
We currently run most of our services evenly distributed across three
availability zones (AZs). Some applications have a goal of running mostly in a
single AZ, using the others only for failover, to address the cost overhead of
inter-AZ communication. Pod Topology Spread Constraints were
insufficient to satisfy our uneven, biased pod distribution across
zones, so we're contributing to Open Kruise to support
these use cases.
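For reference, the sketch below expresses the even three-zone spread as a native topology spread constraint using client-go types; there is no equivalent native knob for the biased, uneven case, which is what the Open Kruise work addresses. The label key and values are illustrative.

```go
// Package scheduling: a sketch of an even zone spread with a native
// topologySpreadConstraint.
package scheduling

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// evenZoneSpread returns a constraint that keeps pods of one app balanced
// across availability zones. There is no native way to say "prefer 80% of
// pods in one zone", which is the biased case described above.
func evenZoneSpread(appLabel string) corev1.TopologySpreadConstraint {
	return corev1.TopologySpreadConstraint{
		MaxSkew:           1,                             // zones may differ by at most one pod
		TopologyKey:       "topology.kubernetes.io/zone", // spread across availability zones
		WhenUnsatisfiable: corev1.DoNotSchedule,          // hard constraint rather than best effort
		LabelSelector: &metav1.LabelSelector{
			MatchLabels: map[string]string{"app": appLabel},
		},
	}
}
```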
Conclusion
With the initial work we've done to address the challenges
of a centralized Kubernetes platform, we've reached an inflection point for our
toolset and team. Our next challenge is to make our offerings sufficiently
feature-rich to meet the needs of 80% of use cases across Salesforce
Engineering, seriously reducing the number of Kubernetes anti-patterns we see
across the disparate clusters. We'll continue working to
make the platform compatible across substrates, while setting a high security
bar in order to uphold our company commitment to trust. And, based on our
experimentation and testing so far, we've documented some best practices for
Kubernetes service owners that we'll be publishing soon, so stay tuned!
To learn more about containerized infrastructure and cloud native technologies,
consider joining us at KubeCon + CloudNativeCon NA Virtual, November 17-20.
About the Author
Prabh Simran Singh, Lead Software Engineer, Salesforce
Prabh Simran Singh is an experienced professional,
currently working at Salesforce as an Infrastructure Engineer. He has worked in
cross-functional disciplines with experiences in core and application software
systems. He is passionate about discovering, building, and transforming products
using high-quality code and creativity in his work. His degree in
Information Systems Management has allowed him to combine his technical
knowledge and experience with an operational knowledge of the business side.
His aim is to be a problem solver and a decision maker to contribute
effectively in any organization.