Virtualization Technology News and Information
How Salesforce Operates Kubernetes Multitenant Clusters in Public Cloud at Scale

By Prabh Simran Singh, Lead Software Engineer, Salesforce

Salesforce took a very early bet on Kubernetes (K8s) in 2015 to help us begin the journey from monolith to microservices, and we're happily using it today across product lines and business units. Over the last five years, we gave teams the freedom to adopt K8s as they saw fit. So, teams across the company spun up clusters and created customized configurations, which...became costly and difficult to manage. Teams also had varying levels of K8s knowledge and expertise, and they weren't all able to dedicate staff time to the operational overhead required to run a cluster. We have many stories we could share about things that we learned the hard way through long debugging processes. Imagine spending hours digging into an intermittent connectivity failure issue only to discover the problem had been caused by a sysctl flag that had been set to 0 in a naive attempt at optimization, when it should have been set to 1!

This incident and others helped us realize we needed uniform practices, tooling, and investments. From automation to visibility to security and network monitoring, we needed solutions that applied across all of the large-scale, multi-tenant clusters running across the many regions within Salesforce. Enter the central Salesforce Kubernetes Platform team.

Our centralized team manages a substrate-agnostic K8s install. Our goal is to empower service owners to focus on the unique value of their services without having to worry about infrastructure. We handle concerns for the entire runtime stack - from the Terraform provisioning pipeline to automated upgrades, security configuration, and integration validations. Our measure of success is increased developer agility with decreased operational costs and complexity. As a side effect, our efforts aim to unlock economies of scale from a staffing perspective, since teams don't have to dedicate an entire position to K8s management.

By streamlining runtime concerns, we're able to offer a 24x7 availability guarantee. At the same time, our team drives resiliency by continuously implementing improvements that decrease the mean time to recover (MTTR) for all services using our centralized platform.

Because we believe you may need to make this journey, too, we'd like to share some of the challenges we faced and the solutions we identified.

Challenges

Where to start? The first set of challenges came from provisioning and management. Our previous experience was with static infrastructure and using Puppet to roll out K8s. We needed to get K8s up and running on public cloud, while providing a substrate-agnostic way to provision and maintain the lifecycle of the control and data planes. The goal was to do the automation work once and reuse it for every cluster upgrade, including K8s and operating system updates.

The next challenge relates to cluster visibility. With clusters being managed by individual teams, we didn't even have an exact count of how many there were, let alone a view of their health. We needed to build a monitoring system for centralized visibility and management of the entire fleet of K8s clusters. This should include a listing of all the clusters, the configurations of each cluster, and cluster management recommendations. Having this single pane of glass would offer the ability to see and act on patterns across hundreds of clusters.

Because of Salesforce's burgeoning microservice architecture, we faced the challenge of application deployment, as well. Scores of services doing their own thing with K8s created a level of complexity that could hinder the company's overall ability to scale. The centralized runtime needed to offer health-mediated dashboards for monitoring deployments, and to support advanced use cases in which different services need distinct rollout strategies, such as canary or blue-green deployments.

We also observed issues with sidecar management. We needed a solution to manage the rollout, monitoring, and visibility of mutating webhooks and sidecars for common cluster use cases. (Spoiler alert: we solved this one by building an open source tool.)

Networking is always a potential problem area, and it indeed proved challenging across so many clusters. We needed the ability to securely manage networking: between applications within a cluster, among applications outside the cluster, and from outside the cluster to inside the cluster.

From secure communications we moved on to the challenge of security policy enforcement, which needed to happen at two levels: at code check-in and at runtime. We needed code scanning tooling to keep developers from checking in code if there were security issues or if there was a failure to follow recommended patterns. We also needed runtime detection and response scanning.

While the ability to run a cluster for each microservice across the company is one of the upsides of K8s, all of the instances we had running created a serious challenge of cost optimization. We needed to fit as many services as we could onto a smaller amount of hardware in order to reap the cost benefits. We needed visibility into cost to serve and the ability to map cost to individual services, so that we could provide service owners recommendations on how to optimize. We needed a place to collect and showcase all of this data. At a high level, this visibility would allow us to better estimate our hardware needs and more accurately work with our cloud providers to meet actual needs, rather than overestimating (and overpaying!).

And finally, if all of these challenges weren't enough, we needed to address pod distribution. Our unwavering commitment to system availability and performance means that we needed to make sure we maintain maximum availability. We wanted to set up multi-availability zone distribution to increase availability. We also wanted biased scheduling to decrease latency across zones. This wasn't supported natively by K8s, so we had to look at building the tooling we needed.

Solutions

Little by little, we've made progress against our list of challenges. The team has been working on robust, generic, enterprise-scale solutions, while maintaining our commitment to both use and contribute to open source technologies.

For provisioning and management, we're leveraging Spinnaker, Terraform, and Helm at the provisioning step, all of which work across cloud provider substrates. For management of clusters, we developed a node recycler to use during K8s and operating system (OS) upgrades that takes into account service update constraints such as pod disruption budgets. It also works across substrates, and we're hoping to open source it soon.
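To make the idea of a budget-aware recycler concrete, here is a minimal Python sketch. It is an illustration only, not the Salesforce node recycler, whose internals this article does not describe. The `Budget` class, the `can_drain` and `plan_recycle` functions, and the "minimum healthy pods" semantics are assumptions modeled loosely on Kubernetes PodDisruptionBudgets.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    """Hypothetical per-service budget: minimum pods that must stay healthy."""
    service: str
    min_available: int


def can_drain(node_pods, healthy_counts, budgets):
    """Return True if evicting every pod on one node keeps each service
    at or above its min_available.

    node_pods: dict of service -> number of that service's pods on the node
    healthy_counts: dict of service -> healthy pods across the cluster
    budgets: dict of service -> Budget
    """
    for service, on_node in node_pods.items():
        budget = budgets.get(service)
        if budget is None:
            continue  # no budget declared, so nothing to violate
        remaining = healthy_counts.get(service, 0) - on_node
        if remaining < budget.min_available:
            return False
    return True


def plan_recycle(nodes, healthy_counts, budgets):
    """Split nodes into those that could be drained now and those deferred.

    Each node is evaluated on its own. A real recycler would drain one
    node at a time and re-check the counts before moving to the next.
    """
    drain_now, deferred = [], []
    for name, node_pods in nodes.items():
        if can_drain(node_pods, healthy_counts, budgets):
            drain_now.append(name)
        else:
            deferred.append(name)
    return drain_now, deferred
```

With three healthy `api` pods and a budget of two, a node hosting one `api` pod can be drained, while a node hosting two has to wait until replacements are healthy elsewhere.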

As far as cluster visibility goes, we've built a visibility pipeline, leveraging Kafka and SQL, that aggregates cluster resources so we can check for trends across clusters. It replicates K8s state into a central store, which is exposed as a read-only pseudo-API: an HTTP server that mocks the K8s API and serves data from a MySQL database. Its job is to take kube-apiserver-formatted REST requests, translate them into MySQL queries, run them, and return the results to the client as REST responses. This gives users the impression that they are talking directly to production. The central store is used for customer visibility and debugging, ease of access, custom reports, metrics, and alerts. The pipeline gives us some visibility into the hundreds of clusters running across Salesforce. We also take advantage of:

  • Kubedashboard, which offers visibility into individual clusters
  • a custom Grafana metrics dashboard
  • and our own open source K8s history visualization tool, Sloop
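The request-to-query translation performed by the pseudo-API described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the actual service: the `resources` table and its columns are hypothetical, only namespaced core-group paths are handled, and `sqlite3` stands in for MySQL so the example runs without a database server.

```python
import json
import re
import sqlite3

# Matches namespaced core-group paths such as
# /api/v1/namespaces/web/pods or /api/v1/namespaces/web/pods/frontend-1.
PATH_RE = re.compile(
    r"^/api/v1/namespaces/(?P<namespace>[^/]+)/(?P<kind>[^/]+)"
    r"(?:/(?P<name>[^/]+))?$"
)


def translate(path):
    """Turn a kube-apiserver-style path into a parameterized (sql, params) pair."""
    match = PATH_RE.match(path)
    if match is None:
        raise ValueError(f"unsupported path: {path}")
    # sqlite3 uses "?" placeholders; MySQL drivers typically use "%s".
    sql = "SELECT manifest FROM resources WHERE namespace = ? AND kind = ?"
    params = [match["namespace"], match["kind"]]
    if match["name"] is not None:
        sql += " AND name = ?"
        params.append(match["name"])
    return sql, params


def handle(conn, path):
    """Serve a read-only request from the central store.

    Simplification: every response is list-shaped, whereas the real
    Kubernetes API returns a single object for a named resource.
    """
    sql, params = translate(path)
    rows = conn.execute(sql, params).fetchall()
    items = [json.loads(row[0]) for row in rows]
    return {"apiVersion": "v1", "kind": "List", "items": items}


def demo_store():
    """Build an in-memory store holding one replicated pod (demo data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE resources "
        "(namespace TEXT, kind TEXT, name TEXT, manifest TEXT)"
    )
    pod = {"kind": "Pod", "metadata": {"name": "frontend-1", "namespace": "web"}}
    conn.execute(
        "INSERT INTO resources VALUES (?, ?, ?, ?)",
        ("web", "pods", "frontend-1", json.dumps(pod)),
    )
    return conn
```

Calling `handle(demo_store(), "/api/v1/namespaces/web/pods")` returns the replicated pod from the store without ever touching a production API server, which is the property the pipeline relies on.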

Helm and Spinnaker help us manage the lifecycle of applications, and we provide Grafana dashboards that service owners can use to see the health of their applications. For application deployment, we offer a custom versioned manifest solution for canary deployment. It deploys a particular version of a Helm chart and Spinnaker pipeline in a health-mediated way across different environments (perf, dev, test, and prod). For blue-green rollouts, we rely on Argo, which helps us deploy a new application version alongside the old one and phase out the old version once the new one has been validated with traffic.

We hinted earlier that we'd built a solution for sidecar management, so let's dig into that a bit more. At a high level, the generic sidecar injector framework divides the configuration of the mutating webhook admission controller into two parts. The first is what needs to be injected (sidecar configurations), and the second is what triggers those injections (the mutation configurations). Separating out these configurations allows teams to specify multiple sidecars and multiple mutations, independently choosing which mutation injects which sidecars. This loose coupling supports different team structures, such as a team that supports multiple sidecars versus a team that supports just one. We've found this generic mutating webhook framework extremely useful within Salesforce. Not only are multiple teams collaborating on a single codebase, they're helping each other get better, reviewing new changes together, and collaborating on a common platform. This tool is open source and available at github.com/salesforce/generic-sidecar-injector.
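The two-part split can be illustrated with a small Python sketch. It does not reproduce the configuration format or code of generic-sidecar-injector. The sidecar names, the `example.invalid` images and annotations, the `"enabled"` trigger value, and the `mutate` function are all hypothetical, chosen only to show how several mutations can share one sidecar definition.

```python
import copy

# Part 1: what can be injected (sidecar configurations), keyed by name.
SIDECARS = {
    "log-shipper": {
        "name": "log-shipper",
        "image": "example.invalid/log-shipper:1.0",
    },
    "metrics-agent": {
        "name": "metrics-agent",
        "image": "example.invalid/metrics-agent:2.3",
    },
}

# Part 2: what triggers an injection (mutation configurations).
# Each mutation picks its sidecars independently of the others.
MUTATIONS = [
    {
        "annotation": "example.invalid/inject-logging",
        "sidecars": ["log-shipper"],
    },
    {
        "annotation": "example.invalid/inject-observability",
        "sidecars": ["log-shipper", "metrics-agent"],
    },
]


def mutate(pod):
    """Return a copy of the pod with sidecars added for each triggered mutation."""
    annotations = pod.get("metadata", {}).get("annotations", {})
    mutated = copy.deepcopy(pod)
    present = {c["name"] for c in mutated["spec"]["containers"]}
    for mutation in MUTATIONS:
        if annotations.get(mutation["annotation"]) != "enabled":
            continue  # this mutation was not requested for the pod
        for sidecar_name in mutation["sidecars"]:
            if sidecar_name in present:
                continue  # avoid injecting the same sidecar twice
            mutated["spec"]["containers"].append(dict(SIDECARS[sidecar_name]))
            present.add(sidecar_name)
    return mutated
```

Because the sidecar definitions live apart from the triggers, the team that owns `log-shipper` can update its image in one place while other teams add or remove mutations that reference it.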

For networking, we rely on a service mesh, specifically Istio, for service-to-service authentication. Our colleagues have been contributing back to this open source project as we've put it through its paces. We're currently the number three most active contributor to the Istio project and earned a Contribution Seat on its Steering Committee to help play a role in overseeing and shaping the direction of the project.

To enforce security policies at runtime, we've brought in Open Policy Agent (OPA). It runs inside a cluster and ensures that security policies are met. At code check-in, we validate the Helm chart using OPA, which is built into our continuous integration (CI) platform.
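OPA policies are written in Rego, so the following Python sketch is a stand-in rather than an OPA example. It only illustrates the kind of check that can run both against rendered chart output at check-in and against incoming objects at runtime. The two rules (no privileged containers, no unpinned or `latest` image tags) and the `violations` function are assumptions for illustration, not policies taken from this article.

```python
def violations(manifest):
    """Return human-readable policy violations for one Deployment-style manifest."""
    found = []
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")

        # Rule 1 (hypothetical): privileged containers are rejected.
        if container.get("securityContext", {}).get("privileged", False):
            found.append(f"{name}: privileged containers are not allowed")

        # Rule 2 (hypothetical): images must carry an explicit, non-latest tag.
        # Only the last path segment is inspected, so a registry port such as
        # registry:5000/app is not mistaken for a tag.
        image = container.get("image", "")
        last_segment = image.rsplit("/", 1)[-1]
        tag = last_segment.rsplit(":", 1)[1] if ":" in last_segment else ""
        if tag in ("", "latest"):
            found.append(f"{name}: image must be pinned to a non-latest tag")
    return found
```

A CI step could fail the build whenever `violations` returns a non-empty list, while the same rules evaluated at admission time would reject the object before it reaches the cluster.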

Our performance engineering teams have gotten very efficient at bin packing, running and simulating tests on various instances, to work toward cost optimization. We addressed visibility into per-tenant costs via an in-house solution built on kube-state-metrics. From this dashboard, we get visibility into underlying infrastructure resource utilization, per-tenant resource consumption, patterns of RAM and CPU allocation, and resource recommendation metrics from Vertical Pod Autoscaler (VPA). We also use Horizontal Pod Autoscaler (HPA) to automatically scale the number of pods in a deployment or stateful set based on observed CPU utilization. We have a capacity planning team that is beginning to work on estimating hardware needs with dashboards, but this is a big area of opportunity for us to continue optimizing.
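For readers unfamiliar with how HPA arrives at a replica count, the core calculation documented by Kubernetes is `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The Python sketch below reproduces only that core step. The function name and the default `min_replicas` and `max_replicas` values are illustrative assumptions, and the real controller additionally handles unready pods, missing metrics, and stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA core calculation for a CPU utilization target."""
    ratio = current_cpu_pct / target_cpu_pct
    # Within tolerance of the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))
```

For example, four pods averaging 90% CPU against a 60% target give a ratio of 1.5, so the sketch asks for six pods. At 63% the ratio is 1.05, inside the tolerance, and the count stays at four.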

We currently run most of our services evenly distributed across three availability zones (AZs). Some applications have a goal of running mostly in a single AZ, using the others only for failover, to address the cost overhead of inter-AZ communication. Pod Topology Spread Constraints were insufficient to satisfy our need for uneven, biased pod distribution across zones, so we're contributing to OpenKruise to support these use cases.
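A biased distribution can be pictured as splitting a replica count across zones according to weights. The Python sketch below shows one way to do that arithmetic with the largest-remainder method. It is not the OpenKruise API or scheduler logic; the `biased_distribution` function, the integer weights, and the zone names are assumptions made for illustration.

```python
def biased_distribution(replicas, zone_weights):
    """Split replicas across zones in proportion to integer weights.

    Uses the largest-remainder method so the counts always sum to
    `replicas`. Ties are broken by the order of zone_weights.
    """
    total = sum(zone_weights.values())
    if total <= 0:
        raise ValueError("zone weights must sum to a positive number")

    # divmod keeps the arithmetic in integers: (whole pods, remainder).
    shares = {zone: divmod(replicas * weight, total)
              for zone, weight in zone_weights.items()}
    counts = {zone: whole for zone, (whole, _) in shares.items()}

    # Hand out the pods lost to rounding, largest remainder first.
    leftover = replicas - sum(counts.values())
    by_remainder = sorted(shares, key=lambda zone: shares[zone][1], reverse=True)
    for zone in by_remainder[:leftover]:
        counts[zone] += 1
    return counts
```

With weights of 8, 1, and 1, ten replicas land as eight in the preferred zone and one in each failover zone, which keeps most traffic inside a single AZ while leaving warm capacity elsewhere.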

Conclusion

With the initial work we've done to address the challenges of a centralized Kubernetes platform, we've reached an inflection point for our toolset and team. Our next challenge is to make our offerings sufficiently feature rich so that they meet the needs of 80% of use cases across Salesforce Engineering, making a serious reduction in the number of Kubernetes anti-patterns we see across the disparate clusters. We'll continue working to make the platform compatible across substrates, while setting a high security bar in order to uphold our company commitment to trust. And, based on our experimentation and testing so far, we've documented some best practices for Kubernetes service owners that we'll be publishing soon, so stay tuned!

To learn more about containerized infrastructure and cloud native technologies, consider joining us at KubeCon + CloudNativeCon NA Virtual, November 17-20.

About the Author

Prabh Simran Singh, Lead Software Engineer, Salesforce

Prabh Simran Singh is an experienced professional, currently working at Salesforce as an Infrastructure Engineer. He has worked in cross-functional disciplines with experience in core and application software systems. He is passionate about discovering, building and transforming products using high quality code and creativity in his work. His degree in Information Systems Management has allowed him to combine his technical knowledge and experience with an operational knowledge of the business side. His aim is to be a problem solver and a decision maker to contribute effectively in any organization.

Published Friday, October 30, 2020 7:34 AM by David Marshall