Kubeflow is ready for production

At its core, Kubeflow offers an end-to-end Machine Learning (ML) stack orchestration toolkit to build, deploy, scale, and manage complex ML workflows on top of Kubernetes. It consists of multiple components that handle different parts of a typical ML workflow, such as exploration with Jupyter Notebooks, distributed training, ML pipelines, hyperparameter tuning, and inference.

Kubeflow 1.0 was announced back in March 2020, with a main focus on graduating the core set of applications needed to develop, build, train, and deploy models efficiently on Kubernetes. While this was a huge step for Kubeflow, as a community of developers we knew there was still room for improvement before we could claim it was ready for production at scale.

But we've come a long way since then, and the project has undergone major changes since the initial 1.0 release. These changes, covering everything from the robustness of the code to how the project is developed and organized, have drastically improved the end-user experience and strengthened our governance processes. In this article you'll get a tour of all the project's exciting updates since 1.0, plus, as the 1.4 Release Manager, why I believe Kubeflow is ready for production.

Organization based on Working Groups

After version 1.0 was released, the Kubeflow community decided to move to a governance model very similar to that of Kubernetes. The project was broken down into distinct Working Groups (WGs) and SIGs, each with its own chairs, tech leads, and calendars. Working Groups became the primary vehicle for technical decision-making in the project, and they regularly report on their work to the wider project in community channels.

With this transition, the different components, and GitHub repos, of Kubeflow are now maintained by the corresponding Working Groups. Kubeflow currently has the following Working Groups:

  • Serving: for doing model inference
  • Training: for performing distributed training on Kubernetes
  • Manifests: for providing simple-to-install manifests of all the Kubeflow components

This arrangement has allowed Kubeflow to be governed in an open fashion, where contributors can collaborate more efficiently and move the project forward together with the Working Group leads.

Simplified installation

One of the main pain points we've heard from users is that Kubeflow was really complicated to install. This was indeed the case, and it was mostly caused by kfctl, a CLI tool that handled installing, uninstalling, and upgrading Kubeflow installations.

While the premise of this tool was promising, the implementation didn't live up to expectations. The CLI had custom logic for parsing and generating the YAML files, which in many cases had bugs or did not support kustomize features. Later on, an effort to support kustomize v3 further complicated the structure of the manifests.

All the above led to a user experience that left a lot to be desired.

But the Manifests WG folks took this feedback very seriously and worked on drastically simplifying the structure of the manifests. In the 1.3 release of Kubeflow, the Manifests WG decided to directly provide kustomize YAMLs for the different components of Kubeflow.

This approach removed the dependency on kfctl and treated the different components of Kubeflow as plain kustomize packages. This decision also came with some powerful benefits, necessary for handling manifests effectively:

  1. Users can now declaratively apply configurations for their different components.
  2. It reinforces GitOps patterns, since users can git add/commit/apply their changes as overlays on top of the base upstream manifests (see the sketch after this list).
  3. Users can easily mix and match different components in their Kubeflow installation.
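As a rough illustration of points 2 and 3, such an overlay is just a kustomization.yaml, tracked in git, that pulls in only the components you want. Note that the resource paths below are illustrative, not authoritative; the real ones live in the kubeflow/manifests repo and can change between releases:

    # kustomization.yaml -- a minimal, hypothetical GitOps-style overlay.
    apiVersion: kustomize.config.k8s.io/v1beta1
    kind: Kustomization
    resources:
      # Pick only the Kubeflow components you need (example paths):
      - github.com/kubeflow/manifests//apps/pipeline/upstream/env/platform-agnostic-multi-user?ref=v1.4.0
      - github.com/kubeflow/manifests//apps/katib/upstream/installs/katib-with-kubeflow?ref=v1.4.0
    patchesStrategicMerge:
      # Local customizations committed alongside this file (hypothetical):
      - patches/pipeline-resources.yaml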

For 1.3, the installation process was even simplified down to a single kustomize command.
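For reference, the kubeflow/manifests README for these releases documents a command along the following lines; it assumes the repo has been cloned and you are running from its root, and the exact invocation may differ between versions:

    # Build everything under example/ and apply it; the retry loop works
    # around ordering dependencies (e.g. CRDs that must exist before the
    # objects that use them are applied).
    while ! kustomize build example | kubectl apply -f -; do
        echo "Retrying to apply resources"
        sleep 10
    done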

Established release process

Kubeflow 1.3 was the first release that was completely driven by Kubeflow's new Working Group structure. The release process, spearheaded by Yannis Zarkadas as the release manager, required efficient communication between the different Working Groups in order to be completed successfully. While we hit some bumps in the road, in the end we managed, as a community, to overcome them and deliver our first coordinated release!

The community even used the 1.3 release as an opportunity to further improve its processes. After the release was complete, we held a very productive retrospective in which all the stakeholders of the release, even the distribution owners, provided their feedback on the process. We took this invaluable feedback and converted it into a first iteration of a release handbook, capturing the good parts and aiming to mitigate the biggest hardships we faced.

With this handbook in our arsenal we started the 1.4 release, but this time with a well-defined process. The release process includes distinct phases, like Feature Freeze and Documentation Update, to ensure the quality of the final code by leaving ample room for testing and stabilization. The process also allocates a specific time frame for Kubeflow distributions to battle-test the manifests before we release the final version.

All of the above led to a more robust and production-ready Kubeflow 1.4 release that has gotten a green light both from the Working Group leads and from Kubeflow's distributions.

Web UIs for core ML workflows

In both the 1.3 and 1.4 releases there was a huge focus on the UX provided by the platform. We strongly believe that even the most robust and powerful API falls short if the corresponding UI does not let users interact effectively with the underlying infrastructure.

With this mindset, we aimed to provide intuitive web apps for the more advanced Kubeflow APIs, like AutoML, model serving, and volume management, to make it seamless for users to interact with the platform's mechanisms.

Throughout these two releases we have added a total of four new web applications that manage different parts of a typical ML workflow. Specifically, we've included web apps for managing:

  • PVCs, for holding Data Scientists' datasets in the cluster
  • TensorBoard instances, for visualizations
  • AutoML experiments, with graphs that visualize the performance of different models
  • Model servers, which are used for inference

These web applications aim to allow users to launch and manage the different parts of their ML workflows, while hiding a lot of the underlying K8s concepts that are not relevant to Data Scientists.
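To give a sense of what is being hidden, creating a volume through the web app spares the user from hand-writing raw Kubernetes objects like the following, where the name, namespace, and size are placeholders:

    # pvc.yaml -- the kind of object the Volumes UI abstracts away.
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: my-dataset        # placeholder
      namespace: team-alpha   # the user's Kubeflow namespace
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi       # placeholder size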

At the same time, the web apps comply with Kubeflow's security policies and allow for isolation between teams by respecting Kubernetes RBAC and using namespace isolation. This goes hand in hand with a robust authentication system, which cluster admins can mix and match with their preferred OIDC provider.
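For context, this per-team isolation is driven by Kubeflow's Profile resource, which creates a namespace for an owner along with the RBAC bindings that scope them to it. A minimal sketch, with a placeholder owner, looks like:

    # profile.yaml -- a minimal sketch of a Kubeflow Profile.
    apiVersion: kubeflow.org/v1
    kind: Profile
    metadata:
      name: team-alpha          # also becomes the namespace name
    spec:
      owner:
        kind: User
        name: user@example.com  # identity asserted by your OIDC provider

Once a Profile like this is applied, the web apps only surface resources from the namespaces the logged-in user has been granted access to.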

All of the above result in a platform that allows end users to perform all of their workflows at scale by leveraging the distributed nature of Kubernetes, while at the same time allowing administrators to enforce production-level configurations and policies.

What's next

The next big area of work in Kubeflow right now is Metadata, and how to integrate our components by having them share a common metadata scheme. At the same time, there's a lot of work underway in many of Kubeflow's components, from supporting new search algorithms in Kubeflow's AutoML, like Neural Architecture Search, to unifying our different distributed training operators.

Hopefully this blog has demonstrated that we have made a lot of progress since the initial 1.0 release and that Kubeflow is now ready for production. In fact, you don't have to take my word for it: check this growing list of companies that are succeeding with Kubeflow in production.

Arrikto's Commitment to the Kubeflow Community

At Arrikto, we are active members of the Kubeflow community, having made significant contributions to the latest 1.3 release. Our projects/products include:

  • MiniKF, a production-ready, local Kubeflow deployment that installs in minutes, and understands how to downscale your infrastructure.
  • Enterprise Kubeflow (EKF), a complete machine learning operations platform that simplifies, accelerates, and secures the machine learning model development lifecycle with Kubeflow.
  • Rok is a data management solution for Kubeflow. Rok's built-in Kubeflow integration simplifies operations and increases performance, while enabling data versioning, packaging, and secure sharing across teams and cloud boundaries.
  • Kale, a workflow tool for Kubeflow, which orchestrates all of Kubeflow's components seamlessly.


To hear more about cloud native topics, join the Cloud Native Computing Foundation and the cloud native community at KubeCon+CloudNativeCon North America 2021, October 11-15, 2021.

ABOUT THE AUTHOR

Kimonas Sotirchos, Full Stack Engineer, Arrikto

Kimonas is a Software Engineer interested in cloud native applications and distributed systems. He loves to work on open source projects, collaborating and developing innovative software as part of a community. Kimonas has been a core Kubeflow contributor for more than two years, with expertise in web apps and the UX required to allow Data Scientists to run their workflows seamlessly on top of Kubernetes.

Published Wednesday, September 22, 2021 7:34 AM by David Marshall