At its core, Kubeflow offers an end-to-end Machine Learning (ML) stack orchestration toolkit for building, deploying, scaling, and managing complex ML workflows on top of Kubernetes. It consists of multiple components that handle different parts of a typical ML workflow, such as exploration with Jupyter Notebooks, distributed training, ML pipelines, hyperparameter tuning, and inference, among others.
Kubeflow 1.0 was announced back in March 2020, with a main focus on graduating the core set of applications needed to develop, build, train, and deploy models efficiently on Kubernetes. While this was a huge step for Kubeflow, as a community of developers we knew there was still room for improvement before we could claim the project was ready for production at scale.
But we've come a long way since then, and the project has undergone major changes since the initial 1.0 release. These changes, spanning the robustness of the code and how the project is developed and organized, have drastically improved the end user experience and strengthened our governance processes. In this article you'll get a tour of the project's exciting updates since 1.0, plus, as the 1.4 Release Manager, why I believe Kubeflow is ready for production.
Organization based on Working Groups
After version 1.0 was released, the Kubeflow community decided to move to a governance model very similar to that of Kubernetes. The project was broken down into distinct Working Groups (WGs) and SIGs, each with its own chairs, tech leads, calendars, and so on. Working Groups became the primary vehicle for technical decision-making in the project, and they regularly report on their work to the wider project in community channels.
With this transition, the different components, and GitHub repos, of Kubeflow are now maintained by the corresponding Working Groups. Kubeflow currently has the following Working Groups:
- AutoML: for hyperparameter tuning and neural architecture search
- Deployment: for tooling that deploys Kubeflow on different platforms
- Manifests: for providing simple-to-install manifests of all the Kubeflow components
- Notebooks: for notebook environments and the web apps around them
- Pipelines: for building and running ML pipelines
- Serving: for doing model inference
- Training: for performing distributed training on Kubernetes
This arrangement has allowed Kubeflow to be governed in an open fashion, where contributors can collaborate more efficiently and move the project forward together with the Working Group leads.
Simplified installation
One of the main pain points we've heard from users is that Kubeflow was really complicated to install. This was indeed the case, and it was mostly caused by kfctl, a CLI tool that handled installing, uninstalling, and upgrading Kubeflow installations. While the premise of this tool was promising, the implementation didn't live up to expectations. The CLI had custom logic for parsing and generating the YAML files, which in many cases had bugs or did not support features of kustomize. Later on, an effort to support kustomize v3 further complicated the structure of the manifests.
All of the above led to a user experience that left a lot to be desired.
But the Manifests WG folks took this feedback very seriously and worked on drastically simplifying the structure of the manifests. In the 1.3 release of Kubeflow, the Manifests WG decided to directly provide kustomize YAMLs for the different components of Kubeflow.
This approach removed the dependency on kfctl and treated the different components of Kubeflow as simple kustomize packages. This decision also came with some powerful benefits, necessary for handling manifests effectively:
- Users can now declaratively apply configurations for their different components.
- It reinforces GitOps patterns, since users can git add/commit/apply their changes as overlays on top of the base upstream manifests (see the sketch after this list).
- Users can easily mix and match different components in their Kubeflow installation.
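To make the overlay pattern concrete, here is a minimal sketch of what such a GitOps-managed kustomization could look like. The base path and the patch file name below are hypothetical; the exact directory layout depends on which release of kubeflow/manifests you check out:

```yaml
# kustomization.yaml -- a hypothetical overlay kept in your own Git repo.
# It consumes an upstream Kubeflow component as a base and layers local
# changes on top, so an upgrade is mostly a matter of bumping the base.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  # Base taken from a local checkout of kubeflow/manifests
  # (the exact path varies between releases).
  - ../manifests/apps/jupyter/jupyter-web-app/upstream/overlays/istio

patchesStrategicMerge:
  # Environment-specific tweaks, versioned in Git like any other change.
  - my-patches.yaml
```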
For 1.3, the installation process was simplified down to a single kustomize command.
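As a reference point, the kubeflow/manifests README documents a one-liner along these lines; treat this as a sketch of the idea and check the README of your release for the exact command:

```sh
# Build the top-level example kustomization and apply it, retrying until
# the CRDs and webhooks it creates are ready to admit the rest of the
# resources (ordering issues resolve themselves on retry).
while ! kustomize build example | kubectl apply -f -; do
  echo "Retrying to apply resources"
  sleep 10
done
```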
Established release process
Kubeflow 1.3 was the first release that was completely driven by Kubeflow's new Working Group structure. The release process, spearheaded by Yannis Zarkadas as the release manager, required efficient communication between the different Working Groups in order to be completed successfully. While we hit some bumps in the road, in the end we managed, as a community, to overcome them and deliver our first coordinated release!
The community even used the 1.3 release as an opportunity to further improve its processes. After the completion of the release we held a very productive retrospective in which all the stakeholders of the release, including the distribution owners, provided their feedback on the process. We took this invaluable feedback and converted it into a first iteration of a release handbook, capturing the good parts and aiming to mitigate the biggest hardships we faced.
With this handbook in our arsenal we started the 1.4 release, this time with a well-defined process. The release process includes distinct phases, like Feature Freeze and Documentation Update, to ensure the quality of the final code by allowing ample room for testing and stabilization. The process also allocates a specific time frame for Kubeflow distributions to battle-test the manifests before we release the final version.
All of the above led to a more robust and production-ready Kubeflow 1.4 release that has gotten a green light from both the Working Group leads and Kubeflow's distributions.
Web UIs for core ML workflows
In both the 1.3 and 1.4 releases there was a huge focus on the UX provided by the platform. We strongly believe that even the most robust and powerful API falls short if the corresponding UI does not allow users to interact effectively with the underlying infrastructure.
With this mindset, we aimed to provide intuitive web apps for the more advanced Kubeflow APIs, like AutoML, model serving, and volume management, to make it seamless for users to interact with the platform's mechanisms.
Throughout these two releases we have added a total of four new web applications that manage different parts of a typical ML workflow. Specifically, we've included web apps for managing:
- PVCs, for holding Data Scientists' datasets in the cluster
- TensorBoard instances, for visualizations
- AutoML experiments, with graphs for visualizing the performance of different models
- Model servers, which are used for inference
These web applications
aim to allow users to launch and manage the different parts of their ML
workflows, while hiding a lot of the underlying K8s concepts that are not
relevant to Data Scientists.
At the same time, the web apps comply with Kubeflow's security policies and allow for isolation between teams by respecting Kubernetes RBAC and using namespace isolation. This goes hand in hand with a robust authentication system, which cluster admins can mix and match with their preferred OIDC provider.
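To give a sense of how this isolation is expressed, Kubeflow models each team's namespace with a Profile custom resource; the name and owner email below are placeholder values:

```yaml
# A Kubeflow Profile creates a namespace for a user or team and sets up
# the RBAC bindings so that only the listed owner (as asserted by the
# OIDC provider) can access the resources the web apps create there.
apiVersion: kubeflow.org/v1
kind: Profile
metadata:
  name: team-data-science    # also becomes the namespace name
spec:
  owner:
    kind: User
    name: jane@example.com   # placeholder identity
```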
All of the above results in a platform that allows end users to perform all of their workflows at scale by leveraging the distributed nature of Kubernetes, while at the same time allowing administrators to enforce production-level configurations and policies.
What's next
The next big area of activity in Kubeflow right now is metadata, and how to integrate our components by having a common metadata scheme. At the same time, there's a lot of work underway in many of Kubeflow's components, from supporting new search algorithms in Kubeflow's AutoML, like neural architecture search, to unifying our different distributed training operators.
Hopefully this blog has demonstrated that we have made a lot of progress since the initial 1.0 release and that Kubeflow is now ready for production. In fact, you don't have to take my word for it: check out this growing list of companies that are succeeding with Kubeflow in production.
Arrikto's Commitment to the Kubeflow Community
At Arrikto, we are active members of the Kubeflow community, having made significant contributions to the latest 1.3 release. Our projects/products include:
- MiniKF, a production-ready, local Kubeflow deployment that installs in minutes and understands how to downscale your infrastructure.
- Enterprise Kubeflow (EKF), a complete machine learning operations platform that simplifies, accelerates, and secures the machine learning model development lifecycle with Kubeflow.
- Rok, a data management solution for Kubeflow. Rok's built-in Kubeflow integration simplifies operations and increases performance, while enabling data versioning, packaging, and secure sharing across teams and cloud boundaries.
- Kale, a workflow tool for Kubeflow, which orchestrates all of Kubeflow's components seamlessly.
To hear more about cloud native topics, join the Cloud Native Computing Foundation and the cloud native community at KubeCon + CloudNativeCon North America 2021, October 11-15, 2021.
ABOUT THE AUTHOR
Kimonas Sotirchos, Full Stack Engineer, Arrikto
Kimonas is a Software Engineer interested in cloud native applications and distributed systems. He loves working on open source projects and collaborating to develop innovative software as part of a community. Kimonas has been a core Kubeflow contributor for more than two years, with expertise around web apps and the UX required to allow Data Scientists to run their workflows seamlessly on top of Kubernetes.