As high performance computing (HPC) teams move more workloads to
cloud-native environments to take advantage of scale and agility, orchestration
of jobs across on-premises and cloud environments slows progress. CIQ Fuzzball Federate, unveiled today in early
access at SC24, addresses these challenges and enables
researchers to use a comprehensive management platform to define and execute
important HPC, artificial intelligence (AI) and machine learning (ML) workloads
across disparate and disconnected resources.
For example, with Fuzzball Federate, researchers can now define
and deliver workloads to on-premises systems and then connect them with AWS
clusters to help scale them in the cloud without modification. Conversely,
workloads can be prototyped in the cloud before the organization commits to a
capital expenditure for local, production resources. Either way, Fuzzball
ensures workflows execute repeatably, reliably and performantly, regardless of
the underlying infrastructure.
"With Fuzzball, researchers no longer need a Ph.D. in
infrastructure to manage their complex HPC and AI/ML workloads in hybrid
environments," said Gregory Kurtzer, founder and CEO of CIQ. "Fuzzball Federate
is the third leg of the stool within the Fuzzball ecosystem, furthering our
goal of delivering the most comprehensive and complete performance computing
platform for research institutions and enterprises alike. We're excited to
provide the first glimpses of Fuzzball Federate at SC24 in Atlanta, and we invite
all attendees to come take a look and give us your feedback."
Fuzzball: Substrate, Orchestrate and Federate
CIQ's Fuzzball, first released in August 2023, is a modern,
performance-intensive compute platform that simplifies the creation and
deployment of complex HPC and AI/ML workloads. Running on top of Kubernetes, it
is API based and provides an easy-to-use graphical interface to automate the
provisioning and management of the necessary infrastructure to run these jobs.
The infrastructure management layer of individual Fuzzball
clusters has two main components: Fuzzball Substrate, which delivers a custom
container runtime and resource manager, and Fuzzball Orchestrate, which manages
and schedules complex, multi-step workloads and data ingress and egress.
Today at SC24, CIQ is unveiling the third component: Fuzzball
Federate. It works with Substrate and Orchestrate to unify compute resources
across on-prem clusters and cloud computing regions and to provide seamless
access to and management of those resources.
In a federated Fuzzball environment, users define and submit
workflows with the same web user interface and command-line interface they
would use in a single Orchestrate deployment. However, where workflows
submitted directly to an Orchestrate cluster may run only on the resources
available to that single cluster, workflows submitted to a Federate cluster may
run on any of the Orchestrate clusters joined to the federation. These
Orchestrate clusters may be dynamically provisioned cloud resources (e.g., running
compute jobs on AWS EC2) or local, on-prem compute clusters.
Federate evaluates the CPU, memory, accelerator and storage
requirements of the workflow against the resources available in each attached
Orchestrate cluster and dispatches the workflow to an appropriate cluster for
execution. The Orchestrate cluster then provisions the necessary resources (in
cloud environments) and dispatches individual compute jobs via Substrate.
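The matching step described above can be illustrated with a minimal sketch. It is not CIQ's implementation or the Fuzzball API: the announcement does not describe the actual placement policy, so the first-fit rule used here is an assumption, and every name in the example (Resources, Cluster, select_cluster, the cluster names and sizes) is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Resources:
    """CPU, memory, accelerator and storage amounts (hypothetical units)."""
    cpus: int
    memory_gib: int
    accelerators: int
    storage_gib: int

    def fits_within(self, available: "Resources") -> bool:
        # A requirement fits only if every dimension is covered.
        return (
            self.cpus <= available.cpus
            and self.memory_gib <= available.memory_gib
            and self.accelerators <= available.accelerators
            and self.storage_gib <= available.storage_gib
        )


@dataclass(frozen=True)
class Cluster:
    """One Orchestrate cluster joined to the federation (hypothetical model)."""
    name: str
    available: Resources


def select_cluster(
    required: Resources, clusters: Sequence[Cluster]
) -> Optional[Cluster]:
    """Return the first cluster, in list order, that covers the requirements.

    First-fit is an assumed policy chosen for simplicity; returns None
    when no cluster in the federation can run the workflow.
    """
    for cluster in clusters:
        if required.fits_within(cluster.available):
            return cluster
    return None


# Hypothetical federation: one on-prem cluster without accelerators
# and one cloud cluster with accelerators.
federation = [
    Cluster("onprem", Resources(cpus=64, memory_gib=256, accelerators=0, storage_gib=2000)),
    Cluster("cloud", Resources(cpus=128, memory_gib=512, accelerators=8, storage_gib=10000)),
]

# A workflow that needs four accelerators cannot run on the on-prem cluster.
workflow = Resources(cpus=32, memory_gib=128, accelerators=4, storage_gib=500)
chosen = select_cluster(workflow, federation)
print(chosen.name if chosen else "no cluster can run this workflow")
```

In this sketch the workflow lands on the cloud cluster because the on-prem cluster offers no accelerators. A real scheduler would likely also weigh factors such as cost, data locality and queue depth, none of which are modeled here.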
Single-cluster deployments of Orchestrate are still supported, and
an existing Orchestrate deployment can be joined with additional deployments in
a federation at any time.