MLCommons
announced new results for its industry-standard MLPerf Inference v4.1
benchmark suite, which delivers machine learning (ML) system performance
benchmarking in an architecture-neutral, representative, and
reproducible manner. This release includes first-time results for a new
benchmark based on a mixture of experts (MoE) model architecture. It
also presents new findings on power consumption related to inference
execution.
MLPerf Inference v4.1
The
MLPerf Inference benchmark suite, which encompasses both data center
and edge systems, is designed to measure how quickly hardware systems
can run AI and ML models across a variety of deployment scenarios. The
open-source and peer-reviewed benchmark suite creates a level playing
field for competition that drives innovation, performance, and energy
efficiency for the entire industry. It also provides critical technical
information for customers who are procuring and tuning AI systems.
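To make the idea of inference performance measurement concrete, here is a minimal, illustrative sketch that times repeated queries against a placeholder model and reports throughput and tail latency. It is not the benchmark's LoadGen harness or any of its official scenarios; the `run_inference` function, query count, and sleep-based "model" are assumptions chosen purely for illustration.

```python
# Simplified illustration of inference performance measurement (not the MLPerf LoadGen harness).
# `run_inference` is a placeholder standing in for the model/system under test.
import time
import statistics

def run_inference(query):
    time.sleep(0.002)                      # stand-in for real model execution
    return f"result for {query}"

queries = [f"query-{i}" for i in range(500)]
latencies = []

start = time.perf_counter()
for q in queries:
    t0 = time.perf_counter()
    run_inference(q)
    latencies.append(time.perf_counter() - t0)
elapsed = time.perf_counter() - start

latencies.sort()
p99 = latencies[int(0.99 * len(latencies)) - 1]
print(f"throughput: {len(queries)/elapsed:.1f} queries/s")
print(f"mean latency: {statistics.mean(latencies)*1000:.2f} ms, p99: {p99*1000:.2f} ms")
```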
The benchmark results for this round demonstrate broad industry participation and include the debut of six newly available or soon-to-be-shipped processors:
- AMD MI300x accelerator (available)
- AMD EPYC "Turin" CPU (preview)
- Google "Trillium" TPUv6e accelerator (preview)
- Intel "Granite Rapids" Xeon CPUs (preview)
- NVIDIA "Blackwell" B200 accelerator (preview)
- UntetherAI SpeedAI 240 Slim (available) and SpeedAI 240 (preview) accelerators
MLPerf
Inference v4.1 includes 964 performance results from 22 submitting
organizations: AMD, ASUSTek, Cisco Systems, Connect Tech Inc, CTuning
Foundation, Dell Technologies, Fujitsu, Giga Computing, Google Cloud,
Hewlett Packard Enterprise, Intel, Juniper Networks, KRAI, Lenovo,
Neural Magic, NVIDIA, Oracle, Quanta Cloud Technology, Red Hat,
Supermicro, Sustainable Metal Cloud, and Untether AI.
"There
is now more choice than ever in AI system technologies, and it's
heartening to see providers embracing the need for open, transparent
performance benchmarks to help stakeholders evaluate their
technologies," said Mitchelle Rasquinha, MLCommons Inference working
group co-chair.
New Mixture of Experts Benchmark
Keeping
pace with today's ever-changing AI landscape, MLPerf Inference v4.1
introduces a new benchmark to the suite: mixture of experts. MoE is an
architectural design for AI models that departs from the traditional
approach of employing a single, massive model; it instead uses a
collection of smaller "expert" models. Inference queries are directed to
a subset of the expert models to generate results. Research and
industry leaders have found that this approach can deliver accuracy equivalent to that of a single monolithic model, often with a significant performance advantage, because only a fraction of the parameters are invoked for each query.
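To illustrate the routing idea described above, here is a minimal, self-contained sketch of top-k expert routing in the style of an MoE layer. The layer sizes, the simple linear "experts," and the gating function are assumptions chosen for brevity; this is not the benchmark's reference code, only a sketch of why each query touches only a fraction of the parameters.

```python
# Illustrative sketch of top-k mixture-of-experts routing (not MLPerf reference code).
# Sizes and the simple linear "experts" are assumptions chosen for brevity.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2          # Mixtral 8x7B routes each token to 2 of its 8 experts

# Each "expert" is just a weight matrix here; real experts are full feed-forward blocks.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(tokens):                         # tokens: (n_tokens, d_model)
    scores = softmax(tokens @ router)          # the router weights every expert per token
    out = np.zeros_like(tokens)
    for i, tok in enumerate(tokens):
        chosen = np.argsort(scores[i])[-top_k:]             # only the top-k experts run
        weights = scores[i, chosen] / scores[i, chosen].sum()
        for w, e in zip(weights, chosen):
            out[i] += w * (tok @ experts[e])   # only a fraction of all parameters is touched per token
    return out

print(moe_layer(rng.standard_normal((4, d_model))).shape)   # (4, 64)
```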
The
MoE benchmark is unique and one of the most complex implemented by
MLCommons to date. It uses the open-source Mixtral 8x7B model as a
reference implementation and performs inference using datasets covering
three independent tasks: general Q&A, math problem solving, and code
generation.
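As a rough sense of what querying the reference model looks like, the sketch below prompts the publicly released Mixtral 8x7B checkpoint with one example per task type. The model id and generation settings follow the public Hugging Face release and are assumptions on our part; this is not the MLPerf reference implementation, does not use its datasets or metrics, and running it requires a machine with substantial accelerator memory.

```python
# Hedged sketch: querying the open-source Mixtral 8x7B model with prompts resembling the
# three task types (general Q&A, math, code generation). The model id and settings are
# assumptions based on the public Hugging Face release, not the MLPerf reference code.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"   # public checkpoint with 8 experts per layer
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompts = {
    "general_qa": "What causes the seasons on Earth?",
    "math":       "If a train travels 120 km in 90 minutes, what is its average speed in km/h?",
    "code_gen":   "Write a Python function that returns the n-th Fibonacci number.",
}

for task, prompt in prompts.items():
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=128)
    print(task, "->", tok.decode(output[0], skip_special_tokens=True))
```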
"When
determining to add a new benchmark, the MLPerf Inference working group
observed that many key players in the AI ecosystem are strongly
embracing MoE as part of their strategy," said Miro Hodak, MLCommons
Inference working group co-chair. "Building an industry-standard
benchmark for measuring system performance on MoE models is essential to
address this trend in AI adoption. We're proud to be the first AI
benchmark suite to include MoE tests to fill this critical information
gap."
Benchmarking Power Consumption
The
MLPerf Inference v4.1 benchmark includes 31 power consumption test
results across three submitted systems, covering both data center and
edge scenarios. These results demonstrate the continued importance of
understanding the power requirements for AI systems running inference
tasks, as power costs are a substantial portion of the overall expense
of operating AI systems.
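To make the role of a power measurement concrete, here is a small, hypothetical calculation showing how a throughput figure and an average power reading combine into efficiency numbers. All values below are invented for illustration and are not results from any v4.1 submission.

```python
# Hypothetical example: turning a throughput measurement and an average power reading
# into efficiency figures. All numbers below are made up for illustration only.
throughput_qps  = 12_000.0   # queries served per second during the measured run
avg_power_watts = 3_500.0    # average wall power of the system under test

queries_per_joule = throughput_qps / avg_power_watts            # efficiency
joules_per_query  = avg_power_watts / throughput_qps            # energy cost of one query
kwh_per_million   = joules_per_query * 1_000_000 / 3_600_000    # energy for 1M queries, in kWh

print(f"{queries_per_joule:.2f} queries/J, "
      f"{joules_per_query*1000:.1f} mJ/query, "
      f"{kwh_per_million:.2f} kWh per million queries")
```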
The Increasing Pace of AI Innovation
Today,
we are witnessing an incredible groundswell of technological advances
across the AI ecosystem, driven by a wide range of providers including
AI pioneers; large, well-established technology companies; and small
startups.
MLCommons
would especially like to welcome first-time MLPerf Inference submitters
AMD and Sustainable Metal Cloud, as well as Untether AI, which
delivered both performance and power efficiency results.
"It's
encouraging to see the breadth of technical diversity in the systems
submitted to the MLPerf Inference benchmark as vendors adopt new
techniques for optimizing system performance such as vLLM and
sparsity-aware inference," said David Kanter, Head of MLPerf at
MLCommons. "Farther down the technology stack, we were struck by the
substantial increase in unique accelerator technologies submitted to the
benchmark this time. We are excited to see that systems are now
evolving at a much faster pace - at every layer - to meet the needs of
AI. We are delighted to be a trusted provider of open, fair, and
transparent benchmarks that help stakeholders get the data they need to
make sense of the fast pace of AI innovation and drive the industry
forward."
View the Results
To view the results for MLPerf Inference v4.1, please visit the Datacenter and Edge benchmark results pages.