As artificial intelligence (AI) applications are more widely adopted, businesses are increasingly looking for the right IT infrastructure solutions to run these applications and algorithms. This infrastructure represents a significant investment, and organizations need a reliable way to compare the offerings of multiple vendors and identify the best value and performance among the systems on the market.
MLPerf has emerged as one of the best tools for making this type of evaluation. Founded by researchers and engineers from Baidu, Google, Harvard University, Stanford University, and the University of California, Berkeley, MLPerf plays a critical role by establishing shared industry benchmarks. Shawn Wu, Chief Researcher at Inspur Information, recently spoke with VMblog about MLPerf, how Inspur Information continues to excel in these benchmarks, and the importance of MLPerf going forward.
VMblog: What is MLPerf?
Shawn Wu: MLPerf is a machine learning (ML) performance benchmark that evaluates the quality, speed, and efficiency of computing nodes on representative workloads defined by standard models, tasks, and datasets. The closed division of the open-source, peer-reviewed benchmark suite provides a level playing field that drives innovation, performance, and energy efficiency for the entire industry. MLPerf is independently managed by MLCommons, an open engineering consortium, and focuses on full-system tests that stress machine learning models, software, and hardware for a broad range of applications.
VMblog: How open and diverse is the
participation in MLPerf benchmarking?
Wu: MLPerf is one of the world's most influential benchmarks for AI performance, with members from more than 50 leading global AI companies and top academic institutions, including Facebook, Google, Inspur Information, Intel, NVIDIA, and the founding organizations mentioned above.
The latest MLPerf Training v2.0 round attracted 21 global manufacturers and research institutions, a record number of participants. There were 264 submissions, a 50% increase over the previous round. The eight AI benchmarks cover today's mainstream AI scenarios, including image classification with ResNet, medical image segmentation with 3D U-Net, lightweight object detection with RetinaNet, heavyweight object detection with Mask R-CNN, speech recognition with RNN-T, natural language processing with BERT, recommendation with DLRM, and reinforcement learning with MiniGo.
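MLPerf Training scores a system by the wall-clock time it takes to train a model to a defined quality target. The following is a minimal, hypothetical PyTorch sketch of that "time-to-quality" idea, using synthetic data and a toy accuracy target; it is not the official benchmark code (the reference implementations are maintained at https://github.com/mlcommons/training).

```python
# A minimal sketch (not the official benchmark code) of MLPerf Training's
# "time-to-quality" idea: train until a quality target is reached and report
# elapsed wall-clock time. Synthetic data and a toy target are used here.
import time
import torch
import torch.nn as nn
import torchvision.models as models

def time_to_target(target_acc=0.75, max_epochs=3, steps_per_epoch=10):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = models.resnet50(num_classes=10).to(device)  # toy label space
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()

    start = time.perf_counter()
    for epoch in range(max_epochs):
        correct, total = 0, 0
        for _ in range(steps_per_epoch):  # synthetic mini-batches
            x = torch.randn(32, 3, 224, 224, device=device)
            y = torch.randint(0, 10, (32,), device=device)
            logits = model(x)
            loss = loss_fn(logits, y)
            opt.zero_grad()
            loss.backward()
            opt.step()
            correct += (logits.argmax(dim=1) == y).sum().item()
            total += y.numel()
        acc = correct / total
        print(f"epoch {epoch}: train acc {acc:.3f}")
        if acc >= target_acc:  # MLPerf scores time to a quality target
            break
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"time to target (or max epochs): {time_to_target():.1f}s")
```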
VMblog: Why is MLPerf benchmarking
important?
Wu: MLPerf AI Training and Inference benchmarks are each held twice a year to track improvements in computing performance and provide authoritative data to guide users.
MLPerf measures the performance of
training machine learning models, enabling researchers to unlock new
capabilities such as collision avoidance for
vehicles, robotics, medical radiological diagnosis, retail analytics, and many
others.
The latest results from MLPerf Training v2.0 demonstrated up to 1.8X greater
performance compared to previous results, paving the way for more capable
intelligent systems to benefit society at large. These consistent
measurements enable engineers to design reliable products and services, and
enable researchers to compare innovations and choose the best solutions to
drive the innovations of tomorrow.
VMblog: How does Inspur Information
perform in the MLPerf benchmarking?
Wu: Inspur AI servers continue to
achieve AI performance breakthroughs through comprehensive software and
hardware optimization. Compared to the MLPerf v0.5 results in 2018, Inspur AI
servers showed significant performance improvements of up to 789% for typical
8-GPU server models.
The leading performance of
Inspur AI servers in MLPerf is a result of design innovation and full-stack
optimization capabilities for AI. To address the bottleneck of intensive I/O transmission in AI training, the PCIe retimer-free design of Inspur AI servers enables high-speed interconnection between CPUs and GPUs with reduced communication delays. For high-load, multi-GPU collaborative task scheduling, data transmission between NUMA nodes and GPUs is optimized so that data I/O in training tasks runs at peak performance.
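The retimer-free PCIe topology and NUMA-aware scheduling described above are hardware and system-level designs. As a rough framework-level analogue, the hypothetical PyTorch sketch below keeps the host-to-GPU data path asynchronous with pinned (page-locked) memory, a common technique for preventing data I/O from stalling training.

```python
# Hypothetical sketch of keeping the CPU-to-GPU data path fast at the
# framework level: pinned host memory plus non-blocking copies lets
# host-to-device transfers overlap with GPU compute. (The server-level
# PCIe/NUMA optimizations described above happen below this layer.)
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader():
    # Synthetic tensors standing in for a real training set.
    data = torch.randn(1024, 3, 224, 224)
    labels = torch.randint(0, 1000, (1024,))
    return DataLoader(
        TensorDataset(data, labels),
        batch_size=64,
        num_workers=4,    # parallel host-side loading/preprocessing
        pin_memory=True,  # page-locked buffers enable asynchronous DMA
    )

def forward_batch(model, batch, device):
    x, y = batch
    # non_blocking=True only overlaps the copy when the source is pinned.
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    return model(x), y

if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = torch.nn.Conv2d(3, 8, kernel_size=3).to(device)
    batch = next(iter(make_loader()))
    out, _ = forward_batch(model, batch, device)
    print(out.shape)
```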
In terms of heat dissipation, Inspur Information takes the lead in deploying eight 500W high-end NVIDIA A100 Tensor Core GPUs in a 4U space, with support for both air cooling and liquid cooling.
Inspur AI servers continue to optimize pre-training data processing
performance, and adopt combined optimization strategies to maximize AI model
training performance.
Among the closed division
benchmarks for single-node systems, Inspur Information is consistently the top
performer in natural language processing with BERT, recommendation with DLRM,
and speech recognition with RNN-T. For mainstream high-end AI servers equipped
with eight NVIDIA A100 Tensor Core GPUs, Inspur Information AI servers were top
ranked in five tasks (BERT, DLRM, RNN-T, ResNet and Mask R-CNN).
VMblog: What is the BERT model?
Wu: Massive pre-trained models based on the Transformer neural network architecture have led to the development of a new generation of AI algorithms. The BERT model in the MLPerf benchmarks is based on the Transformer architecture. Transformer's concise, stackable architecture makes it possible to train massive models with enormous parameter counts. This has led to significant improvements in large-model algorithms, but it also places higher demands on processing performance, communication interconnection, I/O performance, parallel scaling, topology, and heat dissipation in AI systems.
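To make the "stackable" point concrete, the hypothetical sketch below builds an encoder from N identical Transformer layers and counts parameters; with BERT-large-like assumptions (24 layers, hidden size 1024, 16 heads, 4096-wide feed-forward, ~30k-token vocabulary) it lands near the roughly 330 million parameters cited below. Exact counts depend on embedding and output details omitted here.

```python
# Hypothetical sketch: a Transformer encoder scales by stacking identical
# layers, so parameter count grows roughly linearly with depth. The defaults
# below are BERT-large-like assumptions, not MLPerf's exact configuration.
import torch.nn as nn

def build_encoder(hidden=1024, layers=24, heads=16, ffn=4096, vocab=30522):
    embed = nn.Embedding(vocab, hidden)  # token embeddings only, for brevity
    layer = nn.TransformerEncoderLayer(
        d_model=hidden, nhead=heads, dim_feedforward=ffn, batch_first=True
    )
    encoder = nn.TransformerEncoder(layer, num_layers=layers)  # stack N copies
    return nn.ModuleDict({"embed": embed, "encoder": encoder})

if __name__ == "__main__":
    model = build_encoder()
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{n_params / 1e6:.0f}M parameters")  # ~334M with these settings
```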
In the BERT benchmark, Inspur Information AI servers further improved BERT training performance using methods including optimized data preprocessing, improved dense-parameter communication between NVIDIA GPUs, and automatic hyperparameter optimization. Inspur Information AI servers can complete training of the roughly 330-million-parameter BERT model in just 15.869 minutes on 2,850,176 samples from the Wikipedia dataset, a performance improvement of 309% (49.01 / 15.869 ≈ 3.09) over the top result of 49.01 minutes in Training v0.7.
VMblog: Which
Inspur AI Servers were utilized for MLPerf Training v2.0 benchmarking?
Wu: Inspur Information NF5488A5 and NF5688M6 AI servers achieved top scores in MLPerf Training v2.0.
NF5488A5 is one of the first
servers in the world to support eight NVIDIA A100 Tensor Core GPUs with NVIDIA
NVLink technology and two AMD Milan CPUs in a 4U space. It supports both liquid
cooling and air cooling. It has won a total of 40 MLPerf titles. NF5688M6 is a scalable AI server designed and optimized for large-scale data centers. It supports eight NVIDIA A100 Tensor Core GPUs, two Intel Ice Lake CPUs, and up to 13 PCIe Gen4 I/O slots, and has won a total of 25 MLPerf titles.
VMblog: What's the future with
MLPerf and Inspur Information?
Wu: We see the relevance and importance of MLPerf continuing for the foreseeable future, as tracking and comparing improvements in computing will remain critical - especially as industries continue to adopt AI technology on a wide scale.
##