VMblog Expert Interview: Shawn Wu of Inspur Explores MLPerf, Benchmarking and Futures

As artificial intelligence (AI) applications become more widely adopted, businesses are increasingly looking for the right IT infrastructure to run these applications and algorithms. This infrastructure represents a significant investment, and organizations need a reliable way to compare offerings from multiple vendors and understand the value and performance of the systems on the market.

MLPerf has emerged as one of the best tools for making this type of evaluation. Founded by researchers and engineers from Baidu, Google, Harvard University, Stanford University, and the University of California, Berkeley, MLPerf plays a critical role by establishing standard benchmarks for the industry. Shawn Wu, Chief Researcher at Inspur Information, recently spoke with VMblog about MLPerf, how Inspur Information continues to excel in these industry benchmarks, and the importance of MLPerf going forward.

VMblog:  What is MLPerf?

Shawn Wu:  MLPerf is a machine learning (ML) performance benchmark that evaluates the quality, speed and efficiency of computing systems on typical workloads, each defined by a model, task and dataset. The closed division of the open-source, peer-reviewed benchmark suite provides a level playing field that drives innovation, performance, and energy efficiency across the entire industry. MLPerf is independently managed by MLCommons, an open engineering consortium, and focuses on full-system tests that stress machine learning models, software, and hardware for a broad range of applications.
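The headline metric in MLPerf Training is time-to-train: the wall-clock time needed to train a model until it reaches a fixed quality target. Below is a minimal Python sketch of that measurement loop; the training and evaluation functions are stubs standing in for a real reference implementation, and the 0.759 target is ResNet's top-1 accuracy threshold, used here purely for illustration.

    import time

    # Sketch of MLPerf Training's time-to-train metric: train until the
    # model hits a fixed quality target, then report elapsed wall-clock
    # time. Lower is better.
    TARGET_ACCURACY = 0.759  # e.g., ResNet's top-1 target in MLPerf

    def train_one_epoch(state):
        state["accuracy"] += 0.1  # stub standing in for real optimization

    def evaluate(state):
        return state["accuracy"]  # stub standing in for a real eval pass

    def time_to_train():
        state = {"accuracy": 0.0}
        start = time.time()
        while evaluate(state) < TARGET_ACCURACY:
            train_one_epoch(state)
        return time.time() - start  # the reported score

    print(f"time to train: {time_to_train():.6f} s")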

VMblog:  How open and diverse is the participation in MLPerf benchmarking?

Wu:  MLPerf is one of the world's most influential benchmarks for AI performance, with members from more than 50 global leading AI companies and top academic institutions, including Facebook, Google, Inspur Information, Intel, NVIDIA, and the founding institutions mentioned above.

The latest MLPerf Training v2.0 attracted 21 global manufacturers and research institutions, a new record for the number of participants. There were 264 submissions, a 50% increase over the previous round. The eight AI benchmarks cover today's mainstream AI scenarios: image classification with ResNet, medical image segmentation with 3D U-Net, light-weight object detection with RetinaNet, heavy-weight object detection with Mask R-CNN, speech recognition with RNN-T, natural language processing with BERT, recommendation with DLRM, and reinforcement learning with MiniGo.

VMblog:  Why is MLPerf benchmarking important?

Wu:  MLPerf AI Training and Inference benchmarks are each held twice a year to track improvements in computing performance and provide authoritative data to guide users.

MLPerf measures the performance of training machine learning models, enabling researchers to unlock new capabilities in areas such as vehicle collision avoidance, robotics, medical radiological diagnosis, retail analytics, and many others.

The latest results from MLPerf Training v2.0 demonstrated up to 1.8X greater performance compared to previous results, paving the way for more capable intelligent systems to benefit society at large. These consistent measurements enable engineers to design reliable products and services, and enable researchers to compare innovations and choose the best solutions to drive the breakthroughs of tomorrow.

VMblog:  How does Inspur Information perform in the MLPerf benchmarking?

Wu:  Inspur AI servers continue to achieve AI performance breakthroughs through comprehensive software and hardware optimization. Compared to the MLPerf v0.5 results in 2018, Inspur AI servers have shown performance improvements of up to 789% for typical 8-GPU server models.

The leading performance of Inspur AI servers in MLPerf is a result of design innovation and full-stack optimization capabilities for AI. To address the bottleneck of intensive I/O transmission in AI training, the PCIe retimer-free design of Inspur AI servers allows high-speed interconnection between CPUs and GPUs with reduced communication delays. For high-load, multi-GPU collaborative task scheduling, data transmission between NUMA nodes and GPUs is optimized to keep data I/O in training tasks at peak performance.
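One common way to implement this kind of NUMA-aware placement on Linux is to pin each GPU's host-side data-loading workers to CPUs on that GPU's local NUMA node. Here is a minimal Python sketch; the CPU and GPU-to-node mapping below is an assumed topology for illustration only (on a real system it can be read from nvidia-smi topo -m or sysfs), not Inspur's actual server layout.

    import os

    # Assumed topology (illustration only): CPUs 0-31 on NUMA node 0,
    # CPUs 32-63 on node 1; GPUs 0-3 attached to node 0, GPUs 4-7 to node 1.
    NUMA_LOCAL_CPUS = {0: set(range(0, 32)), 1: set(range(32, 64))}
    GPU_TO_NUMA = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}

    def pin_to_gpu_local_node(gpu_id):
        """Restrict this process to CPUs on the GPU's local NUMA node so
        host-side preprocessing never crosses the inter-socket link."""
        wanted = NUMA_LOCAL_CPUS[GPU_TO_NUMA[gpu_id]]
        available = os.sched_getaffinity(0)  # CPUs we may legally use
        os.sched_setaffinity(0, wanted & available or available)  # Linux-only

    pin_to_gpu_local_node(0)
    print("worker CPUs:", sorted(os.sched_getaffinity(0)))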

In terms of heat dissipation, Inspur Information takes the lead in deploying eight 500W high-end NVIDIA A100 Tensor Core GPUs in a 4U space, with support for both air cooling and liquid cooling. Inspur AI servers continue to optimize pre-training data processing performance, and adopt combined optimization strategies to maximize AI model training performance.

Among the closed division benchmarks for single-node systems, Inspur Information is consistently the top performer in natural language processing with BERT, recommendation with DLRM, and speech recognition with RNN-T. For mainstream high-end AI servers equipped with eight NVIDIA A100 Tensor Core GPUs, Inspur Information AI servers were top ranked in five tasks (BERT, DLRM, RNN-T, ResNet and Mask R-CNN).

VMblog:  What is the BERT model?

Wu:  Massive pre-trained models based on the Transformer neural network architecture have driven the development of a new generation of AI algorithms. The BERT model in the MLPerf benchmarks is based on the Transformer architecture. The Transformer's concise, stackable architecture makes it possible to train massive models with huge parameter counts. This has led to a significant improvement in large-model algorithms, but places higher demands on processing performance, communication interconnects, I/O performance, parallel scaling, topology and heat dissipation in AI systems.
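To make the "stackable" point concrete, here is a minimal PyTorch sketch that builds a BERT-Base-sized encoder by repeating one identical Transformer layer. The dimensions approximate BERT-Base (12 layers, hidden size 768, 12 attention heads); the exact configuration of the MLPerf model lives in the reference implementation.

    import torch.nn as nn

    # One encoder block; depth and width are the only knobs needed to
    # scale this same design up to much larger models.
    layer = nn.TransformerEncoderLayer(d_model=768, nhead=12,
                                       dim_feedforward=3072)
    encoder = nn.TransformerEncoder(layer, num_layers=12)  # stack 12 copies

    params = sum(p.numel() for p in encoder.parameters())
    print(f"{params / 1e6:.0f}M encoder parameters")
    # ~85M here; BERT-Base totals ~110M once embeddings are included.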

In the BERT benchmark, Inspur Information AI servers further improved BERT training performance using methods including optimized data preprocessing, improved dense-parameter communication between NVIDIA GPUs, and automatic hyperparameter optimization. Inspur Information AI servers can complete training of the approximately 330-million-parameter BERT model in just 15.869 minutes using 2,850,176 samples from the Wikipedia dataset, 3.09 times the top performance of 49.01 minutes in Training v0.7.
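That speedup is simply the ratio of the two training times:

    49.01 min / 15.869 min ≈ 3.09

i.e., roughly 3.09x, or 309% of the v0.7 performance.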

VMblog:  Which Inspur AI Servers were utilized for MLPerf Training v2.0 benchmarking?

Wu:  Inspur Information NF5488A5 and NF5688M6 AI servers achieved top scores in MLPerf Training v2.0.

NF5488A5 is one of the first servers in the world to support eight NVIDIA A100 Tensor Core GPUs with NVIDIA NVLink technology and two AMD Milan CPUs in a 4U space. It supports both liquid cooling and air cooling. It has won a total of 40 MLPerf titles. NF5688M6 is a scalable AI server designed for large-scale data center optimization. It supports eight NVIDIA A100 Tensor Core GPUs and two Intel Ice Lake CPUs, up to 13 PCIe Gen4 I/O, and has won a total of 25 MLPerf titles.

VMblog:  What's the future with MLPerf and Inspur Information?

Wu:  We see the relevance and importance of MLPerf continuing for the foreseeable future, as tracking and comparing improvements in computing performance will remain critical - especially as industries adopt AI technology on a wide scale.

##

Published Friday, August 12, 2022 7:30 AM by David Marshall