VMblog's Expert Interviews: Univa Leverages AWS to Deploy More than One Million Cores in a Single Univa Grid Engine Cluster


A number of interesting announcements and innovations are coming out of the ISC High Performance 2018 show this week in Frankfurt, Germany. One of the exhibitors, Univa, a provider of on-premises and hybrid cloud workload management solutions for enterprise HPC customers, is showing how it has leveraged AWS to deploy more than one million cores in a single Univa Grid Engine cluster, showcasing the advantages of running large-scale electronic design automation (EDA) workloads in the cloud. To find out more, I caught up with Rob Lalonde, vice president and general manager of Navops at Univa.

VMblog:  What are you demonstrating at the show?

Rob Lalonde:  Univa is focusing on extreme-scale automation by deploying 1,015,022 cores in a single Univa Grid Engine cluster to showcase the advantages of running large-scale workloads in the cloud. The cluster was built in approximately 2.5 hours using Navops Launch automation and comprised more than 55,000 AWS instances across 3 availability zones and 16 different instance types. It leveraged AWS Spot Fleet technology to maximize the rate at which Amazon EC2 hosts were launched, while enabling capacity and costs to be managed according to policy.
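
To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted above; because the instance count is a lower bound ("more than 55,000") and the build time is approximate, the results are estimates rather than measured values. The function and variable names are illustrative and do not come from Univa or AWS tooling.

```python
# Figures quoted in the interview; instance count and build time are approximate.
TOTAL_CORES = 1_015_022
INSTANCES = 55_000   # "more than 55,000", so treated here as a lower bound
BUILD_HOURS = 2.5    # "approximately 2.5 hours"


def provisioning_summary(cores, instances, hours):
    """Return average density and launch rates implied by the quoted figures."""
    minutes = hours * 60
    return {
        "cores_per_instance": cores / instances,
        "instances_per_minute": instances / minutes,
        "cores_per_minute": cores / minutes,
    }


summary = provisioning_summary(TOTAL_CORES, INSTANCES, BUILD_HOURS)
for name, value in summary.items():
    print(f"{name}: {value:,.1f}")
```

Under these assumptions, the cluster averaged somewhat more than 18 cores per instance and several thousand cores brought online per minute over the build window.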

VMblog:  This is an interesting use of AWS for HPC. What is the significance?

Lalonde:  It's no surprise that many large organizations have an insatiable appetite for computing power. Yet with large clusters, the challenge is not just building the cluster, but building it quickly and reliably using full automation. Our Navops Launch solution can provision and manage both virtual and bare-metal environments and includes a cloud-specific adapter for Amazon EC2. Likewise, Navops Launch policy automation allows organizations to dynamically create, scale, and tear down cloud-based infrastructure in response to changing workload demand.

VMblog:  Are million-core clusters something unique to the HPC industry?

Lalonde:  Million-core clusters are not entirely unique to the industry, but a cluster of this scale that runs with a single cluster master is a first. There are only four clusters on the global Top500 that feature more than 1,000,000 cores, and all of these owe their large core counts to GPUs or many-core processor designs. To give you some perspective, the recently announced ORNL Summit Supercomputer (regarded as the world's largest) has 202,750 Power9 cores (excluding cores on the Nvidia Volta GPUs). Conventional processor core counts are more representative of the scale of the provisioning and workload management challenges in fields like EDA, life sciences, or financial services.

VMblog:  In what types of industries are you seeing demand for such compute-intensive workloads?

Lalonde:  Naturally, the markets that demand extensive computer simulation in order to build quality products and get to market faster include the life sciences, deep learning, and semiconductor design industries. Enterprises compete in part based on the scale, performance, and cost-efficiency of their HPC environments.

In chip design, for example, device validation and regression testing are massively compute-intensive operations. Modern VLSI and system-on-a-chip designs comprise millions of gates, and any minor design change requires that millions of digital and analog simulations be re-run to ensure that the device continues to function and has not "regressed." With tape-out costs in the range of 10-15 million dollars, organizations cannot afford to make a mistake in device design and verification.

VMblog:  What else will Univa be demonstrating at the ISC show this week?

Lalonde:  On top of the million-core cluster, the Univa team has been working hard to showcase several other use cases for ISC this year.  To help illustrate to users how they can accelerate HPC cloud migration for their organization using Univa Grid Engine and Navops Launch, we are showcasing Mellanox Technologies, which demonstrates how it was able to extend its cluster to hybrid cloud in a seamless and automated fashion.  We are also discussing our extreme-scale deep learning customers like TSUBAME3 and ABCI, who are running some of the largest NVIDIA and machine learning clusters in the world.



Published Thursday, June 28, 2018 7:15 AM by David Marshall