Virtualization Technology News and Information
Telescent 2024 Predictions: The Growth of Machine Learning will Drive Data Centers to Introduce New Technologies that Speed Deployment and Improve Efficiency


Industry executives and experts share their predictions for 2024.  Read them in this 16th annual series exclusive.


By Bob Shine, vice president of marketing and product management at Telescent

In the technology world, 2023 was dominated by headlines about machine learning (ML) programs such as ChatGPT and DALL-E. Large language models (LLMs) and generative AI fascinated people with their ability to generate text from almost any prompt, and an image created by the generative AI program Midjourney even won an art contest. This interest in ML has upended hyperscalers' data center growth plans, forcing them to find ways to scale even faster than they have in the past. SemiAnalysis put this in perspective by stating that Microsoft is currently conducting the largest infrastructure buildout humanity has ever seen, with a planned $50 billion investment in AI-centric data centers in 2024.

However, deploying the Graphics Processing Units (GPUs) used for machine learning is unlike deploying the traditional Central Processing Units (CPUs) used in data centers. As an example of the challenges, Meta froze the development of a $1.5 billion data center in Alabama to redesign the center to handle new AI workloads. While each new generation of hardware offers efficiency improvements over the prior one, the rapid growth of ML and the power demands of new GPU chips are forcing data center operators to bring in new technologies that can deploy these workloads quickly while greatly improving efficiency.

New Automated Optical Switches Will Improve Ability to Scale Data Centers Quickly

Equipment in data centers is deployed in stages, allowing individual data halls to be brought online and generate revenue as soon as they are completed. However, as additional data halls are built, they need to be connected to the prior data halls in a process called re-striping. In the past, this re-connection of all the equipment was done by hand and could involve disconnecting and reconnecting thousands of fiber optic interfaces. The process was slow, and even the best technicians could have an error rate of 5%, leading to rework that slowed the process down even more.
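To get a feel for what a 5% error rate means at this scale, here is a rough back-of-the-envelope sketch (the function name and the 10,000-fiber figure are illustrative assumptions, not from the article). If each reconnection independently fails with probability p and every failure must be redone, each fiber needs a geometric number of attempts with mean 1 / (1 - p):

```python
# Hypothetical illustration of re-striping rework: assumes each fiber
# reconnection fails independently with probability error_rate, and
# failed connections are retried until they succeed.

def expected_total_attempts(connections: int, error_rate: float) -> float:
    """Expected number of connection attempts, counting rework.

    Each connection needs a geometric number of attempts with
    mean 1 / (1 - error_rate).
    """
    return connections / (1.0 - error_rate)

# A 5% error rate over 10,000 fibers adds roughly 500 extra attempts.
print(f"{expected_total_attempts(10_000, 0.05):.0f} attempts")  # -> 10526 attempts
```

Even this simplified model understates the cost, since in practice a mis-connection may not be caught until later testing, multiplying the rework.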

Google recently announced the use of optical circuit switches (OCS) to replace electrical spine switches in their network architecture. According to Google, the use of OCS reduced power consumption by over 40% while improving throughput by 30%, incurring 30% less cost and delivering 50x less downtime than the best alternative. 

New deployments of machine learning clusters will continue to grow to handle the demand for larger data sets, with cluster sizes exceeding 10,000 GPUs, and will require new interconnection technologies. 2024 will see other hyperscalers deploying novel OCS, including high-radix robotic OCS that can not only handle over 1,000 ports per system but also handle connections with 8 or 16 fibers per port, leading to systems managing 10,000 fibers or more.
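The fiber counts follow directly from the per-port figures quoted above; a quick sketch of the arithmetic (the function is illustrative, not a vendor API):

```python
# Back-of-the-envelope fiber counts for a high-radix robotic OCS,
# using the article's figures: 1,000+ ports, 8 or 16 fibers per port.

def total_fibers(ports: int, fibers_per_port: int) -> int:
    """Total fibers managed by a switch with the given port count."""
    return ports * fibers_per_port

for fpp in (8, 16):
    print(f"1,000 ports x {fpp} fibers/port = {total_fibers(1000, fpp):,} fibers")
# -> 8,000 fibers and 16,000 fibers, consistent with "10,000 or more"
```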

High Power Consumption of GPUs will Drive the Transition to Liquid Cooling

With the newest GPU chips consuming almost an order of magnitude more power than a traditional CPU, removing this heat efficiently requires new technology. While the idea of liquid cooling has been discussed for years, a prediction is that the increased deployment of kW-scale GPUs will make 2024 the year when liquid cooling is deployed at scale.

The Need for Power Efficiency to Run GPUs

With GPUs requiring significantly more power than CPUs, the power available to data centers can be a limitation. This increased need for power has even caused construction delays and restrictions in some locations such as Ashburn, VA; Dublin, Ireland; and Singapore, where the available power infrastructure can't meet the demand.

While the compute refresh cycle has been extended from 3 years to 5 years to offer better capital efficiency, each new generation of processor offers improved efficiency by reducing the power required to run a similar workload. In a recent plenary talk at the OCP conference, a speaker from Intel estimated that data centers could improve their efficiency by up to 75% by replacing older chips with new silicon while running the same workloads. So 2024 will see the deployment of more efficient chips to improve the power efficiency of data centers.
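A minimal sketch of what that efficiency claim implies for a fixed workload (the numbers and function name are hypothetical, not measurements from the talk): if a refresh cuts the energy needed for the same workload by up to 75%, the steady-state power draw shrinks by the same fraction.

```python
# Illustrative arithmetic for the refresh-cycle efficiency claim:
# replacing older chips with silicon that runs the same workload for
# a given fractional energy reduction.

def new_power_draw(old_power_kw: float, efficiency_gain: float) -> float:
    """Power needed after a refresh that cuts energy use by efficiency_gain."""
    return old_power_kw * (1.0 - efficiency_gain)

# A hypothetical 1 MW of older servers at a 75% reduction:
print(new_power_draw(1000.0, 0.75), "kW")  # -> 250.0 kW
```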

To sum up, 2024 is poised to see substantial data center growth fueled by ML workloads. However, the cost and power consumption of GPUs will propel data centers to embrace innovative technologies, including high-radix robotic optical switches and widespread adoption of liquid cooling, alongside increased pressure to reduce energy consumption in data centers.



Bob Shine 

Bob Shine is vice president of marketing and product management at Telescent. He brings more than 20 years of experience in technical marketing, product management, sales, and distribution channel management to the role. He has led the market introduction and sales of innovative optical solutions based on advanced technologies. Shine was vice president of sales and marketing at Cutera, director of marketing and product management at Daylight Solutions, and head of marketing at several optical communications startups. He holds a BS, an MS (Harvard), and a PhD in Applied Physics (Stanford).

Published Friday, December 01, 2023 7:36 AM by David Marshall