Industry executives and experts share their predictions for 2024. Read them in this 16th annual VMblog.com series exclusive.
The Growth of Machine Learning will Drive Data Centers to Introduce New Technologies that Speed Deployment and Improve Efficiency
By Bob Shine, vice president of marketing and product management at Telescent
In the technology world, 2023 was dominated by headlines about machine learning (ML) programs such as ChatGPT and DALL-E. Large language models (LLMs) and generative AI fascinated people with their ability to generate text from almost any prompt, and an image created by the generative AI program Midjourney even won an art contest. This interest in ML has upended hyperscalers' data center growth plans, forcing them to find ways to scale even faster than they have in the past. SemiAnalysis put this in perspective by noting that Microsoft is currently conducting the largest infrastructure buildout humanity has ever seen, with a planned $50 billion investment in AI-centric data centers in 2024.
However, deploying the Graphics Processing Units (GPUs) used for machine learning is unlike deploying the traditional Central Processing Units (CPUs) found in data centers. As an example of the challenges, Meta froze construction of a $1.5 billion data center in Alabama to redesign the facility to handle new AI workloads. While each new generation of hardware offers efficiency improvements over the prior one, the rapid growth of ML and the power demands of new GPU chips are forcing data center operators to bring in new technologies that can deploy these workloads quickly while greatly improving efficiency.
New Automated Optical Switches Will Improve the Ability to Scale Data Centers Quickly
Equipment in data centers is deployed in stages, allowing individual data halls to be brought online and generate revenue as soon as they are completed. However, as additional data halls are built, they need to be connected to the existing halls in a process called re-striping. In the past, this re-connection of all the equipment was done by hand and could involve disconnecting and reconnecting thousands of fiber optic interfaces. The process was slow, and even the best technician could have an error rate of 5%, leading to rework that slowed the process down even more.
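To put that error rate in perspective, here is a minimal back-of-the-envelope sketch. The 5% error rate comes from the text above; the connection count and per-connection times are hypothetical assumptions chosen purely for illustration.

```python
# Illustrative estimate of manual re-striping effort and rework.
# The 5% error rate is cited in the article; the connection count
# and per-connection times are assumptions for this sketch only.

connections = 5000            # assumed number of fiber interfaces to re-stripe
error_rate = 0.05             # best-case manual error rate cited above
minutes_per_connection = 2    # assumed hands-on time per reconnection

expected_errors = connections * error_rate
# Each error means locating, disconnecting, and redoing the link,
# so assume rework costs roughly twice the original connection time.
rework_minutes = expected_errors * minutes_per_connection * 2
total_hours = (connections * minutes_per_connection + rework_minutes) / 60

print(f"Expected mis-connections: {expected_errors:.0f}")
print(f"Added rework: {rework_minutes / 60:.0f} hours")
print(f"Total technician time: {total_hours:.0f} hours")
```

Even under these charitable assumptions, hundreds of mis-connections and days of added labor accumulate, which is the bottleneck automated optical switching removes.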
Google recently announced the use of optical circuit switches (OCS) to replace the electrical spine switches in its network architecture. According to Google, the OCS layer reduced power consumption by over 40%, improved throughput by 30%, cost 30% less, and delivered 50x less downtime than the best alternative.
New deployments of machine learning clusters will continue to grow to handle the demand for larger data sets, with cluster sizes exceeding 10,000 GPUs, and will require new interconnection technologies. In 2024, other hyperscalers will deploy novel OCS, including high-radix robotic OCS that can not only handle over 1,000 ports per system but also support connections with 8 or 16 fibers per port, leading to systems managing 10,000 fibers or more.
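The fiber-count claim follows directly from the port arithmetic. A quick sketch, using only the port and fiber-per-port figures cited above:

```python
# Fiber-count arithmetic behind a high-radix robotic OCS.
# Port and fiber-per-port figures are the ones cited in the article.

ports_per_system = 1000
for fibers_per_port in (8, 16):
    total_fibers = ports_per_system * fibers_per_port
    print(f"{ports_per_system} ports x {fibers_per_port} fibers/port "
          f"= {total_fibers:,} fibers managed per system")
```

At 8 to 16 fibers per port, a 1,000-port system lands at 8,000 to 16,000 managed fibers, consistent with the 10,000-plus figure above.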
High Power Consumption of GPUs Will Drive the Transition to Liquid Cooling
With the newest GPU chips consuming almost an order of magnitude more power than a traditional CPU, removing this heat requires new, more efficient technology. While liquid cooling has been discussed for years, the prediction is that the increased deployment of kW-scale GPUs will make 2024 the year when liquid cooling is deployed at scale.
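To see why kW-scale GPUs force the issue, consider a rough rack-level heat-load estimate. Every figure below is an illustrative assumption for this sketch, not a vendor specification:

```python
# Rough rack-level heat-load comparison motivating liquid cooling.
# All figures are illustrative assumptions, not vendor specifications.

cpu_server_watts = 500      # assumed traditional dual-CPU server
gpu_watts = 700             # assumed kW-class training GPU, roughly an
                            # order of magnitude above a single CPU
gpus_per_server = 8         # common accelerator-server layout (assumption)
server_overhead_watts = 2000  # assumed CPUs, memory, NICs, fans per GPU server
servers_per_rack = 4

gpu_server_watts = gpus_per_server * gpu_watts + server_overhead_watts
rack_kw = servers_per_rack * gpu_server_watts / 1000

print(f"One CPU server: {cpu_server_watts / 1000:.1f} kW")
print(f"One GPU server: {gpu_server_watts / 1000:.1f} kW")
print(f"Rack of {servers_per_rack} GPU servers: {rack_kw:.1f} kW")
# Air cooling is typically comfortable only up to roughly 15-20 kW
# per rack (assumption), so racks in this range push toward liquid.
```

A rack in the tens of kilowatts sits well beyond what air cooling comfortably handles, which is exactly the pressure behind the liquid cooling transition.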
The Need for Power Efficiency to Run GPUs
With GPUs requiring significantly more power than CPUs, the power available to data centers can be a limitation. This increased need for power has even caused construction delays and restrictions in locations such as Ashburn, VA; Dublin, Ireland; and Singapore, where the available power infrastructure can't meet the demand.
While the compute refresh cycle has been extended from three years to five to improve capital efficiency, each new generation of processor reduces the power required to run a similar workload. In a recent plenary talk at the OCP conference, a speaker from Intel estimated that data centers could improve their efficiency by up to 75% by replacing older chips with new silicon while running the same workloads. As a result, 2024 will see the deployment of more efficient chips to improve the power efficiency of data centers.
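As a worked example of that claim, here is what the "up to 75%" figure means when taken at face value and applied to a hypothetical fleet:

```python
# Worked example of the "up to 75% efficiency improvement" claim:
# the same workload on new silicon at one quarter of the power.
# The baseline wattage is a hypothetical figure for illustration.

old_fleet_kw = 1000        # assumed power draw of an aging server fleet
efficiency_gain = 0.75     # upper-bound improvement cited from the OCP talk

new_fleet_kw = old_fleet_kw * (1 - efficiency_gain)
print(f"Same workload after refresh: {new_fleet_kw:.0f} kW "
      f"(saving {old_fleet_kw - new_fleet_kw:.0f} kW)")
```

In a power-constrained market like those listed above, recovering three quarters of a fleet's power budget is often the difference between being able to grow and being blocked.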
To sum up, 2024 is poised to see substantial data center growth fueled by ML workloads. The associated cost and power consumption of GPUs will propel data centers to embrace innovative technologies, including high-radix robotic optical switches and widespread liquid cooling, while increasing the pressure to reduce overall energy consumption.
##
ABOUT THE AUTHOR
Bob Shine is vice president of marketing and product management at Telescent. He brings more than 20 years of experience in technical marketing, product management, sales, and distribution channel management to the role. He has led the market introduction and sales of innovative optical solutions based on advanced technologies. Shine was vice president of sales and marketing at Cutera, director of marketing and product management at Daylight Solutions, and head of marketing at several optical communications startups. Bob has a BS, an MS (Harvard), and a PhD in Applied Physics (Stanford).