Arista Networks introduced advanced capabilities to maximize AI cluster performance and efficiency. Cluster Load Balancing (CLB) in Arista EOS maximizes AI workload performance with consistent, low-latency network flows, while Arista CloudVision Universal Network Observability (CV UNO)
now offers AI job-centric observability for enhanced troubleshooting
and rapid issue inference, ensuring job completion reliability at scale.
Powering Smart AI Networking
The Arista EOS Smart AI Suite is designed for AI-grade robustness and protection and empowers AI clusters with an innovation called Cluster Load Balancing -
a new Ethernet-based AI load balancing solution based on RDMA queue
pairs that enables high bandwidth utilization between spines and leaves.
AI clusters typically carry a small number of high-bandwidth flows. Basic
load balancing methods are often inefficient for AI workloads,
resulting in uneven traffic distribution and increased tail latency. CLB
addresses this by using RDMA-aware flow placement to ensure uniform
high performance for all flows while keeping tail latency low. CLB takes
a global approach, optimizing traffic flow in both directions,
leaf-to-spine and spine-to-leaf, ensuring balanced utilization and
consistent low latency.
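The imbalance described above can be illustrated with a minimal Python sketch. It is not Arista's CLB algorithm, which this announcement does not detail. It only contrasts static hash-based placement with a generic load-aware placement. The flow names (`qp-0` and so on), the 400 Gbps flow size, and the four uplinks are hypothetical values chosen for illustration.

```python
import zlib

def hash_placement(flows, num_uplinks):
    """Static hashing: the uplink depends only on a hash of the flow identifier."""
    load = {u: 0 for u in range(num_uplinks)}
    for flow_id, gbps in flows:
        load[zlib.crc32(flow_id.encode()) % num_uplinks] += gbps
    return load

def least_loaded_placement(flows, num_uplinks):
    """Load-aware placement: each flow goes to the currently least-loaded uplink."""
    load = {u: 0 for u in range(num_uplinks)}
    # Place the largest flows first so smaller ones can fill the gaps.
    for _flow_id, gbps in sorted(flows, key=lambda f: f[1], reverse=True):
        target = min(load, key=load.get)
        load[target] += gbps
    return load

# Hypothetical workload: eight 400 Gbps flows sharing four uplinks.
flows = [(f"qp-{i}", 400) for i in range(8)]
hashed = hash_placement(flows, 4)
balanced = least_loaded_placement(flows, 4)

print("hash-based  :", hashed)
print("least-loaded:", balanced)
```

With equal-size flows, the load-aware placement puts exactly two flows on each uplink. The hash-based result depends entirely on the flow identifiers, so with only a few large flows it can stack several on one uplink while another sits idle.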
"As Oracle continues to grow its AI infrastructure leveraging Arista
switches, we see a need for advanced load balancing techniques to help
avoid flow contentions and increase throughput in ML networks," said Jag
Brar, vice president and Distinguished Engineer, Oracle Cloud
Infrastructure. "Arista's Cluster Load Balancing feature helps do that."
Holistic AI Observability
CV UNO, the AI-driven 360° Network Observability platform
powered by Arista AVA, delivers seamless, end-to-end AI job visibility
by unifying network, system, and AI job data within the Arista Network Data Lake (NetDL).
EOS NetDL Streamer is a real-time telemetry framework that continuously
streams granular network data from Arista switches into NetDL. Unlike
traditional SNMP polling, which relies on periodic queries and can miss
critical updates, the EOS NetDL Streamer provides low-latency,
high-frequency, event-driven insights into network performance, key to
supercharging large-scale AI training and inferencing infrastructure.
Designed for AI accelerator clusters, it accelerates impact analysis,
pinpoints issues with precision, and enables rapid resolution, ensuring
job completion times are minimized. Some of the key benefits include:
- AI Job Monitoring - Unlocks a comprehensive view of AI job health
metrics, including job completion times, congestion indicators
(ECN-marked packets, PFC pause frames, packet drops), and buffer/link
utilization for real-time insights.
- Deep-Dive Analytics - Uncovers critical job-specific insights by
analyzing network devices, server NICs (e.g., PFC out-of-sync events,
RDMA errors, PCIe fatal errors), and associated flows - pinpointing
performance bottlenecks with precision.
- Flow Visualization - Harnesses the power of CV topology mapping
to gain real-time, intuitive visibility into AI job flows at microsecond
granularity - accelerating issue inference and resolution.
- Proactive Resolution - Detects anomalies early and correlates
network and compute performance within NetDL - ensuring uninterrupted,
high-efficiency AI workload execution.
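The earlier contrast between periodic SNMP polling and event-driven streaming can be illustrated with a small Python simulation. This is a generic sketch, not the EOS NetDL Streamer interface. The event timeline, the link states, and the one-second poll interval are hypothetical.

```python
def poll(events, interval_ms, end_ms):
    """Return the state visible at each poll instant.

    `events` is a list of (timestamp_ms, state) tuples sorted by timestamp.
    A poller only sees the most recent state at the moment it asks.
    """
    observed = []
    for t in range(0, end_ms + 1, interval_ms):
        state = None
        for ts, value in events:
            if ts <= t:
                state = value
        observed.append((t, state))
    return observed

def stream(events):
    """An event-driven feed delivers every state change as it happens."""
    return list(events)

# Hypothetical timeline: a link flaps for 250 ms between two polls.
events = [(0, "up"), (1200, "down"), (1450, "up")]

polled_states = [state for _, state in poll(events, interval_ms=1000, end_ms=3000)]
streamed_states = [state for _, state in stream(events)]

print("polled  :", polled_states)
print("streamed:", streamed_states)
```

In this timeline the flap starts and ends between two polls, so the polled view never contains "down", while the streamed view records both transitions.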
Arista AI Centers Driven by AVA
Arista's Etherlink AI Platforms deliver ultra-high-performance,
standards-based Ethernet systems for next-gen AI networks. Offering
800G/400G fixed, modular, and distributed platforms that are
forward-compatible with Ultra Ethernet Consortium (UEC), Etherlink
scales from small AI clusters to massive deployments with 100,000+
accelerators. Arista features the AI Analyzer, powered by Arista AVA,
which delivers high-resolution traffic data at 100-microsecond
intervals, enabling precise performance optimization and
troubleshooting. This allows network administrators to optimize
performance, quickly troubleshoot issues, and make informed decisions
for AI-driven networks. Arista AVA also powers a remote EOS AI Agent that streams
telemetry from SuperNICs or servers to NetDL, ensuring seamless network
monitoring, debugging, and QoS consistency across the entire stack.
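To illustrate why 100-microsecond resolution matters, the Python sketch below compares a one-second average with per-interval utilization on the same traffic. It is an illustration only and does not represent the AI Analyzer's data format. The 400 Gbps link speed, the byte counts, and the 5 ms burst are hypothetical.

```python
def bin_capacity_bytes(link_gbps, bin_us):
    """Bytes a link can carry in one bin: Gbps x microseconds = 1000 bits."""
    return link_gbps * bin_us * 1000 / 8

def per_bin_utilization(bytes_per_bin, link_gbps, bin_us):
    """Utilization (0.0 to 1.0) of each sampling bin."""
    capacity = bin_capacity_bytes(link_gbps, bin_us)
    return [b / capacity for b in bytes_per_bin]

def average_utilization(bytes_per_bin, link_gbps, bin_us):
    """Utilization averaged over the whole window, as a coarse counter would report."""
    capacity = bin_capacity_bytes(link_gbps, bin_us) * len(bytes_per_bin)
    return sum(bytes_per_bin) / capacity

# Hypothetical one-second window on a 400 Gbps link, sampled every 100 microseconds:
# 9,950 bins at 10% load plus a 5 ms burst (50 bins) at line rate.
LINK_GBPS, BIN_US = 400, 100
samples = [500_000] * 9_950 + [5_000_000] * 50

fine = per_bin_utilization(samples, LINK_GBPS, BIN_US)
coarse = average_utilization(samples, LINK_GBPS, BIN_US)

print(f"1 s average utilization: {coarse:.4f}")
print(f"peak 100 us utilization: {max(fine):.4f}")
```

The one-second average comes out near 10 percent, while the fine-grained view shows bins running at full line rate. That is the kind of short burst that can cause congestion without appearing in coarse counters.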
Availability
- CLB
  - Available today on 7260X3, 7280R3, 7500R3, and 7800R3 platforms.
  - Support on 7060X6 and 7060X5 platforms scheduled for Q2 2025.
  - Support for 7800R4 scheduled for 2H 2025.
- CV UNO is available today. The observability enhancements for AI are
in active customer trials, with general availability scheduled for Q2
2025.