When companies like OpenAI, Google, and Meta build their colossal AI models, they don’t just buy a few powerful GPUs. They construct massive “GPU server farms,” intricate clusters spanning dozens, hundreds, or even thousands of interconnected GPUs. In these high-stakes environments, where every millisecond of data transfer impacts training time and cost, traditional Ethernet, designed for general-purpose internet traffic, simply won’t cut it.
For years, one company has largely dominated the specialized networking required for these AI behemoths: Nvidia. But a new challenger has emerged, backed by a formidable consortium of industry giants: Ultra Ethernet. This isn’t just a faster version of your office network; it’s a meticulously engineered standard poised to redefine the future of AI infrastructure.
The Problem with “Standard” Ethernet for AI
Imagine trying to build a complex structure where every single brick needs to be delivered to hundreds of builders simultaneously and perfectly in sync. If even one delivery truck gets delayed or loses a brick, the whole project grinds to a halt. That’s the challenge for AI. Training a large language model requires thousands of GPUs to exchange mind-boggling amounts of data (think petabytes) with no dropped packets and minimal delay (latency).
Traditional Ethernet, while ubiquitous, was never designed for this lossless, highly synchronized communication pattern. It tolerates occasional packet loss gracefully, but it isn’t optimized for the intense, parallel traffic of AI training.
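To get a feel for the scale involved, here is a back-of-the-envelope sketch of how much data each GPU exchanges in a single gradient synchronization using the well-known ring all-reduce pattern. The model size, precision, and GPU count are illustrative assumptions, and the helper function is hypothetical:

```python
# Back-of-the-envelope: bytes each GPU must move per gradient
# synchronization using ring all-reduce. All figures are illustrative.

def ring_allreduce_bytes_per_gpu(model_params: int, bytes_per_param: int, num_gpus: int) -> float:
    """In ring all-reduce, each GPU sends (and receives) 2*(N-1)/N
    times the full gradient size over the course of one reduction."""
    gradient_bytes = model_params * bytes_per_param
    return 2 * (num_gpus - 1) / num_gpus * gradient_bytes

# Example: a 70B-parameter model with fp16 gradients (2 bytes each)
# synchronized across 1,024 GPUs.
traffic = ring_allreduce_bytes_per_gpu(70_000_000_000, 2, 1024)
print(f"{traffic / 1e9:.0f} GB per GPU per synchronization step")  # → 280 GB
```

Roughly 280 GB per GPU, repeated every optimizer step across the whole cluster; a single delayed or dropped packet stalls every participant in the ring, which is why lossless delivery matters so much.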
The Rise of Ultra Ethernet: A Unified Front
Recognizing the limitations and the need for an open alternative, industry titans like Microsoft, Meta, AMD, Broadcom, Cisco, Hewlett Packard Enterprise, Intel, and others formed the Ultra Ethernet Consortium (UEC) in 2023. Their mission? To develop an open, high-performance Ethernet-based solution that could rival, and ideally surpass, existing proprietary technologies in the AI and High-Performance Computing (HPC) space.
This consortium’s formation signals a clear intent: to chip away at the closed-ecosystem advantage Nvidia has built around its GPU-to-GPU and server-to-server communication fabric.
Deconstructing the AI Network: Scale-Up vs. Scale-Out
To understand where Ultra Ethernet fits in, we need to differentiate between two critical networking domains in an AI data center:
- Scale-Up Networking: This refers to the lightning-fast connections within a single server, connecting multiple GPUs directly to each other on the same motherboard or in a dense multi-GPU chassis. Think of it as the internal nervous system of a single brain.
- Scale-Out Networking: This involves connecting multiple servers (each potentially containing many GPUs) across an entire data center. This is the communication backbone that allows hundreds or thousands of brains to work together as one giant super-brain.
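A quick way to see why these two domains are engineered so differently is to compare transfer times for the same tensor over each class of link. The bandwidth and latency numbers below are illustrative assumptions, not vendor specifications:

```python
# Illustrative comparison: moving a 1 GB tensor over a scale-up link
# vs. a scale-out link. Link figures are assumptions for illustration.

def transfer_time_ms(size_bytes: float, bandwidth_gbps: float, latency_us: float) -> float:
    """Propagation latency plus serialization time, in milliseconds."""
    return latency_us / 1000 + size_bytes * 8 / (bandwidth_gbps * 1e9) * 1000

tensor = 1e9  # 1 GB
scale_up = transfer_time_ms(tensor, bandwidth_gbps=7200, latency_us=1)  # NVLink-class aggregate
scale_out = transfer_time_ms(tensor, bandwidth_gbps=400, latency_us=5)  # a 400GbE NIC
print(f"scale-up: {scale_up:.2f} ms, scale-out: {scale_out:.2f} ms")
```

Under these assumptions the scale-out path is roughly an order of magnitude slower, which is why tightly coupled GPU groups live on scale-up fabrics and the scale-out network must squeeze out every bit of efficiency it can.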
The Incumbents: Nvidia’s Dominance
Nvidia has masterfully integrated its hardware and software across both these domains:
- For Scale-Up: NVLink is Nvidia’s proprietary, high-bandwidth interconnect for direct GPU-to-GPU communication. It’s incredibly fast, allowing dozens of GPUs to act almost as one giant processor.
- For Scale-Out: InfiniBand, which came to Nvidia through its 2020 acquisition of Mellanox, is the current market leader for high-performance, lossless server-to-server connections in AI and HPC clusters. Its ability to guarantee packet delivery at low latency has made it indispensable for large-scale AI training.
Ultra Ethernet’s Two-Front War: The Open Alternative
The UEC isn’t just tackling one of Nvidia’s strongholds; they’re addressing both with a two-pronged strategy:
- Ultra Ethernet (UE) for Scale-Out: This is the primary focus of the UEC, aiming to replace InfiniBand. UE builds upon the familiar Ethernet standard but introduces radical innovations to meet AI’s demands.
- UALink for Scale-Up: While technically a separate effort, UALink (Ultra Accelerator Link) is the open industry’s answer to NVLink. It aims to provide an open, high-bandwidth, low-latency standard for GPU-to-GPU communication within a server, allowing for greater hardware interoperability.

Ultra Ethernet vs. RoCE: A Necessary Evolution
You might be asking, “Didn’t we already have Ethernet for HPC?” Yes. RoCE (RDMA over Converged Ethernet) has been around for some time, enabling Remote Direct Memory Access (RDMA) over standard Ethernet. RoCE lets servers read and write each other’s memory directly, bypassing the CPU and thereby reducing latency.
However, RoCE still depends on Ethernet flow-control mechanisms such as Priority Flow Control (PFC), which can cause head-of-line blocking, congestion spreading, and packet drops in highly oversubscribed AI networks. Ultra Ethernet is a significant evolution beyond RoCE. It introduces entirely new protocol layers and mechanisms, such as:
- UET (Ultra Ethernet Transport): A new transport layer optimized for AI workloads, offering superior congestion control and ultra-low latency.
- Packet Spraying: Smarter data distribution across multiple network paths simultaneously to maximize bandwidth and avoid bottlenecks.
- Intelligent Congestion Management: Active mechanisms that detect and prevent congestion before it impacts performance, ensuring the “lossless” delivery critical for AI.
In essence, if RoCE was an attempt to make a race car out of a sedan, Ultra Ethernet is designing a purpose-built F1 car from the ground up, sharing only the “Ethernet” badge.
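As a rough illustration of why packet spraying helps, the toy sketch below contrasts classic per-flow ECMP hashing, which pins every packet of a large flow onto a single path, with a naive per-packet rotation across equal-cost paths. All function names here are hypothetical, and real UET load balancing is considerably more sophisticated:

```python
# Toy model: per-flow ECMP hashing vs. per-packet "spraying"
# across four equal-cost paths. Illustrative only.
import hashlib
from collections import Counter

NUM_PATHS = 4

def ecmp_path(flow_id: str) -> int:
    """Per-flow hashing: every packet of a flow takes the same path."""
    digest = hashlib.sha256(flow_id.encode()).digest()
    return digest[0] % NUM_PATHS

def sprayed_path(flow_id: str, packet_seq: int) -> int:
    """Per-packet spraying: consecutive packets rotate across all paths."""
    return (ecmp_path(flow_id) + packet_seq) % NUM_PATHS

# One large "elephant" flow of 1,000 packets:
ecmp_load = Counter(ecmp_path("gpu7->gpu42") for _ in range(1000))
spray_load = Counter(sprayed_path("gpu7->gpu42", seq) for seq in range(1000))
print("ECMP:", dict(ecmp_load))    # all 1,000 packets pile onto one path
print("spray:", dict(spray_load))  # 250 packets on each of the 4 paths
```

With hashing, one elephant flow saturates a single path while three sit idle; spraying spreads the load evenly, which is precisely the bottleneck-avoidance behavior the UEC is after (at the cost of having to handle out-of-order arrival, which UET is designed to do).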
Comparative Glance: The AI Network Landscape
| Feature / Technology | NVLink (Nvidia) | UALink (Open) | InfiniBand (Nvidia) | Ultra Ethernet (Open) |
| --- | --- | --- | --- | --- |
| Domain | Scale-Up | Scale-Up | Scale-Out | Scale-Out |
| Purpose | GPU-to-GPU communication within a server/node | GPU-to-GPU communication within a server/node | Server-to-server, rack-to-rack across a data center | Server-to-server, rack-to-rack across a data center |
| Key Advantage | Ultra-low latency, extremely high bandwidth for tightly coupled GPUs. | Open standard for high-speed GPU interconnect, promoting interoperability. | Proven lossless fabric with RDMA, adaptive routing, QoS, and advanced congestion management. | Open standard; leverages the broad Ethernet ecosystem; robust congestion control; highly scalable and cost-effective. |
| Proprietary? | Yes | No | Yes | No |
| Primary Competitor | UALink | NVLink | Ultra Ethernet | InfiniBand |
| Key Innovation | Direct GPU memory access, chip-to-chip links. | Developing similar capabilities as an open standard. | Credit-based lossless flow control, RDMA, adaptive routing. | UET protocol, packet spraying, advanced congestion control, and very high port density. |
The Cost of Connection: Networking’s Share in AI Clusters
The sheer amount of high-speed networking required for a large AI cluster is staggering, and it represents a significant portion of the overall build cost. While exact percentages vary wildly based on scale, architecture, and vendor, it’s not uncommon for the networking infrastructure (including switches, network interface cards (NICs), cables, and fiber interconnects) to account for 20% to 35% or even more of the total capital expenditure for a top-tier AI supercluster.
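A toy capex calculation shows how networking can reach that kind of share. Every dollar figure below is an assumption chosen for illustration, not a quoted price:

```python
# Toy capex split for a hypothetical 16,384-GPU cluster.
# All dollar figures are illustrative assumptions, not real prices.

gpus = 16_384
gpu_cost_each = 30_000         # assumed accelerator price
server_overhead_each = 10_000  # assumed CPU, memory, storage per GPU slot
network_cost_per_gpu = 15_000  # assumed NICs, switch ports, optics, fiber

compute = gpus * (gpu_cost_each + server_overhead_each)
network = gpus * network_cost_per_gpu
share = network / (compute + network)
print(f"networking share of capex: {share:.0%}")  # → 27%
```

Even with these made-up numbers, networking lands squarely in the 20–35% band, which is real money at billion-dollar cluster scale.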
This significant cost, coupled with the desire for interoperability and choice, is a major driving force behind the Ultra Ethernet Consortium’s efforts. An open standard that can offer similar or superior performance to proprietary solutions, potentially at a lower cost, is a powerful proposition for companies investing billions in AI infrastructure.
The Future of AI Networking
The battle for the future of AI networking is heating up. Ultra Ethernet, with its promise of open standards, enhanced performance, and cost efficiency, represents a serious challenge to Nvidia’s established dominance. As AI models continue to grow exponentially, the underlying network fabric will be more critical than ever. The industry is betting that an open, collaborative approach to this challenge will ultimately foster greater innovation and accelerate the pace of AI development for everyone.
