
Every few years, a technology shift comes along that doesn’t just add load to your network — it fundamentally changes the kind of load your network needs to handle. The move to AI infrastructure is one of those shifts. And for the vast majority of enterprise IT organizations, the network they’ve built and maintained over the past decade is not equipped to handle it.

That’s not a criticism. Enterprise networks were built for the workloads that existed when they were designed: client-server traffic, internet-bound flows, SaaS connectivity, and east-west traffic between application tiers in a traditional three-tier architecture. Those patterns are well understood, and most modern enterprise networks handle them reasonably well.

AI workloads are a different problem entirely.

What Makes AI Traffic Different

The core issue comes down to traffic patterns. Traditional enterprise applications generate traffic that flows primarily north-south — from client to server, from server to internet, from branch to data center. Even microservices architectures, while more east-west in nature, involve relatively modest bandwidth between individual services and forgiving latency requirements.

GPU clusters doing AI training generate something entirely different. When a model is being trained across dozens or hundreds of GPUs, those GPUs need to communicate with each other constantly — exchanging gradients, synchronizing weights, and coordinating compute in ways that saturate high-bandwidth links with all-to-all traffic. The traffic is east-west, it’s high-volume, it’s bursty, and it’s extremely sensitive to latency and packet loss.
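To get a sense of the volume, the gradient exchange can be sketched with the standard ring all-reduce cost model, in which each GPU sends and receives 2*(N-1)/N times the gradient size per step. The model size and GPU count below are illustrative assumptions, not figures from any particular deployment:

```python
def ring_allreduce_bytes_per_gpu(param_count: int, bytes_per_param: int,
                                 num_gpus: int) -> float:
    """Ring all-reduce moves 2*(N-1)/N of the gradient size through each
    GPU's network link on every synchronization step."""
    gradient_bytes = param_count * bytes_per_param
    return 2 * (num_gpus - 1) / num_gpus * gradient_bytes

# Illustrative: a 7B-parameter model in fp16 (2 bytes/param) on 64 GPUs.
per_step = ring_allreduce_bytes_per_gpu(7_000_000_000, 2, 64)
print(f"~{per_step / 1e9:.1f} GB per GPU per step")
```

That works out to roughly 27.6 GB per GPU per step in this hypothetical; at one step per second, gradient exchange alone sustains over 220 Gb/s per GPU.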

To make that communication efficient, modern AI infrastructure relies on RDMA over Converged Ethernet (RoCEv2) — a protocol that allows GPUs to write directly into each other’s memory without CPU involvement. This dramatically reduces latency and CPU overhead. But RoCEv2 has a hard dependency: it requires a lossless Ethernet fabric.

This is where most enterprise networks fail immediately.

The Lossless Ethernet Problem

Standard Ethernet is a lossy medium. When congestion occurs, packets are dropped and retransmitted. For TCP workloads, this is generally tolerable, since TCP handles loss gracefully. For RoCEv2, packet loss is catastrophic: most RoCE NICs recover using go-back-N retransmission, so a single dropped packet forces the sender to resend everything in flight from the point of loss, cascading into latency spikes and throughput degradation across the entire training job.
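One way to see why loss is so costly: RoCE NICs typically recover with go-back-N retransmission, where one drop wastes every in-flight packet behind it. A small Python simulation makes the overhead concrete (packet counts, loss rate, and window size are illustrative assumptions):

```python
import random

def go_back_n_transmissions(num_packets: int, loss_prob: float,
                            window: int, seed: int = 0) -> int:
    """Total packets put on the wire to deliver num_packets in order, when a
    single loss forces retransmission from the lost packet onward."""
    rng = random.Random(seed)
    sent = 0
    acked = 0  # packets delivered in order so far
    while acked < num_packets:
        window_end = min(acked + window, num_packets)
        first_loss = None
        for seq in range(acked, window_end):
            sent += 1  # every in-flight packet costs wire time, even if wasted
            if first_loss is None and rng.random() < loss_prob:
                first_loss = seq  # later packets in this window are discarded
        acked = window_end if first_loss is None else first_loss
    return sent

delivered = 100_000
total = go_back_n_transmissions(delivered, loss_prob=0.001, window=64)
print(f"{total / delivered:.3f}x packets on the wire at 0.1% loss")
```

Even a 0.1% loss rate inflates wire traffic by several percent in this toy model, and real training jobs compound the damage because every GPU waits on the slowest flow.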

Building a lossless fabric requires two things most enterprise switches are not configured for:

Priority Flow Control (PFC) — a mechanism that allows a switch to signal upstream devices to pause transmission on a specific traffic class. This prevents drops at the cost of introducing back-pressure into the network.

Explicit Congestion Notification (ECN) — a complementary mechanism that marks packets as experiencing congestion before drops occur, allowing endpoints to reduce their transmission rate proactively.

These features exist on many modern switching platforms, but they are rarely enabled, almost never tuned, and frequently not understood well enough by enterprise network teams to be deployed safely. PFC misconfiguration can trigger PFC deadlocks — a condition where back-pressure propagates in a loop and the fabric grinds to a halt.
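As a mental model for how ECN behaves, here is a WRED-style marking sketch in Python; the thresholds and maximum marking probability are illustrative assumptions, not vendor defaults or tuning recommendations:

```python
import random

def ecn_mark(queue_depth_kb: float, min_kb: float = 150.0,
             max_kb: float = 1500.0, max_prob: float = 0.1,
             rng=random.random) -> bool:
    """WRED-style ECN decision: below the min threshold never mark, above
    the max threshold always mark, and in between mark with linearly
    increasing probability. Thresholds here are illustrative only."""
    if queue_depth_kb < min_kb:
        return False
    if queue_depth_kb >= max_kb:
        return True
    prob = max_prob * (queue_depth_kb - min_kb) / (max_kb - min_kb)
    return rng() < prob
```

The point of the ramp is that endpoints get an early, probabilistic signal to slow down while the queue is still shallow, so PFC pause frames become the last line of defense rather than the primary congestion control.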

Bandwidth and Latency Requirements

Beyond lossless behavior, AI workloads demand bandwidth scales that enterprise networks weren’t sized for. GPU interconnects within a single server (via NVLink) operate at hundreds of GB/s. The Ethernet fabric connecting servers in an AI cluster needs to match this — or come as close to it as the economics allow.

400 Gigabit Ethernet (400GbE) is becoming the baseline for GPU-to-Top-of-Rack connectivity in purpose-built AI clusters. 800GbE deployments are already underway at hyperscalers. Most enterprise switching environments still have significant populations of 10GbE and 25GbE server-facing ports.
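The mismatch is easy to quantify. The sketch below compares common Ethernet line rates against an assumed ~900 GB/s of aggregate NVLink bandwidth per GPU (an H100-class figure, used here purely for illustration):

```python
# NVLink figure is an illustrative assumption (H100-class, ~900 GB/s
# aggregate per GPU), not a quoted specification.
NVLINK_GBYTES_PER_S = 900

def ethernet_gbytes_per_s(gbits: int) -> float:
    """Convert an Ethernet line rate in Gb/s to GB/s of payload capacity."""
    return gbits / 8

for speed in (10, 25, 100, 400, 800):
    gbs = ethernet_gbytes_per_s(speed)
    print(f"{speed:>3} GbE = {gbs:6.2f} GB/s, "
          f"{NVLINK_GBYTES_PER_S / gbs:.0f}x below the NVLink figure")
```

Even 400GbE delivers 50 GB/s per link, an order of magnitude below intra-server GPU bandwidth, which is why clusters deploy multiple high-speed NICs per server and why 10GbE ports are simply not in the conversation.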

Latency expectations are equally demanding. Cut-through switching architectures, shallow buffer profiles for latency-sensitive traffic, and careful ECMP design to eliminate asymmetric paths are all required — and all require deliberate engineering choices that most enterprise network architects have not had reason to make until now.

A Practical Assessment Framework

If you’re evaluating whether your network can support AI workloads — even at a modest scale, such as a small GPU cluster for inference or experimentation — here are the questions to ask:

1. Does your switching hardware support PFC and ECN? Check the datasheets for your existing Top-of-Rack and aggregation switches. Most modern merchant silicon platforms (Broadcom Tomahawk and Trident families; Intel Tofino) support these features. Older ASICs may not, or may have limitations.

2. What are your server-facing port speeds? If you’re connecting GPU servers via 10GbE, you have a problem regardless of the rest of the fabric. At minimum, 100GbE server-facing links are required for meaningful GPU cluster connectivity. 400GbE is preferred for serious workloads.

3. Does your team have experience configuring lossless fabrics? This is often the hardest gap to close. PFC and ECN configuration is not intuitive, and the failure modes are severe. If your team hasn’t done this before, either plan for a learning curve or bring in someone who has.

4. Is your fabric ECMP-capable with sufficient path diversity? AI training traffic is all-to-all. You need multiple equal-cost paths between every pair of servers, and your fabric needs to hash traffic effectively across those paths to avoid hot spots.

5. Do you have visibility into microbursts? AI workloads generate extremely bursty traffic that can saturate links for milliseconds at a time — too short for standard SNMP polling to catch, but long enough to cause congestion and loss. You need telemetry solutions capable of detecting microsecond-level events.
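A minimal sketch of microburst detection, assuming nothing more than a readable, monotonically increasing byte counter (on Linux, for example, /sys/class/net/<ifname>/statistics/tx_bytes); production telemetry would use hardware-timestamped streaming rather than software polling:

```python
import time

def detect_microbursts(read_tx_bytes, link_gbps: float,
                       interval_s: float = 0.001, samples: int = 1000,
                       threshold: float = 0.9):
    """Poll a byte counter every interval_s seconds and flag intervals whose
    average utilization exceeds `threshold` of line rate. read_tx_bytes is
    any callable returning a monotonically increasing byte count."""
    capacity_bytes = link_gbps * 1e9 / 8 * interval_s  # bytes per interval
    bursts = []
    prev = read_tx_bytes()
    for i in range(samples):
        time.sleep(interval_s)
        cur = read_tx_bytes()
        util = (cur - prev) / capacity_bytes
        if util > threshold:
            bursts.append((i, util))
        prev = cur
    return bursts
```

Note what this exposes: a link averaging 20% utilization over a 5-minute SNMP poll can still spend individual milliseconds at line rate, and those are exactly the moments buffers overflow.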

What to Do About It

If your assessment reveals significant gaps, the path forward depends on how serious your AI ambitions are.

For organizations doing limited GPU-based work — a handful of servers for inference or experimentation — the most practical approach is often to build a small, isolated AI fabric rather than attempt to retrofit your existing network. Deploy a purpose-configured leaf-spine fabric for AI traffic, keep it separate from general enterprise traffic, and connect it to the rest of your environment with careful segmentation.

For organizations planning meaningful AI training deployments, a more comprehensive network redesign is warranted. This involves hardware refresh, architectural redesign, and — critically — investment in the operational knowledge to run a lossless fabric reliably.

What I’d caution against is the middle path: enabling PFC on your existing network without proper configuration, testing, and monitoring. The failure mode is a network-wide outage that affects not just your AI workload but everything else on the fabric.

The network is not the exciting part of AI infrastructure. It rarely gets the budget attention it deserves, and most AI planning conversations focus on GPU compute, storage, and software stack. But when the network doesn’t work, nothing else does either. Getting it right requires treating it as a first-class concern — not an afterthought to be addressed after the GPUs are already racked.


Alan Sukiennik is the founder of Acton Pacific Strategies, a Las Vegas-based independent infrastructure advisory firm. He has 30 years of experience in enterprise and service provider networking, including senior engineering roles at Arista Networks, F5 Networks, BlueCat Networks, and Nokia. Reach him at alan@actonpacific.com or schedule a consultation.
