ACTON PACIFIC · FIELD NOTES ARCHITECTURE
VOLUME 01 · ARCHITECTURE

Why Your Network Isn’t Ready for AI — and What to Do About It

GPU clusters don’t generate the traffic patterns enterprise networks were designed for. The architectural gap is large, and the cost of ignoring it is a cluster you paid for and can’t fully use.


The training run doesn't care that your core was sized for 2018. An H100 cluster generates traffic patterns that bear very little resemblance to the workloads enterprise networks were designed to carry — and the gap isn't subtle.

Most enterprise data center networks are still organized around a north-south principle: clients in branch offices and home networks talk to servers in the data center, and the network is sized accordingly. Three-tier hierarchical, Spanning Tree blocking some links, oversubscription ratios often past 20:1 at the core. That's a sound design for the workload it was built for. It just isn't the workload that lives on your network now.
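
To make the oversubscription figure concrete, here is a minimal sketch of how the ratio compounds along a three-tier path. The port counts and speeds are hypothetical, chosen only to illustrate the arithmetic; they are not measurements from any particular network.

```python
def oversubscription_ratio(downlink_gbps: float, uplink_gbps: float) -> float:
    """Ratio of host-facing capacity to capacity facing the next tier up."""
    if uplink_gbps <= 0:
        raise ValueError("uplink capacity must be positive")
    return downlink_gbps / uplink_gbps


# Hypothetical three-tier path: access -> aggregation -> core.
access = oversubscription_ratio(downlink_gbps=48 * 10, uplink_gbps=4 * 40)
aggregation = oversubscription_ratio(downlink_gbps=16 * 40, uplink_gbps=2 * 40)

# In the worst case (every host sending at once) the ratios multiply.
end_to_end = access * aggregation

print(f"access {access:.0f}:1, aggregation {aggregation:.0f}:1, "
      f"end to end {end_to_end:.0f}:1")
```

A 3:1 access layer feeding an 8:1 aggregation layer already yields 24:1 end to end, before Spanning Tree takes any links out of service.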

What "AI traffic" actually looks like

A model training step isn't a transaction. During the all-reduce phase of distributed training, every GPU's gradients have to be combined with those of every other GPU, so all of them send and receive at once. The traffic is east-west, bursty at line rate, and synchronized, meaning links across the fabric peak at the same instant. There's little statistical multiplexing benefit because the flows aren't statistically independent.

For any cluster running a meaningful model, the all-reduce phase saturates every cross-leaf link in the fabric for the duration of the step. If your fabric was sized assuming modest average utilization with healthy oversubscription, the math doesn't work.
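
A rough way to see why the math doesn't work is to estimate the bandwidth floor for one all-reduce. The sketch below assumes a ring all-reduce, in which each GPU sends about 2(N-1)/N times the gradient size; the article does not say which algorithm a given cluster uses, and real frameworks overlap communication with compute, so treat this as a lower bound under stated assumptions. The 14 GB gradient size and the 20:1 ratio are illustrative inputs.

```python
def ring_allreduce_bytes_per_gpu(gradient_bytes: float, gpus: int) -> float:
    """Bytes each GPU sends in a ring all-reduce (assumed algorithm)."""
    if gpus < 1:
        raise ValueError("need at least one GPU")
    return 2 * (gpus - 1) / gpus * gradient_bytes


def min_allreduce_seconds(gradient_bytes: float, gpus: int,
                          link_gbps: float, oversubscription: float = 1.0) -> float:
    """Bandwidth-only lower bound; ignores latency and compute overlap."""
    effective_bps = link_gbps * 1e9 / oversubscription
    return ring_allreduce_bytes_per_gpu(gradient_bytes, gpus) * 8 / effective_bps


# Illustrative inputs: 14 GB of gradients, 256 GPUs, 400G per GPU.
non_blocking = min_allreduce_seconds(14e9, 256, 400)
oversubscribed = min_allreduce_seconds(14e9, 256, 400, oversubscription=20)

print(f"non-blocking: {non_blocking:.2f} s, 20:1 oversubscribed: {oversubscribed:.2f} s")
```

Under these assumptions the same exchange takes roughly 0.56 s on a non-blocking fabric and about 11 s when the cross-leaf path is oversubscribed 20:1, and the GPUs are idle for the difference.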

Where it breaks first

Three failure modes, in roughly the order operators encounter them.

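Drops. Default Ethernet behavior assumes the occasional drop is acceptable because the transport will retransmit. AI workloads tolerate drops poorly: one lost packet in a gradient exchange delays the whole collective, and every GPU in the step waits on it.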

Tail latency. Because every GPU waits for the slowest exchange before the next step begins, the 99.9th percentile is the latency that actually defines training throughput. A network that meets median targets but has a fat tail can quietly halve your effective GPU utilization. The metric that matters is not the average; it's the worst flow.
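
The reason the tail dominates can be shown with one line of probability. The sketch below assumes, for simplicity, that each flow independently has a 0.1% chance of landing beyond the p99.9 latency; real tail events tend to be correlated, so this is an illustration rather than a model of any specific fabric.

```python
def p_step_hits_tail(flows_per_step: int, percentile: float = 0.999) -> float:
    """Probability that at least one flow in a step exceeds the given percentile.

    Assumes flows hit the tail independently (a simplifying assumption).
    """
    if flows_per_step < 0 or not 0.0 <= percentile <= 1.0:
        raise ValueError("invalid flow count or percentile")
    return 1.0 - percentile ** flows_per_step


for n in (1, 100, 1000, 10000):
    print(f"{n:>6} flows per step -> "
          f"{p_step_hits_tail(n):.1%} of steps see a p99.9 event")
```

With a thousand flows per step, roughly 63% of steps include at least one flow from the tail, and since the step finishes only when its slowest flow does, the "rare" latency becomes the typical one.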

Bandwidth contention with non-AI traffic. If the AI fabric shares core links with the rest of the enterprise, the rest of the enterprise becomes the noisy neighbor — or vice versa. Either way, both sides suffer.

The architectural answer

The industry has converged. The answer is a dedicated, lossless, leaf-spine fabric for the GPU domain, with a few specific properties:

  • Leaf-spine topology with full-mesh ECMP between leaves and spines, no Spanning Tree, no oversubscription on the cluster fabric.
  • Lossless behavior via PFC + ECN, tuned for the workload (not vendor defaults).
  • 400G or 800G as the new normal at the spine layer; 200G at the leaf depending on GPU density.
  • Dedicated to GPU-east-west traffic. Storage and inference traffic typically ride a separate fabric or a clearly partitioned VLAN/VRF on the same one.
  • Streaming telemetry from day one. You can't operate a lossless fabric on five-minute polling.
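
As a rough sizing aid for the first bullet, the sketch below works out how many endpoints a non-oversubscribed two-tier leaf-spine can attach. It assumes every port runs at the same speed and that each leaf splits its ports evenly between GPU-facing downlinks and spine-facing uplinks; the function name and the 64-port radix are hypothetical, not taken from any vendor's product line.

```python
def two_tier_capacity(leaf_ports: int, spine_ports: int,
                      uplinks_per_spine: int = 1) -> dict:
    """Size a 1:1 (non-oversubscribed) leaf-spine with uniform port speed."""
    down = leaf_ports // 2            # GPU-facing ports per leaf
    up = leaf_ports - down            # spine-facing ports per leaf
    if uplinks_per_spine < 1 or up % uplinks_per_spine:
        raise ValueError("uplinks must divide evenly across spines")
    spines = up // uplinks_per_spine          # every leaf reaches every spine
    leaves = spine_ports // uplinks_per_spine  # limited by spine radix
    return {
        "spines": spines,
        "leaves": leaves,
        "endpoints": leaves * down,
        "ecmp_paths_between_leaves": up,
    }


print(two_tier_capacity(leaf_ports=64, spine_ports=64))
```

With 64-port switches throughout, this comes to 32 spines, 64 leaves, and 2,048 endpoints. Past that point the design needs a third tier or higher-radix switches, which is one reason the target workload has to be pinned down first.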

The vendor landscape has caught up: Arista, Cisco Nexus, Juniper, NVIDIA Spectrum-X, and a handful of merchant-silicon-based options all support the pattern. The architectural decision is more important than the vendor choice, but the vendor choice does matter — particularly around the operating model and the telemetry primitives.

What this means for the rest of your network

The biggest mistake we see is treating "AI readiness" as a network refresh. It isn't. AI readiness is a workload design decision: where will the GPUs live, what data will they consume, and what other systems do they need to talk to?

If the GPUs sit in a colocation facility connected to your data center over a 100G interconnect, the bottleneck moves there. If they sit in your existing data center, the question becomes whether to build a parallel fabric or carve a partition. If the inference workloads are going to live in the same domain as the rest of your application traffic, that's a different design again.
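
For the colocation case, a back-of-the-envelope transfer time shows where the bottleneck lands. The 50 TB dataset size and the 90% link efficiency below are assumptions for illustration; substitute the figures from your own workload.

```python
def transfer_seconds(data_bytes: float, link_gbps: float,
                     efficiency: float = 0.9) -> float:
    """Time to move data over an interconnect at a given usable fraction of line rate."""
    if link_gbps <= 0 or not 0.0 < efficiency <= 1.0:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    return data_bytes * 8 / (link_gbps * 1e9 * efficiency)


# Hypothetical: staging a 50 TB dataset to colocated GPUs.
for gbps in (100, 400):
    minutes = transfer_seconds(50e12, gbps) / 60
    print(f"{gbps}G interconnect: about {minutes:.0f} minutes")
```

At 100G the staging step takes over an hour under these assumptions, and that is with the link carrying nothing else.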

Where to start

Three steps, in order.

  1. Quantify the workload. Model size, GPU count, training cadence. If you don't have these numbers from the AI/ML team, the network conversation is premature.
  2. Assess the existing fabric. Specifically: the highest sustained throughput it carries without drops, tail latency at p99.9, and oversubscription on the path the GPU domain would actually use. Most existing fabrics fail on at least one of these.
  3. Design for one workload, not the catalog. The fabric for a 256-GPU training cluster is different from the fabric for a 4,000-GPU one. Pick the workload that's actually getting funded, not the one in the keynote.
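
The assessment in step 2 can be captured as a simple pass/fail check. This is a sketch only: the field names, the example measurements, and the target thresholds are all hypothetical, and the right targets depend on the workload quantified in step 1.

```python
from dataclasses import dataclass


@dataclass
class FabricMeasurement:
    drop_free_gbps: float     # highest sustained per-port rate seen with zero drops
    p999_latency_us: float    # 99.9th percentile latency on the GPU path
    oversubscription: float   # 3.0 means 3:1 on the path the GPUs would use


@dataclass
class Targets:
    port_gbps: float                  # rate the GPU NICs will drive
    max_p999_latency_us: float        # hypothetical tail-latency budget
    max_oversubscription: float = 1.0


def failed_checks(m: FabricMeasurement, t: Targets) -> list[str]:
    """Return the names of the checks the existing fabric fails."""
    failures = []
    if m.drop_free_gbps < t.port_gbps:
        failures.append("drop-free throughput")
    if m.p999_latency_us > t.max_p999_latency_us:
        failures.append("p99.9 latency")
    if m.oversubscription > t.max_oversubscription:
        failures.append("oversubscription")
    return failures


# Hypothetical measurements from an existing enterprise fabric.
measured = FabricMeasurement(drop_free_gbps=180, p999_latency_us=850, oversubscription=4.0)
print(failed_checks(measured, Targets(port_gbps=400, max_p999_latency_us=50)))
```

An empty list means the existing path is a candidate for carving a partition; anything else is an argument for a parallel fabric.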

The AI mandate isn't waiting. The cost of getting the network wrong shows up as GPU utilization the business can't explain — and won't accept.

// NOTE · Field notes are illustrative pieces meant to frame decisions. Specific figures are directional and should be validated against current sources before citing in board materials, RFPs, or public-facing communication.