
Data Center Cooling: The Unseen Crisis of the 1,100-Watt Chip

AI chip manufacturers have introduced a silent crisis into the heart of the data center: heat. A single modern GPU or CPU package used for AI training can draw over 1,100 Watts of power, concentrating a massive thermal load into a space that traditional air-cooling systems were never designed to handle. For the 200 MW AI Factory, the cooling solution is no longer a secondary infrastructure concern; it dictates the entire facility’s architecture, cost, and ultimately, its viability.

A complete cooling solution for an AI Factory is a complex, multi-layered system that moves heat from the silicon to the rack, and finally out of the building.

Traditional Air Cooling: The Foundation

While liquid cooling is the future, nearly all AI factories still rely on core air-handling infrastructure for facility cooling, especially for networking and storage equipment.

The Chilled Water Plant: The Facility Backbone

The chilled water plant provides the necessary “cold” medium for the CRAH units and the liquid cooling systems. It is the heart of the facility-level cooling loop.

Direct Liquid Cooling (DLC): The AI Necessity

When rack densities soar past 30 kW and up to 150 kW per rack, air is no longer a viable medium. Direct Liquid Cooling (DLC) systems use a dielectric fluid to capture heat right at the chip.
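To see why air runs out of headroom, here is a minimal back-of-the-envelope sketch in Python. It uses the ~1,100 W per-accelerator figure cited above and deliberately ignores CPU, memory, and fan overhead, so real racks draw more; the ~30 kW threshold is the one quoted in this article, not a universal limit.

```python
# Illustrative sketch: rack heat load vs. accelerator count,
# counting only the accelerators themselves (real racks draw more).

ACCELERATOR_W = 1_100  # approximate per-package draw cited in the article

def rack_heat_kw(num_accelerators: int) -> float:
    """Approximate rack heat load in kW from accelerators alone."""
    return num_accelerators * ACCELERATOR_W / 1_000

for gpus in (8, 32, 72, 144):
    kw = rack_heat_kw(gpus)
    verdict = "air cooling may still cope" if kw <= 30 else "direct liquid cooling territory"
    print(f"{gpus:>3} accelerators -> ~{kw:5.1f} kW ({verdict})")
```

Even this optimistic accounting crosses the air-cooling threshold at a few dozen accelerators per rack.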

Example: Supermicro’s Liquid-Cooled GPU Cluster

Hyperscale players are deploying rack-scale solutions like Supermicro’s liquid-cooled NVIDIA HGX B300 systems, designed to house up to 144 GPUs in a single rack and achieve massive power density. The entire system relies on DLC, with cold plates on all high-TDP components, all fed by an external Coolant Distribution Unit (CDU) that manages the primary and secondary plumbing loops (a rough sizing sketch follows the list below):

  1. Primary Loop (Facility Side): Chilled water (FWS) from the central chilled water plant runs to the CDU.
  2. Secondary Loop (IT Side): The CDU contains a heat exchanger that cools the clean, dielectric fluid, which is then pumped to the racks and circulated through the cold plates on the CPUs/GPUs. This secondary loop maintains fluid temperatures above the dew point to prevent condensation.
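As a rough illustration of what the secondary loop has to move for the 144-GPU rack above, the sketch below applies the basic heat balance Q = ṁ · c_p · ΔT. The inputs are assumptions for illustration only: ~1,100 W per GPU, a water-like heat capacity of ~4,186 J/(kg·K) (the dielectric fluids used in practice carry less heat per kilogram), and a 10 °C coolant temperature rise across the cold plates.

```python
# Rough secondary-loop sizing for one liquid-cooled rack (illustrative assumptions).
# Heat balance: Q = m_dot * c_p * delta_T  ->  m_dot = Q / (c_p * delta_T)

GPUS_PER_RACK = 144      # per the Supermicro example above
WATTS_PER_GPU = 1_100    # conservative per-package figure from the article
CP_COOLANT = 4_186.0     # J/(kg*K); water-like value, dielectric fluids are lower
DELTA_T = 10.0           # K; assumed coolant temperature rise across the cold plates
DENSITY = 1_000.0        # kg/m^3; water-like value

heat_w = GPUS_PER_RACK * WATTS_PER_GPU                    # ~158 kW of GPU heat per rack
mass_flow_kg_s = heat_w / (CP_COOLANT * DELTA_T)          # kg/s of coolant required
volume_flow_lpm = mass_flow_kg_s / DENSITY * 1_000 * 60   # litres per minute

print(f"GPU heat per rack    : {heat_w / 1_000:.0f} kW")
print(f"Required coolant flow: {mass_flow_kg_s:.1f} kg/s (~{volume_flow_lpm:.0f} L/min)")
```

The dew-point constraint in step 2 is a separate control requirement: it bounds the supply temperature, not the flow rate computed here.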

200 MW AI Factory: Cooling Component Bill of Materials (BoM)

The cooling load for a 200 MW facility is immense. Assuming a 100 MW IT load and a 50 MW cooling load (resulting in a Power Usage Effectiveness, or PUE, of 1.5):

| Component | Function & Specs for 200 MW Site | Estimated Quantity (IT Load 100 MW) | Top Manufacturers |
| --- | --- | --- | --- |
| Chillers (Large) | Provide 45°F chilled water; capacity ~2,000 to 4,000 tons per unit | 25-35 units (in N+1/2N redundancy) | Johnson Controls (YORK), Carrier, Trane, Daikin |
| Cooling Towers | Reject heat to the atmosphere; often evaporative | 40-60 cells (paired with chillers) | Baltimore Aircoil (BAC), Evapco, Marley (SPX) |
| Pumps | Circulate chilled water (FWS) throughout the campus | Hundreds (primary, secondary, condenser loops) | Grundfos, Flowserve, Bell & Gossett |
| CRAH/CRAC Units | Provide ambient air cooling for non-DLC equipment and humidity control | 100-200+ units (depending on facility design) | Vertiv, Schneider Electric (APC), STULZ, Rittal |
| Coolant Distribution Unit (CDU) | Facilitates the liquid-to-liquid heat exchange and pumps the IT coolant loop | ~55-65 units (1.8 MW to 2.0 MW capacity each) | Vertiv, Schneider Electric (Motivair), CoolIT Systems, Asetek |
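A quick sanity check of these figures, using only the assumptions already stated in the table (100 MW IT load, 50 MW cooling overhead, 1.8-2.0 MW per CDU):

```python
# Back-of-the-envelope checks on the BoM assumptions above (illustrative only).

IT_LOAD_MW = 100.0       # assumed IT load from the text
COOLING_LOAD_MW = 50.0   # assumed cooling overhead from the text

# PUE = total facility power / IT power
pue = (IT_LOAD_MW + COOLING_LOAD_MW) / IT_LOAD_MW
print(f"PUE = ({IT_LOAD_MW:.0f} + {COOLING_LOAD_MW:.0f}) / {IT_LOAD_MW:.0f} = {pue:.2f}")

# CDU count: each unit handles roughly 1.8-2.0 MW of IT heat (per the table).
for cdu_mw in (1.8, 2.0):
    print(f"CDUs at {cdu_mw} MW each: ~{IT_LOAD_MW / cdu_mw:.0f} units (before redundancy)")
```

Both results are roughly consistent with the table; redundancy policies and design margin account for the rest of the spread.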

The Next Wave: Cooling Innovations

Existing hybrid air/DLC systems are rapidly becoming insufficient as chip power density continues to climb, and new technologies are being aggressively deployed to meet the heat demands of the AI Factory.

The race for AI dominance has thus become a race for cooling capacity, driving unprecedented innovation in fluid dynamics, heat transfer, and thermal engineering.
