Saturday, July 27, 2024

Quenching AI’s Thirst: How Liquid Cooling Tackles Infrastructure Demands

The generative AI boom has demonstrated how artificial intelligence can be a technical revolution and an environmental resource hog at the same time.

Data center servers built around the high-end GPUs that power AI applications consume around four times as much power as the servers used for cloud applications. That computational intensity demands ever-larger amounts of IT infrastructure. McKinsey estimates that the power needs of U.S. data centers will reach 35 gigawatts by 2030, up from 17 gigawatts in 2022.

Growing AI infrastructure needs are banging up against current constraints in power, space, budget and water, causing headaches for IT staff and data center operators. Fortunately, adopting liquid cooling can help data centers break through these constraints.

Lack of Data Center Power

At the recent SC23 conference (the International Conference for High Performance Computing, Networking, Storage, and Analysis), power was on everyone’s mind as the No. 1 issue both for expanding existing data centers and for building new ones. Data centers account for about 3% of global energy usage, and AI is pushing that demand even higher.

According to commercial real estate services firm CBRE, “The chief obstruction to data center growth is not the availability of land, infrastructure, or talent. It’s local power.” And it is not just a North American problem. In Ireland, the state electricity grid authority, EirGrid, called a halt to plans for up to 30 potential data centers after limits on data center construction were put in place until 2028.

Data centers have an average power usage effectiveness (PUE) of about 1.57, which means that roughly 36% of the energy a data center draws goes to infrastructure overhead (mostly lighting and cooling) rather than to the IT equipment itself. The great majority of that overhead (about 90%) is spent on cooling. The primary source of heat in the data center is the processors (CPUs and GPUs) in the servers. Using direct-to-chip liquid cooling instead of traditional air cooling can lower the PUE to about 1.08 and increase the amount of power available for pure compute by 70%.
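The PUE arithmetic above can be checked with a short sketch. It assumes the standard definition of PUE (total facility energy divided by IT equipment energy) and, for the compute-gain estimate, a fixed total facility power budget. The function names are illustrative and not from any particular tool. Under that simple fixed-budget assumption the gain in compute power comes out noticeably lower than the 70% quoted above, so the quoted figure presumably rests on additional assumptions that are not stated here.

```python
def overhead_fraction(pue: float) -> float:
    """Share of total facility energy that does NOT reach IT equipment.

    Assumes PUE = total facility energy / IT equipment energy.
    """
    return (pue - 1.0) / pue


def it_power_gain(pue_before: float, pue_after: float) -> float:
    """Relative increase in IT power when PUE improves.

    Assumption: the total facility power budget stays fixed, so IT power
    is simply total / PUE before and after the change.
    """
    return pue_before / pue_after - 1.0


# PUE values quoted in the article.
AIR_COOLED_PUE = 1.57
LIQUID_COOLED_PUE = 1.08

if __name__ == "__main__":
    # Overhead share with air cooling: about 36% of total energy.
    print(f"Air-cooled overhead:    {overhead_fraction(AIR_COOLED_PUE):.1%}")
    # Overhead share with direct-to-chip liquid cooling: about 7%.
    print(f"Liquid-cooled overhead: {overhead_fraction(LIQUID_COOLED_PUE):.1%}")
    # Extra IT power at a fixed facility budget: about 45%.
    print(f"IT power gain:          {it_power_gain(AIR_COOLED_PUE, LIQUID_COOLED_PUE):.1%}")
```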

Limitations on Available Space

In air-cooled data centers, one technique operators use to manage the increased thermal density of high-powered AI servers is to spread them out, both within a rack and across multiple racks. The idea is to reduce the number of servers within a rack so that the existing chilled air cooling system can still accommodate the hotter servers. However, there are two issues with this approach:

  • Data center space comes at a premium, and it is limited. Once a data center is built, it has a fixed amount of space that can accommodate a limited number of server racks. Reducing the server density within a rack reduces the total number of servers that the data center can hold. This is in direct conflict with the demand for more computational power in the data center.
  • AI workloads often cannot tolerate latency between computational units. Spreading out servers and other IT equipment increases the physical distance between them, which can degrade the performance of AI computations.

Direct-to-chip liquid cooling brings targeted thermal management to the hot chips within servers and thus allows for high-density server configurations within a rack. This maximizes utilization of the rack and the data center floor space and enables operators to increase the computational power of their data centers (by upgrading the servers to more powerful ones) while maintaining density.

Budget Constraints

Servers used for AI are extremely expensive, from acquisition costs to ongoing operational expenses. A large part of the operational expense is the cost of the energy required to cool the servers, and this is becoming a headache for data center operators and CFOs as energy costs rise significantly and become more volatile. The result can be unplanned higher OPEX that exceeds budgets, or significant volatility in OPEX, which is a nightmare for CFOs.

Using direct-to-chip liquid cooling reduces the amount of energy dedicated to cooling servers by over 40% on average and thus materially reduces the impact of higher or fluctuating electricity prices on data center OPEX.

Excessive Water Usage

Data centers use significant amounts of water in their cooling systems, resulting in the loss of millions of gallons of potable water evaporated into the atmosphere. According to the U.S. Department of Energy, data centers use up about 2 liters of water per kWh for cooling. Today, just three companies (Microsoft, Google and Meta) collectively use more than twice as much water in their U.S. data centers as the entire country of Denmark.

However, direct-to-chip liquid cooling can significantly reduce the amount of evaporated water needed for server cooling. Indeed, the reduction in water use directly tracks the improvement in PUE that liquid cooling delivers. Compared with chilled air cooling, the power needed for data center cooling with liquid drops by about 28% (from 640 kW to 460 kW). The amount of water used drops by the same percentage.
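The percentage implied by the two cooling power figures can be worked out directly, as in the sketch below. It assumes, as the paragraph does, that evaporative water use scales in direct proportion to cooling power. The baseline water volume is a made-up illustrative number, not a figure from any source.

```python
def fractional_drop(before: float, after: float) -> float:
    """Fractional reduction going from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


def scaled_water_use(baseline_liters: float, cooling_drop: float) -> float:
    """Water use after the change.

    Assumption: water evaporated for cooling is directly proportional
    to the power spent on cooling.
    """
    return baseline_liters * (1.0 - cooling_drop)


# Cooling power figures quoted in the article.
AIR_COOLING_KW = 640.0
LIQUID_COOLING_KW = 460.0

# Hypothetical baseline, chosen only to make the example concrete.
BASELINE_WATER_LITERS = 1_000_000.0

if __name__ == "__main__":
    drop = fractional_drop(AIR_COOLING_KW, LIQUID_COOLING_KW)
    # 180 kW saved out of 640 kW, roughly a 28% reduction.
    print(f"Cooling power reduction: {drop:.1%}")
    remaining = scaled_water_use(BASELINE_WATER_LITERS, drop)
    print(f"Water use after change:  {remaining:,.0f} liters")
```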

Conclusion

As AI continues its seemingly inexorable adoption across industries, IT leaders need solutions that enable scaling up infrastructure in a power-efficient, space-optimized, cost-effective and sustainable manner. Direct-to-chip liquid cooling checks all these boxes. By removing the main impediment to denser configurations — heat — liquid cooling unlocks the ability to pack more compute into existing data centers. And by drastically cutting cooling energy use and water waste, it also allows growth with less strain on budgets and the environment. For any organization pursuing AI, the writing is clearly on the wall — liquid-cooled infrastructure must be part of the plan.
