Rows of equipment in the Meta AI Research SuperCluster (RSC), a new supercomputer to enable new AI models.
Article Source: Data Center Frontier
Photo Credit: Meta
Meta’s vision of an immersive metaverse will require powerful hardware to process the artificial intelligence (AI) to create these digital worlds. The massive data centers that support the metaverse will feature lots of liquid cooling.
At today’s Open Compute Summit, Meta introduced a new AI computing platform, along with updates to its Open Rack and a roadmap for a gradual shift to a water-cooled AI infrastructure. The company plans to use cold plates to provide direct-to-chip cooling for AI workloads on its GPU servers, and is preparing several designs for managing the temperature of supply water as rack power densities increase.
“The power trend increases we are seeing, and the need for liquid cooling advances, are forcing us to think differently about all elements of our platform, rack and power, and data center design,” writes Alexis Bjorlin, Meta Vice President for Engineering, in a blog post accompanying her keynote today at the Summit in San Jose. “As we move into the next computing platform, the metaverse, the need for new open innovations to power AI becomes even clearer.”
In her keynote, Bjorlin unveiled several innovations that will advance Meta’s ambitions:
- The Grand Teton platform, a next-generation GPU-based hardware platform designed to offer twice the compute power and enhanced memory-bandwidth – along with two times the power envelope of predecessor Meta AI systems.
- Open Rack v3, with new features to offer flexibility in how users configure their power and cooling infrastructure, along with longer on-rack backup power.
- An early look at the Air-Assisted Liquid Cooling design that will bring chip-level liquid cooling into Meta data centers.
A New Phase for Meta’s Infrastructure
Today’s announcements at the OCP Summit mark the latest evolution in data center design for Meta, which operates more than 40 million square feet of data centers and says it has 47 data centers under construction across its global network.
Due to the scale of its operations, a shift to liquid cooling by Meta is likely to boost demand for advanced cooling in the OCP ecosystem, and perhaps beyond. A large buyer like Meta could give a shot in the arm to liquid cooling, which has been focused on high-performance computing (HPC) and supercomputing. Google has already shifted its AI infrastructure to liquid cooling, while Microsoft is testing immersion cooling in its production data centers.
Earlier this year Meta revealed a new facility to house its Research SuperCluster (RSC), which will likely become the fastest AI system in the world when it is completed later this year. Much of the GPU-powered infrastructure in that system is air-cooled, but the facility’s InfiniBand network uses a liquid-to-liquid cooling distribution unit.
By embracing Air-Assisted Liquid Cooling (AALC), Meta will begin using cold plates to provide direct-to-chip liquid cooling within its existing data hall design, without the need to install a raised floor or piping to deliver water from outside cooling sources. AALC uses a closed-loop cooling system with a rear-door heat exchanger. The cool air from the existing room-level cooling passes through the rear door, cooling the hot water exiting the server. A Reservoir & Pumping Unit (RPU) housed in an adjacent rack keeps the water moving through the cold plates and heat exchanger.
Meta and Microsoft have been working together on prototypes for AALC that could support up to 40kW of power density, which they demonstrated at last year’s OCP Summit. Last fall an AALC rack design was introduced by Delta ICT, which develops OCP designs for hyperscale users.
A roadmap released with the blog post indicates that Meta plans to begin a shift to AALC, and expects to see power usage increase as its AI gear adds more power for high-bandwidth memory, which will prompt a shift to a “facility water” strategy as thermal loads exceed the limits of the rear-door heat exchanger. That next phase will likely require the addition of piping to bring chilled water to the rack.
Meta’s strategy allows it to add higher-density workloads within its current data centers, while working out the details of a next-generation design to transition to facility water supplies and the additional infrastructure that will require. Meta did not indicate when it will implement the AALC design in production, how widely the design would be used in its infrastructure, or when it contemplates a shift to facility water.
A virtual demo of Meta’s data center hardware is available at MetaInfraHardware.com, which offers a tour through either a web interface or Meta Quest VR goggles. The tour provides a visual overview of the components of the AALC rack and how it works.
OCP Open Rack v3
A key component in this roadmap is the Open Rack v3, which was unveiled at today’s event after years of development. The Open Rack v3 (ORV3) design accommodates multiple configurations for both power and cooling, providing a flexible building block for hyperscale deployments.
“The ORV3 ecosystem has been designed to accommodate several different forms of liquid cooling strategies, including air-assisted liquid cooling and facility water cooling,” Bjorlin wrote in the blog post. “The ORV3 ecosystem also includes an optional blind mate liquid cooling interface design, providing dripless connections between the IT gear and the liquid manifold, which allows for easier servicing and installation of the IT gear.”
The Open Rack v3 is designed to bring 48V power to the equipment for higher efficiency, and has a taller design that supports the addition of liquid cooling infrastructure.
Meta Grand Teton GPU-powered AI Hardware
Meta’s new Grand Teton AI hardware was showcased in today’s presentation by Bjorlin, who previously worked in the silicon operations at Broadcom and Intel.
“We’re excited to announce Grand Teton, our next-generation platform for AI at scale that we’ll contribute to the OCP community,” said Bjorlin. “As with other technologies, we’ve been diligently bringing AI platforms to the OCP community for many years and look forward to continued partnership.”
Grand Teton uses NVIDIA H100 Tensor Core GPUs to train and run AI models that are rapidly growing in their size and capabilities, requiring greater compute. The NVIDIA Hopper architecture, on which the H100 is based, includes a Transformer Engine to accelerate work on these neural networks, which are often called foundation models because they can address an expanding set of applications from natural language processing to healthcare, robotics and more.
“With Meta sharing the H100-powered Grand Teton platform, system builders around the world will soon have access to an open design for hyperscale data center compute infrastructure to supercharge AI across industries,” said Ian Buck, vice president of hyperscale and high performance computing at NVIDIA.
Grand Teton sports 2x the network bandwidth and 4x the bandwidth between host processors and GPU accelerators compared to Meta’s prior Zion system, Meta said.
More details of the Meta OCP announcements are available at the Meta Engineering Blog.