Meta, the social media giant formerly known as Facebook, has been a pioneer in artificial intelligence (AI) for more than a decade, using it to power its products and services, such as News Feed, Facebook Ads, Messenger, and virtual reality. But as the demand for more advanced and scalable AI solutions grows, so does the need for more innovative and efficient AI infrastructure.
At its AI Infra @ Scale event, a one-day virtual conference hosted by Meta’s engineering and infrastructure teams, the company announced a series of new hardware and software projects that aim to support the next generation of AI applications. The event featured speakers from Meta who shared their insights and experiences on building and deploying large-scale AI systems.
Among the announcements was a new AI data center design that will be optimized for both AI training and inference, the two main phases of AI model development and execution. The new data centers will leverage Meta’s own silicon, the Meta Training and Inference Accelerator (MTIA), a chip designed to accelerate AI workloads in domains such as computer vision, natural language processing and recommender systems.
Meta also revealed that it has already built the Research Supercluster (RSC), an AI supercomputer that integrates 16,000 GPUs to help train large language models (LLMs) such as LLaMA, which Meta announced at the end of February.
“We have spent years building an advanced infrastructure for AI, and this work reflects long-term efforts that will enable even more advancements and better use of this technology in everything we do,” Meta CEO Mark Zuckerberg said in a statement.
Building AI infrastructure is table stakes in 2023
Meta is far from the only hyperscaler or large IT provider thinking about purpose-built AI infrastructure. In November, Microsoft and Nvidia announced a partnership for a cloud-based AI supercomputer. The system benefits (unsurprisingly) from Nvidia GPUs, connected with Nvidia’s Quantum-2 InfiniBand networking technology.
A few months later, in February, IBM shared details of its AI supercomputer, codenamed Vela. IBM’s system uses x86 silicon, along with Nvidia GPUs and Ethernet-based networking. Each Vela system node is equipped with eight 80 GB A100 GPUs. IBM’s goal is to build new foundation models that can help meet enterprise AI needs.
Not to be outdone, Google also jumped into the AI supercomputer race with an announcement on May 10. Google’s system uses Nvidia GPUs along with custom-designed Infrastructure Processing Units (IPUs) to enable fast data flow.
Meta is now also jumping into the custom silicon space with its MTIA chip. Custom AI inference chips aren’t a new thing either: Google has been building its Tensor Processing Unit (TPU) for several years, and Amazon has offered its own AWS Inferentia chips since 2018.
For Meta, the need for AI inference spans multiple aspects of its operations for its social media sites, including news feeds, ranking, content understanding, and recommendations. In a video describing MTIA silicon, Meta infrastructure research scientist Amin Firoozshahian commented that traditional CPUs are not designed to handle the inference demands of the applications Meta runs. That’s why the company decided to build its own custom silicon.
“MTIA is a chip optimized for the workloads that matter to us and designed specifically for those needs,” said Firoozshahian.
Meta is also a heavy user of the open source PyTorch machine learning (ML) framework, which it originally created. Since 2022, PyTorch has been under the governance of the Linux Foundation’s PyTorch Foundation effort. Part of the goal with MTIA is to have highly optimized silicon for running large-scale Meta PyTorch workloads.
MTIA silicon is a 7nm (nanometer) process design and can provide up to 102.4 TOPS (trillion operations per second). The MTIA is part of a highly integrated approach within Meta to optimize AI operations, including networking, data center optimization and power utilization.
The data center of the future is built for AI
Meta has been building its own data center for over a decade to meet the needs of its billions of users. So far, it’s worked well, but the explosive growth in AI demands means it’s time to do more.
“Our current generation of data center designs are world class, power and energy efficient,” said Rachel Peterson, vice president of data center strategy at Meta, during a panel discussion at the Infra@scale event. “It’s actually supported us a lot through multiple generations of servers, storage, and networking, and can really serve our current AI workloads very well.”
As the use of AI in Meta grows, more computing power will be needed. Peterson noted that Meta sees a future where AI chips are expected to consume more than 5 times the power of typical Meta CPU servers. That expectation has caused Meta to rethink data center cooling and provide liquid cooling for the chips to deliver the right level of power efficiency. Allowing for adequate cooling and power to enable AI is the driving force behind Meta’s new data center designs.
“As we look to the future, it’s always about planning for the future of AI hardware and systems and how we can have the highest performing systems in our fleet,” Peterson said.