Meta is building for the future of AI at every level – from hardware, like MTIA v1, Meta’s first-generation AI inference accelerator, to publicly released models, like Llama 2, Meta’s next-generation large language model, to new generative AI (GenAI) tools like Code Llama.

Delivering next-generation AI products and services at Meta’s scale also requires a next-generation infrastructure.

The 2023 edition of Networking at Scale focused on how Meta’s engineers and researchers have designed and operated the network infrastructure for Meta’s AI workloads over the last several years, including our numerous ranking and recommendation workloads and our immense GenAI models. The talks cover a wide range of topics, including physical and logical network design, custom routing and load balancing solutions, performance tuning, debugging, and benchmarking, and workload simulation and planning. We also look ahead to the requirements of the GenAI models coming in the next several years.

Networking for GenAI Training and Inference Clusters

Jongsoo Park, Research Scientist, Infrastructure
Petr Lapukhov, Network Engineer

Developing new GenAI technologies and incorporating them into product features is a top priority at Meta. But the sheer scale and complexity of GenAI models means new challenges for Meta’s network infrastructure.

Jongsoo Park and Petr Lapukhov discuss the unique requirements of new large language models, and how Meta’s infrastructure is changing for the new GenAI landscape.

Meta’s Network Journey to Enable AI

Hany Morsy, Network Engineer
Susana Contrera, Network Engineer

Over the years, Meta’s AI infrastructure has transitioned from CPU-based to GPU-based training to keep up with growing AI workloads. As a result, we have deployed large-scale, distributed, network-interconnected systems to support these workloads.

Today, our training clusters use a RoCE-based network fabric with a Clos topology, where leaf switches connect to GPU hosts and spine switches provide scale-out connectivity between GPUs across the cluster.
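As a rough illustration of that leaf-spine structure, here is a minimal sketch in Python. The switch counts, port counts, and naming below are invented for illustration and are not Meta’s actual fabric or tooling; the point is simply that every leaf connects to every spine, which gives hosts on different leaves multiple equal-cost paths.

```python
# Hypothetical sketch of a two-tier leaf/spine (Clos) fabric.
# Switch and host counts are illustrative, not Meta's real topology.

NUM_SPINES = 4
NUM_LEAVES = 8
HOSTS_PER_LEAF = 16  # GPU hosts attached to each leaf switch

def build_fabric():
    """Return an adjacency map: node -> set of directly connected nodes."""
    adj = {}

    def link(a, b):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    for leaf in range(NUM_LEAVES):
        # Every leaf connects to every spine (full mesh between tiers),
        # which is what provides the fabric's scale-out path diversity.
        for spine in range(NUM_SPINES):
            link(f"leaf{leaf}", f"spine{spine}")
        # GPU hosts hang off the leaf tier.
        for host in range(HOSTS_PER_LEAF):
            link(f"leaf{leaf}", f"gpu-host{leaf}-{host}")
    return adj

if __name__ == "__main__":
    fabric = build_fabric()
    # Any two GPU hosts on different leaves can reach each other over
    # NUM_SPINES equal-cost paths (host -> leaf -> spine -> leaf -> host).
    print(len(fabric["leaf0"]), "neighbors per leaf")  # spines + local hosts
```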

Hany Morsy and Susana Contrera delve into how Meta’s network builds have evolved to support the needs of AI services. Along the way, they share challenges encountered, new solutions that were implemented, and the strategic considerations that have gone into building Meta’s high-performance, efficient network fabric for AI workloads.

Scaling RoCE Networks for AI Training

Adi Gangidi, Production Network Engineer

Adi Gangidi provides an overview of Meta’s RDMA deployment, based on RoCEv2 transport, for supporting our production AI training infrastructure. He sheds light on how Meta’s infrastructure is designed to maximize both the raw performance and the consistency that are fundamental for AI workloads.

The talk also covers challenges in the routing, transport, and hardware layers that were solved along the way to scale Meta’s infrastructure, as well as opportunities for further progress over the next few years.

Traffic Engineering for AI Training Networks

Shuqiang Zhang, Software Engineer
Jingyi Yang, Software Engineer

Meta has been operating RoCE-based distributed training clusters to serve its internal AI training workloads since 2020. But in those early days, maintaining consistent job performance was a challenge.

Shuqiang Zhang and Jingyi Yang discuss centralized traffic engineering, one of Meta’s solutions to this challenge, which dynamically places traffic over all available paths in a load-balanced manner. They go over the centralized traffic engineering solution’s design, development, evaluation, and operational experience.
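The core idea of spreading traffic across all available paths, rather than relying on per-flow hashing alone, can be sketched with a toy greedy placement. This is an illustrative assumption about how a centralized controller might balance flows, not the design presented in the talk:

```python
# Toy sketch of centralized traffic placement: assign each flow to the
# least-loaded of the available equal-cost paths. Illustrative only; the
# real system's inputs, constraints, and algorithm are more involved.

from typing import Dict, List

def place_flows(flows: Dict[str, float], paths: List[str]) -> Dict[str, str]:
    """Map each flow (name -> demand in Gbps) to a path, keeping path loads balanced."""
    load = {p: 0.0 for p in paths}
    placement = {}
    # Placing the largest demands first keeps the greedy result closer to optimal.
    for flow, demand in sorted(flows.items(), key=lambda kv: -kv[1]):
        target = min(load, key=load.get)   # currently least-loaded path
        placement[flow] = target
        load[target] += demand
    return placement

if __name__ == "__main__":
    flows = {"job-a": 40.0, "job-b": 25.0, "job-c": 25.0, "job-d": 10.0}
    paths = ["spine0", "spine1", "spine2", "spine3"]
    print(place_flows(flows, paths))
```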

Network Observability for AI/HPC Training Workflows

Shengbao Zheng, Research Scientist

Having high-performance and reliable collective communication over Meta’s AI-Zone RDMA network is foundational for enabling and scaling Meta’s AI training and inference workloads. To facilitate this, it’s necessary to capture top-down observability, from workload down to the network, for collective communication; this allows us to attribute performance regressions and training failures to the backend network when appropriate.

Meta has introduced two important tools for this: ROCET, which associates jobs with RDMA network metrics and provides analysis on top of them, and the PARAM benchmark, which enables analyzing and tuning collective communication operations using workload traces. We recently shared these systems with the community via Chakra, an ecosystem for co-designing efficient distributed ML systems. In this talk, Shengbao Zheng discusses the design and use cases for these tools.
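To make the idea of attributing regressions to the backend network concrete, here is a hypothetical sketch of joining a job’s step times with RDMA congestion counters collected over the same window. The metric names, threshold, and interface are invented for illustration and are not ROCET’s actual schema or logic:

```python
# Hypothetical sketch: flag whether a slow training step coincides with
# congestion signals on the RDMA network. Metric names and the threshold
# are invented for illustration.

from dataclasses import dataclass
from typing import List

@dataclass
class StepSample:
    step_time_ms: float     # measured training step time
    cnp_count: int          # RoCE congestion notification packets seen
    pfc_pause_us: float     # time spent in PFC pause during the step

def attribute_regression(samples: List[StepSample],
                         baseline_ms: float,
                         slowdown: float = 1.2) -> List[str]:
    """Label each step as 'ok', 'network', or 'other' based on congestion signals."""
    labels = []
    for s in samples:
        if s.step_time_ms <= baseline_ms * slowdown:
            labels.append("ok")
        elif s.cnp_count > 0 or s.pfc_pause_us > 0:
            labels.append("network")   # slowdown correlates with fabric congestion
        else:
            labels.append("other")     # slow, but no backend-network signal
    return labels

if __name__ == "__main__":
    window = [StepSample(102, 0, 0.0), StepSample(180, 450, 1200.0), StepSample(150, 0, 0.0)]
    print(attribute_regression(window, baseline_ms=100))
```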

Arcadia: End-to-End AI System Performance Simulator – Fostering data-driven decision-making and promoting the future evolution of AI systems

Zhaodong Wang, Research Scientist
Satyajeet Singh Ahuja, Networking Modeling and Optimization Engineer

Arcadia is a unified system designed to simulate the compute, memory, and network performance of AI training clusters. By providing a multi-disciplinary performance analysis framework, Arcadia aims to facilitate the design and optimization of various system levels, including application, network, and hardware.

With Arcadia, researchers and practitioners can gain valuable insights into the performance of future AI models and workloads on specific infrastructures and make data-driven decisions around how models and hardware will evolve in the future.

Arcadia allows Meta’s engineers to simulate the performance impact of scheduled operational tasks on AI models that are running in production and helps them make job-aware decisions during day-to-day operational activity.
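For a flavor of what such a simulator estimates, here is a deliberately simplified sketch that combines a per-step compute estimate with a textbook ring-allreduce communication estimate to judge whether a job is compute- or network-bound on a given cluster. The cost models and every number below are assumptions for illustration, not Arcadia’s actual models or data:

```python
# Deliberately simplified sketch of estimating a training step's duration
# from compute and network parameters. Textbook approximations only.

def step_time_s(flops_per_step: float,
                peak_flops: float,
                param_bytes: float,
                ranks: int,
                link_bw_bytes_s: float) -> float:
    """Estimate one data-parallel step as compute time plus ring-allreduce time."""
    compute = flops_per_step / peak_flops
    # Ring allreduce moves ~2*(N-1)/N of the gradient bytes over each link.
    comm = 2 * (ranks - 1) / ranks * param_bytes / link_bw_bytes_s
    # Assume no overlap between compute and communication (worst case).
    return compute + comm

if __name__ == "__main__":
    est = step_time_s(flops_per_step=1.5e15,     # per-GPU FLOPs in one step (illustrative)
                      peak_flops=300e12,         # sustained FLOP/s per GPU (illustrative)
                      param_bytes=140e9,         # gradient bytes exchanged (illustrative)
                      ranks=1024,
                      link_bw_bytes_s=50e9)      # ~400 Gbps per GPU (illustrative)
    print(f"estimated step time: {est:.2f} s")
```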

Zhaodong Wang and Satyajeet Singh Ahuja discuss Arcadia’s capabilities and its potential impact in advancing the field of AI systems and infrastructure.
