- Facebook AI Research (FAIR) has made notable advances in developing AI training hardware that is considered to be among the best in the world.
- We have done this through a combination of hardware expertise, partner relationships with vendors, and a significant strategic investment in AI research.
- FAIR is more than tripling its investment in GPU hardware as we focus even more on research and enable other teams across the company to use neural networks in our products and services.
- As part of our ongoing commitment to open source and open standards, we plan to contribute our innovations in GPU hardware to the Open Compute Project so others can benefit from them.
Although machine learning (ML) and artificial intelligence (AI) have been around for decades, most of the recent advances in these fields have been enabled by two trends: larger publicly available research data sets and the availability of more powerful computers — specifically ones powered by GPUs. Most of the major advances in these areas move forward in lockstep with our computational ability, as faster hardware and software allow us to explore deeper and more complex systems.
At Facebook, we’ve made great progress thus far with off-the-shelf infrastructure components and design. We’ve developed software that can read stories, answer questions about scenes, play games and even learn unspecified tasks by observing a few examples. But we realized that truly tackling these problems at scale would require us to design our own systems. Today, we’re unveiling our next-generation GPU-based systems for training neural networks, which we’ve code-named “Big Sur.”
Faster, more versatile, and efficient neural network training
Big Sur is our newest Open Rack-compatible hardware designed for AI computing at a large scale. In collaboration with partners, we’ve built Big Sur to incorporate eight high-performance GPUs of up to 300 watts each, with the flexibility to configure between multiple PCI-e topologies. Leveraging NVIDIA’s Tesla Accelerated Computing Platform, Big Sur is twice as fast as our previous generation, which means we can train twice as fast and explore networks twice as large. And distributing training across eight GPUs allows us to scale the size and speed of our networks by another factor of two.
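To make the numbers above concrete, here is a minimal back-of-the-envelope sketch in Python. The GPU count, the 300-watt ceiling and the two factors of two come from the paragraph above; the function names are hypothetical, and the sketch assumes the two speedups are independent and simply multiply, which is an idealization rather than a measured result.

```python
# Figures taken from the post: eight GPUs, up to 300 watts each.
GPUS_PER_SERVER = 8
MAX_WATTS_PER_GPU = 300


def peak_gpu_power_watts(gpus=GPUS_PER_SERVER, watts_per_gpu=MAX_WATTS_PER_GPU):
    """Upper bound on GPU power draw for one server (GPUs only, not the whole system)."""
    return gpus * watts_per_gpu


def combined_scaling(hardware_speedup=2.0, multi_gpu_factor=2.0):
    """Combined scaling factor.

    Assumption (not stated in the post): the per-GPU hardware speedup and the
    multi-GPU distribution factor are independent, so they multiply.
    """
    return hardware_speedup * multi_gpu_factor


if __name__ == "__main__":
    print(peak_gpu_power_watts())  # 8 * 300 = 2400
    print(combined_scaling())      # 2.0 * 2.0 = 4.0
```

Under these assumptions, a fully populated server has a GPU power budget of up to 2,400 watts, and the two factors of two together suggest roughly a fourfold gain over the previous generation. Real workloads will vary with the model and the PCI-e topology chosen.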
In addition to the improved performance, Big Sur is far more versatile and efficient than the off-the-shelf solutions in our previous generation. While many high-performance computing systems require special cooling and other unique infrastructure to operate, we have optimized these new servers for thermal and power efficiency, allowing us to operate them even in our own free-air cooled, Open Compute standard data centers. Big Sur was built with the NVIDIA Tesla M40 in mind but is qualified to support a wide range of PCI-e cards. We also anticipate that this design will yield efficiencies in production and manufacturing, meaning we’ll get a lot more computational power per dollar we invest.
Servers can also require maintenance and hefty operational resources, so, like the other hardware in our data centers, Big Sur was designed around operational efficiency and serviceability. We’ve removed the components that don’t get used very much, and components that fail relatively frequently, such as hard drives and DIMMs, can now be removed and replaced in a few seconds. Touch points for technicians are all Pantone 375 C green, the same touch-point color as all of Facebook’s custom data center hardware, which allows technicians to intuitively identify, access and remove parts. No special training or service guide is needed. Even the motherboard can be removed within a minute, whereas on the original AI hardware platform it took over an hour. In fact, Big Sur is almost entirely toolless: the CPU heat sinks are the only things you need a screwdriver for.
Collaboration through open source
We plan to open-source Big Sur and will submit the design materials to the Open Compute Project (OCP). Facebook has a culture of support for open source software and hardware, and FAIR has continued that commitment by open-sourcing our code and publishing our discoveries as academic papers freely available from open-access sites. We’re very excited to add hardware designed for AI research and production to our list of contributions to the community.
We want to make it a lot easier for AI researchers to share techniques and technologies. As with all hardware systems that are released into the open, it’s our hope that others will be able to work with us to improve it. We believe that this open collaboration helps foster innovation for future designs, putting us all one step closer to building complex AI systems that bring this kind of innovation to our users and, ultimately, help us build a more open and connected world.
Thanks to all the people who helped make this happen, including William Arnold, Stephen Chan, Jia Ning, and Whitney Zhao.