For the first few years of Facebook’s existence, we served our users from data centers in a single region in Northern California. As the site grew, we added a second region of data centers in Virginia in 2007, and this year, we launched our third region in Prineville, Oregon.
Our new facility in Prineville marks a shift in our data center strategy. Previously, Facebook leased data center space from other companies. In Prineville, we built the entire data center and servers from the ground up, designing everything in tandem to achieve maximum efficiency. Last month, we shared that technology as open hardware through the Open Compute Project.
Building the facility and server hardware was a significant undertaking, but we faced another challenge: ensuring that our entire software stack could evolve to work smoothly in the new region, without interrupting what our users do every day on Facebook. The solution was to simulate a third region of data centers, even before the new servers in Prineville came online. We called this effort and the simulated third region “Project Triforce.”
Some of the challenges included:
- Uncharted territory: The size and complexity of our infrastructure had increased so dramatically over the years that estimating the effort required to build a successful data center was no small task. With so many components in our infrastructure, testing each one independently would be inadequate: it would be difficult to have confidence that we had full test coverage of all components, and unexpected interactions between components would go untested. This required a more macro approach – we needed to test the entire infrastructure in an environment that resembled the Oregon data center as closely as possible.
- Software complexity: Facebook has hundreds of specialized back-end services that serve products like News Feed, Search and Ads. While most of these systems were designed to work with multiple data center regions, they hadn’t been tested outside of a two-region configuration.
- New configurations: Recent innovations at Facebook in using Flashcache with MySQL allow us to achieve twice the throughput on each of our new MySQL machines, cutting our storage tier costs in half. However, this means that we need to run two MySQL instances on each machine in the new data center. This new setup was untested and required changes in the related software stacks.
- Unknown unknowns: In our large, complex infrastructure, the assumption that there are only two regions had crept into the system in subtle ways over the years. Such assumptions needed to be uncovered and fixed.
- Time crunch: Our rapidly growing user base and traffic load meant we were working on a very tight schedule – there was very little time between when these machines became physically available to us and when they had to be ready to serve production traffic. This meant that we needed to have our software stack ready well before the hardware became available in Oregon.
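Facebook’s exact database configuration isn’t public, but the two-instances-per-machine setup described above can be sketched with MySQL’s standard `mysqld_multi` facility, which manages several mysqld processes on one host. The ports and paths here are purely illustrative:

```
# /etc/my.cnf -- two mysqld instances on one machine (illustrative ports/paths)
[mysqld_multi]
mysqld     = /usr/bin/mysqld_safe
mysqladmin = /usr/bin/mysqladmin

[mysqld1]
port    = 3306
socket  = /var/run/mysqld/mysqld1.sock
datadir = /var/lib/mysql1

[mysqld2]
port    = 3307
socket  = /var/run/mysqld/mysqld2.sock
datadir = /var/lib/mysql2
```

With a configuration like this, `mysqld_multi start 1,2` brings up both instances, each with its own port and data directory.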
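To make the “unknown unknowns” concrete, here is a hypothetical sketch (not Facebook’s actual code) of how a two-region assumption can hide in otherwise innocent-looking logic, and a fix that generalizes to any number of regions:

```python
# Illustrative region names only: California, Virginia, Oregon.
REGIONS = ["sjc", "ash", "prn"]

def failover_region_buggy(region):
    # Bakes in the assumption that there are exactly two regions:
    # any region that isn't "sjc" is assumed to be its one peer.
    return "ash" if region == "sjc" else "sjc"

def failover_region(region, regions=REGIONS):
    # Region-count-agnostic: fail over to the next region in a fixed ring,
    # so adding a region only requires updating the list.
    i = regions.index(region)
    return regions[(i + 1) % len(regions)]
```

The buggy version keeps working in a two-region world and only misbehaves once a third region appears, which is exactly why such assumptions are hard to find without an end-to-end test region.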
The solution involved taking over an active production cluster of thousands of machines in Virginia and reconfiguring them to look like a third region. Because of the geographical distance, the latency between Virginia and our master region in California was far larger than the latency expected between our new data center in Prineville and California. This was actually a good thing: it stressed our software stack more than we expected the Prineville data center to, and allowed us to quickly surface any potential latency problems that could arise when our Prineville data center came online.
We wanted Project Triforce to be as close to an actual third region as possible. For example, we tested the software stack with production traffic. The only things simulated were the databases, because we didn’t want to create a full replica of our entire set of databases. This required some tricky engineering to make nearby database replicas look like a local replica for the Triforce region.
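The replica trick can be sketched roughly as a lookup: regions with a local database replica read locally, while the simulated region is pointed at the nearest real replicas. The names and topology below are hypothetical, not Facebook’s actual configuration:

```python
# Hypothetical fallback map: the simulated Triforce region reads from
# nearby Virginia ("ash") replicas as if they were local.
REPLICA_FALLBACK = {
    "triforce": "ash",
}

def replica_region(region, regions_with_replicas=frozenset({"sjc", "ash"})):
    if region in regions_with_replicas:
        return region                 # a local replica exists
    return REPLICA_FALLBACK[region]   # otherwise use the configured nearby region
```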
To streamline the configuration and bootstrapping of a diverse set of systems in a new data center, we built an in-house software suite named Kobold that automates most of these steps. Kobold gives our cluster deployment team the ability to build up and tear down clusters quickly, conduct synthetic load and power tests without impacting user traffic, and audit our steps along the way. Tens of thousands of servers were provisioned, imaged and brought online in less than 30 days. Production traffic was served within 60 days. Traditionally, companies turn up production traffic manually with many people over a period of weeks. Now one person can do it in less than ten minutes.
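Kobold itself isn’t public, but the idea it embodies can be modeled as an ordered, audited sequence of idempotent turn-up steps per host. Everything below – the step names, logging, and return values – is a hypothetical sketch, not Kobold’s actual design:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("turnup")

def provision(host):
    # Stand-in for real work: allocating the host, network config, etc.
    return True

def image(host):
    # Stand-in for installing the OS image.
    return True

def health_check(host):
    # Stand-in for synthetic load / hardware checks.
    return True

STEPS = [provision, image, health_check]  # run in order for every host

def turn_up(hosts):
    ready = []
    for host in hosts:
        if all(step(host) for step in STEPS):
            log.info("host %s ready", host)       # audit trail
            ready.append(host)
        else:
            log.warning("host %s failed turn-up", host)
    return ready
```

Because each step is checked and logged per host, the same pipeline can be rerun safely to retry failures or to tear down and rebuild a cluster.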
Once the machines in the Triforce region were configured, we started setting up and testing each of the services. This required careful orchestration because there were complex dependencies between various services. Issues were fixed as they arose and in a little over a month, the Triforce region was ready and began taking production traffic. Since then, it has continued to serve production traffic to guard against introducing changes that might not work with more data centers.
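The careful orchestration described above amounts to respecting a dependency graph: a service can only be brought up after everything it depends on is running. A minimal sketch using Python’s standard-library topological sort – the services and edges here are made up for illustration:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each service maps to the services it needs.
deps = {
    "newsfeed": {"cache", "database"},
    "search":   {"cache", "database"},
    "cache":    {"database"},
    "database": set(),
}

# static_order() yields services so that dependencies always come first.
startup_order = list(TopologicalSorter(deps).static_order())
```

Here `database` is brought up first and `cache` before the products that depend on it; a cycle in the graph would raise an error, flagging a dependency problem before turn-up begins.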
As soon as the physical machines became available in our new data center in early 2011, work began in earnest to bring the service online. Today, only a couple of months later, we are serving production traffic out of Prineville. One of the best parts of this project is that virtually none of our users on Facebook noticed what was happening – which is exactly what we were aiming for.