Facebook recognizes its birthday each year with Friends Day, a celebration of friendships. This year, on our 12th birthday, we celebrated in a big way by shipping hundreds of millions of personalized videos to people around the world, which have received more than 2 billion views so far. (You can see an example of one below.) Thanks to thoughtful planning across product design, product development, and infrastructure preparation, we were able to pull this off in an efficient, performant, and stable way without disrupting normal Facebook traffic.
There were three main steps in generating the videos: We needed to curate the photos for every person, render them into a video, and surface that video on the right day. As simple as that may seem, it took strategy and a lot of work to scale this for hundreds of millions of people.
Load testing, efficiency improvement, and capacity planning
We started by testing the video generation pipeline. How much load would there be on the databases pulling and ranking all the data for a person? How much CPU and RAM would rendering a single video take? How long would the job take? How many concurrent processes could we run on a single machine? Figuring these out would let us determine how large a cluster we'd need and how much time we'd need to generate all of the videos. We started by running tests on our own profiles and, naturally, the profiles of our friends on the team. The idea was twofold: With that information, we could fine-tune the processing for pregenerating the videos, and we'd have a rough idea of the capacity we'd need. We achieved a ~2.8x speedup by optimizing on-screen photo times in the renderer and ultimately cut processing time by 69 percent with our multithreaded implementation.
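On a single machine, the multithreaded approach boils down to running several render jobs concurrently and measuring how far you can push the concurrency before CPU or RAM becomes the bottleneck. Here's a minimal Python sketch of that idea; the `render_video` function and job granularity are hypothetical stand-ins, not our actual renderer:

```python
# Hypothetical sketch of concurrent rendering on one machine; the
# render_video body is a placeholder for the real renderer invocation.
from concurrent.futures import ThreadPoolExecutor

def render_video(person_id):
    # In reality this would invoke the video renderer; here we just
    # return the name of the file it would produce.
    return f"video-{person_id}.mp4"

def render_batch(person_ids, concurrency):
    """Run up to `concurrency` renders at once on a single machine."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # map() preserves input order, so results line up with person_ids.
        return list(pool.map(render_video, person_ids))
```

Tuning `concurrency` against per-job CPU and RAM cost is exactly the per-machine question the load tests answered.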
To help with capacity allocation and distribution, we needed to take into consideration resource usage per video generated (CPU cores, memory, and time), storage IOPS for writing and reading videos, network bandwidth limitations at the cluster and backbone level, and power availability in each region. Thorough load testing allowed us to gauge peak loads and to identify and work around constraints so we could generate these videos without affecting the reliability of our production infrastructure. In the end, the load tests' results closely matched our predictions. Along the way, we discovered and eliminated bottlenecks to achieve the maximum throughput from allocated capacity.
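Given per-video resource measurements from load testing, sizing the cluster is essentially a back-of-the-envelope calculation. The sketch below shows the shape of that math; every number in the example is an invented placeholder, not an actual Friends Day figure:

```python
# Back-of-the-envelope cluster sizing; all numbers here are illustrative
# assumptions, not actual Friends Day figures.

def machines_needed(total_videos, cpu_seconds_per_video,
                    cores_per_machine, cores_per_job, deadline_hours):
    """Estimate how many machines it takes to render every video
    before the deadline."""
    concurrent_jobs = cores_per_machine // cores_per_job
    videos_per_machine = (
        concurrent_jobs * deadline_hours * 3600 // cpu_seconds_per_video)
    # Ceiling division: a partially used machine still counts.
    return -(-total_videos // videos_per_machine)

# e.g., 400M videos at 30 CPU-seconds each, 32-core hosts running
# 2-core jobs, and a 72-hour rendering window:
print(machines_needed(400_000_000, 30, 32, 2, 72))  # → 2894
```

The same shape of calculation applies to the other budgets (storage IOPS, network bandwidth, power): measure the per-video cost, divide the available headroom by it, and take the tightest constraint.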
Keeping our infrastructure safe
Friends Day videos were created and delivered through our shared infrastructure. We used the same production systems that run many services on our site, including compute tiers, storage tiers, and the same network used internally and externally to serve Facebook. Instead of dedicating permanent capacity to Friends Day, we used an automated system that allowed us to keep production services safe. Based on the capacity requirements calculated from the load tests, the system automatically chose machines that wouldn't disrupt other existing services. By continuously monitoring for abnormalities in power consumption, CPU usage, and network bandwidth, the system automatically blocklisted offending machines and redirected their jobs to other available machines. This flexible capacity allowed us to operate silently at this scale. On top of that, a throttling mechanism on video creation and delivery was put in place so we could pull the product back if issues arose. These automated systems, plus close eyes on our dashboards during launch day, kept our infrastructure safe.
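The blocklisting behavior described above can be pictured as a health check plus a scheduler that skips unhealthy machines. This is only a sketch of the idea: the metric names, thresholds, and round-robin assignment below are all hypothetical, not the actual system.

```python
# Illustrative sketch of automatic blocklisting; metric names,
# thresholds, and the round-robin scheduler are hypothetical.

THRESHOLDS = {"power_watts": 400, "cpu_pct": 90, "net_mbps": 800}

def healthy(metrics):
    """A machine stays eligible only while every metric is under its limit."""
    return all(metrics[name] <= limit for name, limit in THRESHOLDS.items())

def schedule(jobs, machines):
    """Assign jobs round-robin across machines, skipping blocklisted ones."""
    blocklist = {name for name, m in machines.items() if not healthy(m)}
    available = [name for name in machines if name not in blocklist]
    assignment = {job: available[i % len(available)]
                  for i, job in enumerate(jobs)}
    return assignment, blocklist
```

Run continuously, a loop like this lets jobs flow away from overloaded hosts without any manual intervention.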
While one team was figuring out how to serve this many videos, another was working on how to generate the content of the videos. We learned in years past that, despite our best intentions, this is tricky territory. We think nothing is more important than making sure our users have a good experience, so we worked really hard, using signals available to us, to make sure that we didn’t show something unpleasant to the people receiving these videos.
Those signals came from a variety of places. If someone had used break-up checkup, the ex got nixed from consideration for that person’s video. Same with anyone people had blocked or marked as “I don’t want to see anything from this person.” We also factored in likes, comments, and tagged people.
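In rough pseudocode terms, that filtering and ranking might look like the sketch below. The field names and weights are invented for illustration and are not the actual signals or scoring:

```python
# Hypothetical sketch of candidate filtering and ranking; the field
# names and weights are invented for illustration.

def eligible_friends(friends, blocked, hidden, exes):
    """Drop anyone the person blocked, hid, or flagged via break-up checkup."""
    excluded = blocked | hidden | exes
    return [f for f in friends if f not in excluded]

def rank(candidates, likes, comments, tags):
    """Order the remaining friends by simple engagement counts."""
    def score(friend):
        return (3 * tags.get(friend, 0)
                + 2 * comments.get(friend, 0)
                + likes.get(friend, 0))
    return sorted(candidates, key=score, reverse=True)
```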
Those were the simpler indicators to consider, but we leveraged some of our more complicated and advanced research in the effort, too. Specifically, our image-understanding technology was really useful in identifying which photos would make the most visual sense in the video. Image understanding is a hard AI problem to solve; it essentially makes it possible for a computer to recognize what’s in a photo even though collections of pixels can be hard to distinguish. In this case, our AI algorithms looked at each photo to determine a variety of things. Were people in the photo? (Apologies to dog lovers, food lovers, and meme lovers. All of those are super-popular, but we didn’t include them as “friends” for this.) Were the people easy to see? Large groups of people wouldn’t be easy to see in a quick video shot, so we tried to get close-up pictures of people and their friends for most of the video. That is, until we got to that final image of the video. We wanted a larger group for that one since the sentiment was about a group of friends. Our AI technology made it possible to evaluate for all of that.
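The selection logic described above (people present, close-ups for most shots, a group photo for the finale) can be sketched as follows, assuming a hypothetical face detector that labels each photo with a face count and an average face size:

```python
# Sketch of the photo-selection heuristic, assuming a hypothetical face
# detector that reports a face count and average face size per photo.

def pick_photos(photos, slots):
    """Prefer close-ups (few, large faces) for most of the video, then a
    bigger group shot for the final frame."""
    with_people = [p for p in photos if p["num_faces"] > 0]  # no memes/food/dogs
    closeups = sorted((p for p in with_people if p["num_faces"] <= 3),
                      key=lambda p: p["avg_face_area"], reverse=True)
    groups = sorted(with_people, key=lambda p: p["num_faces"], reverse=True)
    finale = groups[0] if groups else None
    body = [p for p in closeups if p is not finale][:slots - 1]
    return body + ([finale] if finale else [])
```

The real models evaluated much more than a face count, but the overall shape is the same: score each photo, then fill the video's slots according to what each slot calls for.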
With the content and capacity considerations taken care of, it was time to devise a rollout plan. To avoid overwhelming our CDNs and degrading the user experience, it was really important to stagger the video rollout to hundreds of millions of people intelligently. Typically this isn't too hard a problem, because we can slowly roll out features over days or even weeks, but this had to be available worldwide on February 4 — not before.
We looked at our peak traffic on every continent and tried to alleviate the load during peak time, when there's the most contention for resources. This was pretty easy in Asia, where traffic is spread across many time zones, so there's less overall traffic during the region's peak. But in places like Latin America, where a majority of users are in one time zone, things got trickier: we had to roll out more aggressively during off-peak hours to shift as much of the load as possible away from peak time. This got pretty complicated when we were rolling out to multiple regions at once!
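One way to picture the off-peak staggering is a scheduler that spreads a region's users across its non-peak local hours. The peak window and user counts below are invented examples, not the actual Friends Day schedule:

```python
# Illustrative staggering sketch; the peak window and user counts are
# invented, not the actual Friends Day rollout schedule.

def offpeak_hours(peak_start, peak_end):
    """Local hours of the day (0-23) outside the peak window."""
    peak = set(range(peak_start, peak_end))
    return [h for h in range(24) if h not in peak]

def stagger(user_count, peak_start, peak_end):
    """Spread a region's users evenly across its off-peak local hours."""
    hours = offpeak_hours(peak_start, peak_end)
    per_hour = -(-user_count // len(hours))  # ceiling division
    return {h: min(per_hour, max(0, user_count - i * per_hour))
            for i, h in enumerate(hours)}
```

A region concentrated in one time zone gets fewer off-peak hours to work with, which is why the rollout there had to be more aggressive per hour.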
We learned a lesson from the Look Back videos we launched in years past: these videos can go viral immediately, with global traffic pouring in. Since we didn't roll out Friends Day videos all at once, we were able to carefully control the incoming flow of people and ensure that they had a great experience. At any rate, for those 36 hours, someone was watching the traffic graphs and product health metrics at all times. Aside from a few error messages the first night (fixed with some quick code pushes), the videos launched fairly seamlessly.
Other considerations and tactics
There are two kickers in this whole story. The first: We wrote the Android and iOS UIs in less than a month. We started in mid-November and quickly realized the code had to be done before Christmas. Why? We wanted to make sure that for those who got a new phone as a present, the code would be in the app they downloaded. This was pretty tricky when we were simultaneously drawing up data structures and APIs on the backend!
The second: This effort utilized React for web and React Native for both iOS and Android. In fact, this was one of the first React Native features in the main Facebook for Android app. React and React Native allowed us to create a complex, feature-rich, responsive editor on three platforms in a very short amount of time. Kudos to the React Native team for helping to enable this product.
Happy Friends Day!
At this scale, we know we might not have picked the exact right images for everyone, but we hope the video was easy to edit into something that celebrated your friendships. And, overall, we’re really excited that an effort at this scale rolled out as reliably as it did. We’re always honored and humbled when we can help people celebrate their friends.
Thanks to the engineers across the product and infra teams who collaborated in making this possible!
In an effort to be more inclusive in our language, we have edited this post to replace blacklist with blocklist.