A few weeks ago we hit a milestone of 50 million usernames, just over a month after we launched usernames on June 12. Since the launch, many people have expressed interest in understanding how we designed the system and prepared for this big event. In a recent post, my colleague Tom Cook wrote about the site reliability and infrastructure work that we did to ensure a smooth launch. As an extension to that post, I’ll discuss some specific application and system design issues here. Launching usernames to allow over 200 million (at the time — we’re now over 250 million) people to get a username at the same time presented some really interesting performance and site reliability challenges. The two main parts of the system that needed to scale were (1) the availability checker and (2) the username assigner. Since we were pre-generating suggestions for users, we needed to check the availability of all the suggested names, which placed extra load on the availability checker.
Optimizing read and write performance
It became clear to us that the database tier would not be able to handle the huge initial load for availability checks. Even caching the results of availability check calls would not have helped much since the hit rates would be low. To solve these problems, we created a separate memcache tier to store all assigned usernames. Checking if a username is available is just a quick lookup in this memcache tier. If the lookup returns no result, we assume the name is available. This allowed us to completely eliminate any dependency on the database tier for availability checks.
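To make this concrete, here is a minimal sketch of that availability check in Python, assuming a standard memcache client (python-memcached) and a hypothetical "username:&lt;name&gt;" key scheme; it illustrates the idea rather than showing our actual code.

```python
# Minimal sketch of the availability check, assuming the python-memcached
# client and a hypothetical "username:<name>" key scheme.
import memcache

# Replicated username tier; the node address here is a placeholder.
username_cache = memcache.Client(["10.0.0.1:11211"])

def is_username_available(name):
    """Treat a cache miss as "available"; assigned names are always present."""
    key = "username:" + name.lower()
    return username_cache.get(key) is None
```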
To distribute the availability check load across several memcache nodes, we replicated the cache across several machines in each of our data centers. We allocated about 1TB of memory for the entire username memcache tier. This design meant that we were using memcache as the authoritative data source for checking availability. This was a non-trivial decision to make, and we had to design special fault tolerance mechanisms (described in the next section) to make this reliable. When a username is assigned, the data is written to the database for a reliable, persistent record of the transaction. The username is also added to all the nodes in the replicated memcache tier. Writing the data to multiple memcache nodes implied a slight drop in write performance, but since we expected the read load to be much higher than the write load, this was a good trade-off to make.
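A rough sketch of that write path, under the same assumptions plus a stubbed-out database helper, looks like this: persist the assignment first, then fan it out to every cache replica.

```python
# Sketch of the write path: persist the assignment in the database, then
# push the name into every replica of the username memcache tier.
import memcache

# One client per replica so a write fans out to every node (addresses are
# placeholders).
replica_nodes = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]
replicas = [memcache.Client([node]) for node in replica_nodes]

def db_insert_username(user_id, name):
    """Stand-in for the real database write, the persistent source of truth."""
    ...

def assign_username(user_id, name):
    key = "username:" + name.lower()
    db_insert_username(user_id, name)   # reliable, persistent record
    for replica in replicas:            # slight write cost, big read win
        replica.set(key, user_id)
```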
To detect conflicts when multiple users try to grab a name at the same time, we used an optimistic concurrency control mechanism, which improved write performance by eliminating the need to hold locks. We also briefly considered using Bloom filters, but quickly concluded that they weren’t the best fit for our problem because (1) space efficiency was not a primary concern, since we could fit many hundreds of millions of usernames in memory on a single machine, (2) Bloom filters can return false positives (incorrect hits), and (3) removing items from them is not simple.
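One common way to get that optimistic behavior is to lean on a unique constraint and let the database reject the losing writer, rather than taking a lock up front. The sketch below uses SQLite purely as a stand-in for the real storage layer.

```python
# Optimistic conflict detection via a unique constraint: both writers attempt
# the insert, and the database rejects whichever one arrives second.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE usernames (name TEXT PRIMARY KEY, user_id INTEGER)")

def try_assign(name, user_id):
    """Return True if we won the name, False if someone else got it first."""
    try:
        with db:  # commit on success, roll back on error
            db.execute("INSERT INTO usernames (name, user_id) VALUES (?, ?)",
                       (name.lower(), user_id))
        return True
    except sqlite3.IntegrityError:
        # Another user grabbed the name between our check and our write.
        return False
```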
Fault tolerance
One of the issues with using memcache as the system of record for availability checks is that memcache nodes can go down. While the redundant memcache boxes provided some fault tolerance, we wanted a system that would let us bring failed nodes back easily. To enable that, we wrote a script that can repopulate a username memcache node from log files containing all the assigned usernames; these logs are written to Scribe as part of assigning a username. We also used the Scribe logs to build a data collector that reported the number of assigned usernames in real time, with a latency of just a few seconds.

Another issue with memcache is that writes are not transactional and are not guaranteed to succeed every time. This means we might occasionally say a username is available when it really isn’t. If a user then tries to grab such a name, the attempt will fail, since the database is the ultimate source of truth. Because that is not an ideal user experience, we made the system more robust through a couple of mechanisms. First, to reduce the probability of incorrect misses, we always check a second memcache node for any miss in the first node. Second, whenever assigning a username fails in the database because the name is already taken, we re-populate all memcache nodes with that name so that future users don’t hit the same problem.
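Continuing the earlier sketches (the replicas list and the SQLite-backed try_assign() are still hypothetical stand-ins), those two mechanisms look roughly like this: a miss is only trusted after a second node agrees, and a failed database insert repairs the cache on every replica.

```python
# Robustness sketch, reusing the hypothetical replicas list and try_assign()
# from the earlier snippets.

def is_username_available(name):
    key = "username:" + name.lower()
    # Only trust a miss if a second replica also has no entry for the name.
    for replica in replicas[:2]:
        if replica.get(key) is not None:
            return False
    return True

def assign_username(user_id, name):
    key = "username:" + name.lower()
    if not try_assign(name, user_id):
        # The database is the source of truth: the name was already taken,
        # so repair every cache replica to stop future false "available"s.
        for replica in replicas:
            replica.set(key, True)
        return False
    for replica in replicas:
        replica.set(key, user_id)
    return True
```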
Load Testing
Since it was difficult to get accurate estimates of the number of users that would log in at launch time to get a username, we tested our systems under fairly high load; in fact, we stress tested our system with more than 10x the load we actually saw at launch. The load testing helped us identify several problems in our infrastructure, including (1) improperly configured networks, (2) bottlenecks in our database ID generation mechanism (used to generate primary keys for certain objects), and (3) capacity bottlenecks for write traffic originating from our east coast data center.
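For a flavor of what such a stress test can look like, here is a toy load generator in the same spirit; the worker count, request volume, and random-name generator are arbitrary choices for illustration, not our actual harness.

```python
# Toy load generator: many threads hammer the availability check with random
# candidate names and report aggregate throughput. All numbers are arbitrary.
import random
import string
import threading
import time

WORKERS = 50
REQUESTS_PER_WORKER = 10000

def random_name():
    return "".join(random.choice(string.ascii_lowercase) for _ in range(8))

def worker():
    for _ in range(REQUESTS_PER_WORKER):
        is_username_available(random_name())  # function from the sketch above

start = time.time()
threads = [threading.Thread(target=worker) for _ in range(WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = WORKERS * REQUESTS_PER_WORKER
elapsed = time.time() - start
print("%d checks in %.1fs (%.0f checks/s)" % (total, elapsed, total / elapsed))
```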
Contingency Planning
Since we didn’t have accurate estimates of the traffic the launch would generate, we put together contingency plans to shed load on various parts of the site, giving us extra capacity in core services at the expense of less essential ones. Affectionately referred to as “nuclear options”, these levers included disabling chat notifications, showing fewer stories on the home page and profile pages, and completely turning off other parts of the site, such as the People You May Know service and the entire chat bar.

Our careful design and planning paid off on launch night and afterward. In the first three minutes, over 200,000 people registered names; over 1 million were allocated in the first hour, and over 50 million in just over a month. Throughout the launch we had no issues handling the additional load, and none of our “nuclear options” had to be used at any point. Two of our memcache nodes did go down a few weeks after launch, but our Scribe log replay scripts helped us bring them back up quickly.

Srinivas Narayanan, an engineer at Facebook, is excited about being able to visit most of his friends’ profiles through their usernames.