Mark Callaghan joined Facebook in 2009. Now, working from home in Oregon on the database infrastructure team, he does whatever it takes to keep the database running and users happy, including Trekkie George Takei, a favorite source of updates among Mark's team. Read on for Mark's insights into fast-storage challenges, Facebook's MySQL roadmap, and more.
Q: How does your team think about getting MySQL to scale on Facebook’s multi-core servers with fast storage?
A: The community has been working on multi-core issues for several years, but the fast-storage problems are especially new and fun because they expose performance bottlenecks and require us to study parts of the server that were never under this much demand before we tried to speed them up. In the past, the database servers that MySQL was using wouldn't do more than a thousand evictions per second, but now we need them to do 10,000 per second. Any inefficiencies in that part of the server are magnified now, and we're still trying to figure out how to remove them. To do this, we change one or two things at a time, see what we broke, and then change another thing, so that we're always advancing incrementally.
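Why fixed per-eviction costs get magnified at 10,000 evictions per second can be pictured with a toy buffer pool: every eviction passes through the same lock, so a small overhead paid once per eviction becomes a serialization point at high rates. A minimal Python sketch, assuming an LRU policy and a single pool lock (the class and names are illustrative, not InnoDB's actual structures):

```python
import threading
from collections import OrderedDict

class BufferPool:
    """Toy LRU buffer pool. A single lock protects the LRU list, so
    every eviction serializes on it -- a stand-in for how a hot code
    path magnifies per-eviction overhead at 10,000 evictions/second."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lock = threading.Lock()
        self.pages = OrderedDict()  # page_id -> page data, in LRU order
        self.evictions = 0

    def get(self, page_id, read_from_disk):
        with self.lock:
            if page_id in self.pages:
                self.pages.move_to_end(page_id)  # mark most-recently-used
                return self.pages[page_id]
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)   # evict least-recently-used
                self.evictions += 1
            data = read_from_disk(page_id)
            self.pages[page_id] = data
            return data

pool = BufferPool(capacity=100)
for i in range(1000):
    pool.get(i, read_from_disk=lambda pid: b"page-%d" % pid)
print(pool.evictions)  # 900: every miss past capacity forces one eviction
```

With slow disks, the disk itself bounds the eviction rate and the lock barely matters; with fast storage, the lock and bookkeeping become the bottleneck, which is why this part of the server needed fresh study.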
Q: What work is your team doing with InnoDB compression?
A: We have a lot of data, and InnoDB compression could theoretically allow us to reduce the size of our database by a factor of two. MySQL/Oracle advertised this feature as one that's meant for a less-demanding, read-oriented workload, but our workload is more complex. We have frequent writes because of all the status updates and Likes, and everything goes through the database, so we're trying to adapt InnoDB so it works on a more challenging workload and on faster storage devices. This presents interesting challenges, since the bottlenecks and pileups that occur at the serialization point are exaggerated because there's more work being done at that choke point than there used to be.
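The factor-of-two goal comes from compressing each page and requiring the result to fit in a smaller on-disk page; when it doesn't fit, the page must be reorganized and retried, which is the expensive path that write-heavy workloads hit often. A hedged Python sketch of that fit-or-split decision, using zlib and illustrative sizes (this is not InnoDB's real code path):

```python
import os
import zlib

PAGE_SIZE = 16 * 1024       # uncompressed page size
COMPRESSED_SIZE = 8 * 1024  # on-disk target for a 2x size reduction

def try_compress(page):
    """Return the compressed image if it fits the on-disk page size,
    else None, meaning the page must be split/reorganized and retried --
    the costly path that frequent writes make much more common."""
    out = zlib.compress(page, 6)
    return out if len(out) <= COMPRESSED_SIZE else None

# Repetitive data (status updates, Likes) compresses well past 2x.
compressible = (b"status update, Like, status update " * 470)[:PAGE_SIZE]
# Random data cannot be compressed and fails the 2x fit check.
incompressible = os.urandom(PAGE_SIZE)

print(try_compress(compressible) is not None)   # True: fits in 8 KB
print(try_compress(incompressible) is None)     # True: must be split
```

On a read-oriented workload the retry path is rare; with frequent writes, failed fits and re-compression pile up at the same serialization point the answer describes.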
Q: What does a typical day look like for you?
A: I start by checking the health of the database tier, replying to email, and doing code reviews. I frequently have performance tests running and need to check results or start new tests. Then I try to use the rest of the day for programming. Half of my programming effort is devoted to fixing things that stall MySQL, and the other half is devoted to making MySQL faster.
Q: How do you make MySQL both “less slow” and “faster” at the same time?
A: I ask questions like, “If I can make it do 10 things per second today, can I make it do 20 things per second tomorrow?” For example, we used to use a very CPU-intensive algorithm to checksum database pages. We profiled the servers in production to see how much time they were spending computing those checksums, and realized that the newest x86 CPUs had hardware support for a faster way to do it. Another person on my team, Ryan Mack, modified MySQL to use CRC32 for checksums. The hard part there was upgrading the servers on the fly from the old checksums to the new checksums without taking the site down.
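One common way to do that kind of on-the-fly upgrade is to write all new pages with the new checksum while still accepting the old one on read, so pages in both formats coexist until everything has been rewritten. A minimal Python sketch of the idea, using zlib's adler32 as a stand-in for the old, slower algorithm and crc32 (which has a dedicated instruction on modern x86 CPUs) as the new one; the function names are illustrative, not MySQL's:

```python
import zlib

def old_checksum(page):
    # stand-in for the original CPU-intensive checksum algorithm
    return zlib.adler32(page)

def new_checksum(page):
    # CRC32, which modern x86 CPUs can compute with a hardware instruction
    return zlib.crc32(page)

def write_page(page):
    # all new writes stamp pages with the new checksum
    return page, new_checksum(page)

def verify_page(page, stored):
    # during the migration, accept either checksum on read, so old-format
    # and new-format pages coexist and no downtime is needed
    return stored in (new_checksum(page), old_checksum(page))

page = b"some row data"
print(verify_page(*write_page(page)))          # True: new-format page verifies
print(verify_page(page, old_checksum(page)))   # True: old-format page still verifies
print(verify_page(page, 0xDEADBEEF))           # False: corruption is detected
```

Once every page has been rewritten in the new format, support for the old checksum can be dropped.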
Q: How do you manage the responsibility of keeping the database up and running?
A: I reduce risk by focusing on making things easily debuggable and predictable. Someone from our team is on call at all times, so we make sure to optimize the system so the people running the database can sleep at night. Finding the balance between allowing the ops team to sleep while still allowing the eng team to move fast is important.
Q: What do you think makes a database engineer great?
A: The analogy I make is like Scotty in Star Trek when he says, “We’re doing what we can.” I think that attitude boils down to curiosity and persistence. When there is a wide search space over a problem, you need to be willing to explore a large number of different systems to figure out where the problem is occurring.
At Facebook, the quality of service we get from MySQL is much better than what you might expect from just reading the manual. Our operations team is able to work around MySQL’s imperfections in a way that allows engineering to move really fast. If you just looked at the software that we’re using and the rate at which we change, you would not expect us to get that quality of service, and the reason we do is that our engineers are very careful and very good at what they do.
Q: Why is the work you do here meaningful to you?
A: I am amazed by the value of sharing, and I want to do my part to make Facebook better to encourage that sharing. My family is in close contact via Facebook, and they get a lot, maybe too many, pictures of my kids. I am also a huge fan of George Takei’s Page, as are many of my coworkers. Since we’ve all Liked his Page, a while back some of us saw an update from him about an inconsistency in his Facebook experience. We realized what he was experiencing was an issue we were already trying to fix on the database side, so when we saw him post, it gave us more information that helped us get closer to resolving the issue. This allowed us to improve his experience, and in turn, the experience of everyone else on Facebook.
Q: What advice do you have for other engineers?
A: Become an expert in something, spend some time near production, and learn from your coworkers. Dealing with production database deployments for seven years has made me a better database developer, and I am much more effective because I get help from my coworkers who have expertise in Linux internals, XFS, jemalloc, and performance debugging.