Making fast apps and services that scale to millions or billions of people is no simple task. The cost of poor performance is just as real, showing up as slow user experiences and inefficient infrastructure. History is littered with examples of projects that failed to maintain their performance as they scaled. No two performance problems are ever quite the same, but there’s a lot we can learn from one another as an industry.
Last Wednesday, hundreds of engineers came to Facebook’s Menlo Park campus for Performance @Scale, an all-day event dedicated to making technology fast and efficient. Speakers from Facebook, Google, LinkedIn, Microsoft, and Netflix covered topics including low-level system profiling, production measurement, regression detection, and efficient triage.
Videos and talk descriptions from the event are posted below. If you are interested in joining the next event, please check out and follow the @Scale Facebook page.
Presentation summaries and videos
Opening remarks: Evolution of performance
David Mortenson introduced us to the importance and challenges of performance at scale. As technical projects grow in scope, the engineering effort required to maintain or improve performance increases substantially. The early stages of a performance effort are like picking low-hanging fruit, but the most advanced efforts are more akin to rocket science. Rocket science is not easy to do alone, so we all benefit from an open exchange of performance tools and best practices.
Linux 4.x performance: Using BPF superpowers
Brendan Gregg from Netflix kicked off our technical talks with an in-depth presentation on the power of using BPF to analyze performance on Linux systems. The extended Berkeley Packet Filter is a relatively new profiling tool in the performance engineer’s toolbox that lets analysts run extremely efficient profiling code inside an in-kernel virtual machine. Brendan showed us how to write a BPF program, walked through examples of useful metrics, and demonstrated a powerful way to visualize results using flame graphs. In particular, he measured how long threads were blocked and how those threads were ultimately woken up. By following the chain of wakeup events across threads, Brendan showed how BPF and flame graphs can root-cause blocked threads through user and kernel code, often all the way down to the metal.
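For readers who want to try this style of analysis themselves, here is a minimal sketch in the spirit of the talk, written against the bcc Python toolkit (an assumption; the talk also covered other BPF front ends). It timestamps context switches to build a rough histogram of how long threads spend blocked off-CPU. The probe name and the simplified thread handling are illustrative, not production-ready.

```python
from bcc import BPF
import time

# BPF program: record when each thread leaves the CPU, and when a thread
# comes back on-CPU, add its blocked time to a log2 histogram.
prog = r"""
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>

BPF_HASH(start, u32, u64);       // thread id -> timestamp when it went off-CPU
BPF_HISTOGRAM(offcpu_us);        // log2 histogram of off-CPU time (microseconds)

int trace_switch(struct pt_regs *ctx, struct task_struct *prev) {
    u64 ts = bpf_ktime_get_ns();

    // The previous task is leaving the CPU: remember when.
    u32 prev_pid = prev->pid;
    start.update(&prev_pid, &ts);

    // The current task is coming back on-CPU: compute how long it was blocked.
    u32 pid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&pid);
    if (tsp != 0) {
        offcpu_us.increment(bpf_log2l((ts - *tsp) / 1000));
        start.delete(&pid);
    }
    return 0;
}
"""

b = BPF(text=prog)
# On some kernels this symbol is inlined or renamed (e.g. finish_task_switch.isra.0).
b.attach_kprobe(event="finish_task_switch", fn_name="trace_switch")

print("Tracing off-CPU time... hit Ctrl-C to print the histogram.")
try:
    time.sleep(999999)
except KeyboardInterrupt:
    pass
b["offcpu_us"].print_log2_hist("usecs")
```

Feeding the per-stack version of this data into a flame graph is what turns the raw numbers into the kind of visualization Brendan demonstrated.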
Web speed at Facebook
Ben Maurer kept things rolling with an end-to-end overview of the challenges of web performance at scale and the solutions Facebook has designed. Web pages can have multiple stages of completion (is a page done when the first pixel renders, or when the last element downloads?), so the first task is to define a metric that reflects the user experience. At Facebook, Ben and his team have found that Time-to-Interact (minimum usable page content) and Display Done (all page content) are good metrics for guiding their work. Ben then showed off the tools they use to measure and analyze the data they collect. Finally, Ben walked through the tools Facebook has developed to make things faster. Downloading code is on the critical path of any web request, and mitigating its cost calls for several creative solutions: the team minimizes time to first byte on the client with Early Flush, prioritizes content within a response using BigPipe, and lazy-loads JavaScript with the Bootloader framework.
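As a rough illustration of the early-flush idea (not Facebook’s actual implementation), the sketch below streams the static head of a page as soon as the request arrives, so the browser can start fetching CSS and JavaScript while the server is still generating the personalized body. The page fragments are made up, and whether each chunk actually reaches the network immediately depends on the server and any proxies in between.

```python
import time
from wsgiref.simple_server import make_server

# Hypothetical page fragments; in a real system the head references the
# static resources needed to start rendering as early as possible.
HEAD = b"<html><head><link rel='stylesheet' href='/site.css'></head><body>"
TAIL = b"</body></html>"

def generate_body():
    time.sleep(0.5)   # stand-in for data fetching and server-side rendering
    return b"<div id='feed'>...personalized content...</div>"

def app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/html")])

    def stream():
        # Early flush: the head goes out before any expensive work happens,
        # so the client can begin downloading static resources immediately.
        yield HEAD
        yield generate_body()   # slow, personalized content arrives later
        yield TAIL

    return stream()

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()
```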
Automatic regression triaging at Facebook
Guilin Chen shifted focus to backend server efficiency. At Facebook’s scale, even small regressions can have major implications for site efficiency. The team pushes massive amounts of code to production every week, and catching regressions early without slowing developers down is a big challenge. After a quick overview of the Facebook release process, Guilin stepped through how regressions are identified and fixed using AutoTriage. The team starts by logging performance-tracking metrics for the products they care about. Once a regression is observed, Stack Trace Finder maps it to a candidate list of offending functions. A tool called Pushed Commit Search then locates all diffs that changed those functions, and a Diff Ranker algorithm prioritizes the diffs by their likelihood of having introduced the regression. With these steps chained together into the AutoTriage system, the team has largely automated the most tedious aspects of regression analysis.
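The talk did not spell out AutoTriage’s internals, so the sketch below is only a hypothetical illustration of the final ranking step: given the functions implicated by a regression and the commits that touched them, score each commit by how tightly it overlaps with the suspect functions. The scoring rule and all names are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Commit:
    id: str
    touched_functions: set[str]   # functions modified by this commit

def rank_suspect_commits(regressed_functions, commits):
    """Order commits by how strongly they overlap with the functions
    implicated by the regression's stack traces (a toy scoring rule)."""
    def score(commit):
        overlap = regressed_functions & commit.touched_functions
        # Weight overlap against the size of the commit: a small commit that
        # touches exactly the hot functions is more suspicious than a huge
        # refactor that happens to touch one of them.
        return len(overlap) / max(len(commit.touched_functions), 1)
    return sorted(commits, key=score, reverse=True)

# Example usage with made-up data.
suspects = rank_suspect_commits(
    regressed_functions={"render_feed", "fetch_story"},
    commits=[
        Commit("D101", {"render_feed", "fetch_story"}),
        Commit("D102", {"fetch_story", "log_impression", "parse_config"}),
        Commit("D103", {"update_profile_photo"}),
    ],
)
print([c.id for c in suspects])   # D101 first: highest overlap, smallest scope
```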
Sifting for gold: Increasing ad revenue by improving performance
Daniel Greenia continued the theme of improving the triage process to make better use of human analysis. Advertisers are the lifeblood of Google’s AdWords business, but at Google’s scale it can be difficult to determine which advertisers are real and which are fraudsters. Daniel described the system Google uses to detect fraud, showing how machine learning can catch many, but not all, malicious accounts. For the accounts in the “gray zone,” trained human analysts are still the last line of defense, and even with the automatic classifier they had more work to do than time in which to do it. To make better use of that time, the team ranked each suspicious account according to how much “good money” or “bad money” was at stake. From this a pattern emerged, producing an extremely simple formula that analysts could use to decide which accounts were worth reviewing manually, and in what order. With this work, the team improved their performance severalfold with just a fraction of the effort.
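The actual formula was not part of this recap, so the snippet below is only a hypothetical version of the idea: prioritize accounts where the expected fraud loss outweighs the revenue that would be lost by suspending a legitimate advertiser, and review them in that order. All field names and numbers are invented.

```python
def review_priority(account):
    """Toy prioritization: expected "bad money" recovered by a manual review,
    minus the expected "good money" lost if a legitimate advertiser is
    wrongly suspended. p_fraud, refund_risk, and spend are hypothetical."""
    expected_bad_money = account["p_fraud"] * account["refund_risk"]
    expected_good_money = (1 - account["p_fraud"]) * account["spend"]
    return expected_bad_money - expected_good_money

accounts = [
    {"id": "A", "p_fraud": 0.9, "refund_risk": 5000, "spend": 200},
    {"id": "B", "p_fraud": 0.4, "refund_risk": 300,  "spend": 4000},
    {"id": "C", "p_fraud": 0.6, "refund_risk": 2500, "spend": 800},
]

# Review the highest-priority accounts first.
for acct in sorted(accounts, key=review_priority, reverse=True):
    print(acct["id"], round(review_priority(acct), 1))
```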
The keys to actionable perf investigations
After a break for lunch, Vance Morrison from Microsoft pulled us into the world of client-side performance analysis using PerfView. Over the course of his career, Vance has done countless performance investigations of the .NET runtime, and he has distilled those experiences into a few key takeaways. He walked us through his investigative process with a live demo of PerfView. Focusing specifically on CPU time spent in threads, he showed how to collect traces of an example app and dig into the call trees. A major difficulty of stack trace analysis is the sheer quantity of data, so Vance focused on how to narrow traces down to just the meaningful parts: hiding external processes and libraries, semantically grouping subtrees, and focusing on leaf nodes instead of root nodes. He then showed the weakness of simple stack trace analysis when diagnosing asynchronous code, and how PerfView can trace execution across asynchronous boundaries, making a historically difficult problem easy. Lastly, he showed a technique for importing JSON-formatted trace data into PerfView, which makes it possible for any platform to export its trace data to PerfView for analysis.
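The grouping idea is tool-agnostic, so here is a small language-neutral sketch (not PerfView itself) of the kind of folding Vance described: collapse thousands of distinct stacks into a handful of aggregate buckets, attributing each sample to the module of its leaf frame. The frame format and sample data are invented.

```python
from collections import Counter

def module_of(frame):
    """Frames are written as 'module!Function'; group by the module name."""
    return frame.split("!", 1)[0]

def group_samples_by_module(samples):
    """samples: list of (stack, count) pairs, where stack is a list of frames
    ordered from root to leaf. Attribute each sample to the module of its
    leaf frame, mirroring a 'focus on leaf nodes' view."""
    totals = Counter()
    for stack, count in samples:
        totals[module_of(stack[-1])] += count
    return totals

samples = [
    (["app!Main", "app!RenderLoop", "gfx!DrawFrame"], 120),
    (["app!Main", "app!RenderLoop", "gfx!DrawFrame", "kernel!memcpy"], 45),
    (["app!Main", "net!ReadSocket"], 30),
]
print(group_samples_by_module(samples).most_common())
# [('gfx', 120), ('kernel', 45), ('net', 30)]
```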
Evolution of high-performance networking in Chromium
After wowing the audience with some surprise sleight-of-hand magic, Jim Roskind of Google gave us a taste of the power of gathering metrics at scale to guide performance engineering. Jim started his talk with an overview of client-side histograms. Histograms in Chromium are extremely fast at runtime: a one-time “slow” setup path allocates the histogram buckets and defines their dynamic range, and after setup every update is lock-free and lightning-quick. The framework has a simple developer API for incrementing counters, which lets engineers record a metric with as few as two or three lines of code. Jim then showed examples of successful investigations into DNS resolution, TCP connection latency, UDP reachability, and the efficacy of forward error correction (FEC). These findings influenced the design of the QUIC network protocol, which is used heavily by Google.
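The sketch below is a simplified, Python-flavored version of that pattern, not Chromium’s actual macros or thread-safety machinery: pay a one-time setup cost to precompute exponentially spaced bucket boundaries, then make each recorded sample a cheap lookup-and-increment.

```python
import bisect

class ExponentialHistogram:
    """Counts samples in exponentially spaced buckets between min and max.
    Setup does all the allocation up front; record() is just a bisect plus
    an increment."""

    def __init__(self, minimum, maximum, bucket_count):
        # One-time "slow" path: compute bucket boundaries.
        self.bounds = [float(minimum)]
        ratio = (maximum / minimum) ** (1.0 / (bucket_count - 1))
        for _ in range(bucket_count - 1):
            self.bounds.append(self.bounds[-1] * ratio)
        # Index 0 holds samples below the minimum; the rest map to boundaries.
        self.counts = [0] * (bucket_count + 1)

    def record(self, sample):
        self.counts[bisect.bisect_right(self.bounds, sample)] += 1

# Usage: roughly the "two or three lines of code" shape described in the talk.
dns_ms = ExponentialHistogram(1, 10_000, bucket_count=50)
dns_ms.record(23)
dns_ms.record(480)
```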
Real-world performance data for mobile
Michelle Filiba of Facebook introduced us to the challenges of mobile performance engineering at scale and the opportunities provided by detailed mobile telemetry. The Loom telemetry framework collects diagnostic information from production to help performance analysts investigate issues. The measurement system itself must be fast, so the team invested in an optimized lock-free ring buffer as the core logging mechanism. Michelle showed how engineers can enable different event providers in their traces, how traces are uploaded to the server, and how engineers look for performance issues in them. She closed with example issues the team has diagnosed using Loom, including delayed and extraneous network requests, and with their plans to extend the system in the future.
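Python cannot express a truly lock-free structure the way a native buffer can, but the sketch below shows the shape of the idea: a fixed-size ring that overwrites the oldest events, so tracing never blocks and never grows without bound. It is a minimal stand-in, not Loom’s implementation.

```python
class TraceRingBuffer:
    """Fixed-capacity event buffer that overwrites the oldest entries.
    A real implementation would use atomic cursors so writers never take a
    lock; this sketch only demonstrates the bounded-overwrite behavior."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.cursor = 0   # total number of events ever written

    def write(self, event):
        self.slots[self.cursor % len(self.slots)] = event
        self.cursor += 1

    def snapshot(self):
        """Return the surviving events in write order, oldest first."""
        n = len(self.slots)
        start = max(0, self.cursor - n)
        return [self.slots[i % n] for i in range(start, self.cursor)]

buf = TraceRingBuffer(capacity=4)
for i in range(6):
    buf.write({"seq": i, "name": "event"})
print([e["seq"] for e in buf.snapshot()])   # [2, 3, 4, 5]: oldest two overwritten
```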
Visualizing and optimizing real user performance on mobile
Anant Rao kept the mobile conversation going, focusing on the new visualization techniques LinkedIn has been using to find opportunities in its mobile apps. Real User Monitoring, or RUM, has long been used at LinkedIn to track the performance of real user experiences in production. However, the team found they were running up against the limits of what that data could tell them, so they set out to make RUMv2 much more detailed. The first step was to integrate instrumentation points deeply into core frameworks such as the networking and parsing libraries. This freed product engineers from instrumenting their own product flows and made high-quality metrics available for every interaction. The second key component was visualizing the data: with production data for every interaction, engineers could quickly find and fix systemic issues. Anant described three such wins before finishing with a peek at their Anomaly Detection system.
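A hypothetical Python analogue of the “instrument the framework, not the product code” idea: wrap the shared networking call once, and every product flow that goes through it gets timing data for free. The decorator, the emit_metric sink, and the fetch function are all invented for illustration.

```python
import functools
import time

def timed_span(metric_name):
    """Decorator applied once inside a core library so that every caller
    automatically emits a timing metric."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                emit_metric(metric_name, (time.monotonic() - start) * 1000)
        return inner
    return wrap

def emit_metric(name, duration_ms):
    # Placeholder for whatever RUM/telemetry pipeline is actually in use.
    print(f"{name}: {duration_ms:.1f} ms")

@timed_span("network.request")
def fetch(url):
    time.sleep(0.05)   # stand-in for the real networking stack
    return "response for " + url

fetch("https://example.com/feed")   # product code needs no extra instrumentation
```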
Thanks again to all the speakers who presented at Performance @Scale, and thanks to everyone who attended. We hope to see you all at more @Scale events in the future!