Last week we hosted our second Performance @Scale conference. Held at Facebook’s Menlo Park campus, the event brought together performance engineering experts from multiple domains to discuss managing the performance challenges inherent in providing services for millions or even billions of people.
More than 275 attendees gathered to hear performance experts from Alibaba, Facebook, Headspin, LinkedIn, Microsoft, and Netflix talk about the biggest performance challenges they face at their companies. Topics included anomaly detection, scaling web services, and speeding up mobile apps.
For a recap of the conference and the presentations, check out the videos below. If you’re interested in joining the next event, visit the @Scale website or join the @Scale community.
Welcome Remarks
Bill Jia, Vice President of Infrastructure, Facebook
Bill oversees AI/Applied Machine Learning infrastructure, performance and capacity engineering, and hardware validation engineering at Facebook. Since joining Facebook in 2009, Bill has been responsible for defining Facebook’s software product performance strategy and optimizing and planning infrastructure for Facebook.
Execution Graphs: Distributed trace processing for performance regression detection
Andre Vachon, Partner Engineering Manager, Microsoft
Execution Graphs are a data correlation and visualization model used within Azure to understand the performance and reliability of Azure VM deployment operations. Within Azure, multiple services are responsible for various aspects of VM creation and startup. Andre describes the system that understands the various contracts between services and builds complete traces for each VM operation, including timing information. The analytics and UI on top of these graphs enable engineers to easily debug failures and measure timing of sub-operations at scale.
Understanding performance in the wild
Delyan Kratunov, Software Engineer, Facebook
Understanding the performance of an app out in the world is a challenge for teams, tools, and individual developers. Delyan shares his experience at Facebook, including stories of what’s worked, what hasn’t, and how they are thinking about performance today. He also announced the open source release of Profilo, a high-throughput, mobile-first performance tracing library developed at Facebook.
Robust anomaly detection for real user monitoring data
Yang Yang, Senior Staff Software Engineer & Data Scientist, LinkedIn; Ritesh Maheshwari, Senior Staff Software Engineer, LinkedIn
LinkedIn has developed a generic anomaly detection platform for time series metrics called ThirdEye. In this talk, Yang and Ritesh describe their experience onboarding client performance data (RUM) for LinkedIn pages and apps onto ThirdEye, and the lessons learned. They then give an overview of ThirdEye, focusing on the features developed to provide an end-to-end monitoring experience, from data-driven anomaly detection and alert tuning to root cause investigation. The lessons learned and best practices will be useful to any engineering or operations team.
iOS VM and loader considerations
Rico Mariani, Software Engineer, Facebook
Rico discusses how the iOS Virtual Memory system affects the startup time of applications and how the loader’s interactions with the Objective-C programming model create some unexpected problems as well as opportunities for improvement.
Mobile performance testing in real user conditions
Brien Colwell, CTO & Co-Founder, Headspin
Brien introduces a new test infrastructure that can run automated mobile performance tests continuously in real user conditions (networks and locations) and capture data about the network, video, audio, and device for performance analysis. He covers how they designed the infrastructure to be nimbly deployed, challenges with reliability, and technical decisions in implementing the infrastructure. In addition, he explains an analysis framework they developed to export areas of interest in the data, including network and video heuristics. Finally, he walks through user case studies to show how the framework is used to diagnose real-world performance issues in the network, video streaming, and more.
The Magic Modem: A global network emulator
Guy Cirino, Senior Performance Engineer, Netflix
Imagine traveling effortlessly anywhere in the world and experiencing your app’s performance exactly as your users do. Guy demonstrates how to build a Magic Modem that simulates the connectivity of any ISP during development and design stages.
Scaling Alibaba’s real time infrastructure for Global Shopping Holiday
Xiaowei Jiang, Senior Director, Alibaba Group
Alibaba’s 11/11 Global Shopping Festival, hosted annually on November 11, is the world’s largest 24-hour online shopping event. On this day, they process trillions of events in real time to provide people the best experience on their e-commerce platform. In this talk, Xiaowei dives into the scale and performance challenges they faced creating a real-time infrastructure, which included bottlenecks in the network layer, scheduling, and a distributed file system. He also talks about the tools used to identify and analyze these problems.
Static resources at Facebook
Nick Gavalas, Software Engineer, Facebook
Despite the rapid growth of mobile apps, Facebook on the web remains an important interface for users. Efficient mechanisms for delivering static resources are an essential part of the web platform. Learn how Facebook manages static resources at scale to enable engineers to move fast, write less code, run experiments, and support new paradigms while providing the best possible user experience. Nick talks about the challenges they’ve faced, the techniques they used to address them, and the lessons they learned along the way.