Building and operating systems that serve billions of people presents unprecedented, complex engineering challenges. The second Systems @Scale event of 2019 was held in New York, where engineers gathered for a day of technical talks focused on observability: the tools, techniques, and approaches for observing the state of complex distributed systems. Speakers from various companies, including Etsy, Facebook, Google, LightStep, Squarespace, and Uber, discussed a wide spectrum of topics, from the concept of deep systems, to novel visualizations of complex distributed traces, to the challenges of effecting cultural change that fosters observability within an organization.
If you missed the event, you can view recordings of the presentations below. If you are interested in future events, visit the @Scale website or join the @Scale community.
Keynote: How “deep systems” broke observability … and what we can do about it
Ben Sigelman, CEO and Cofounder, LightStep
Large-scale systems aren’t simply larger versions of small-scale systems — they are something completely different. Enter the deep system. Deep systems aren’t merely larger; they’re orders of magnitude more elaborate. In fact, they are so much more structurally complex that they deserve their own category. They are distributed, they are layered, they are concurrent, they are multi-tenant, they change continuously, and they are extremely difficult to observe. Once ordinary systems become deep systems, cookie-cutter observability falls over. The transition from a shallow system to a deep system is often abrupt: You don’t realize you are operating a deep system until the day you receive a bug report and have no idea which service could be responsible.
Ben walks through the design and natural evolution of deep systems and explains why conventional metrics-first approaches to observability break down along the way. He also shows why distributed tracing should become the backbone for observability in deep systems at scale.
A tale of two performance analysis tools
Helga Gudmundsdottir, Software Engineer, Facebook
Identifying the root causes of regressions, alongside opportunities for optimization, is an important challenge for engineers trying to achieve their performance goals. Sophisticated analysis and visualization tools allow engineers to gain insights and draw conclusions from collected performance and observability data. Helga shares her experience investigating performance regressions using two of Facebook’s in-house performance analysis tools, CV and Tracery.
Comprehending incomprehensible architecture
Yuri Shkuro, Software Engineer, Uber
The industry has embraced microservices as an architectural pattern, adding complexity to distributed systems. Distributed tracing has emerged as a solution for understanding what’s going on in these architectures. Yuri starts with a refresher on distributed tracing as a core observability tool in modern systems, from the single-trace view to aggregate analysis. He then conveys how Uber uses data mining, complexity reduction, and intuitive visualizations to bring real traces back into the realm of human comprehension, guide users toward actionable insights about the root causes of outages, and reduce time to mitigation.
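For readers who want a concrete starting point, here is a minimal sketch of the kind of instrumentation a trace is built from, using the Python client for Jaeger, the open source tracer that originated at Uber. The service and operation names are hypothetical, and the talk itself is not tied to any particular client library.

```python
from jaeger_client import Config

# Report every span for this toy example; production setups sample.
tracer = Config(
    config={"sampler": {"type": "const", "param": 1}},
    service_name="checkout",  # hypothetical service name
).initialize_tracer()

# Each unit of work becomes a span; nesting spans builds the trace tree
# that single-trace views and aggregate analyses both consume.
with tracer.start_active_span("handle-order") as scope:
    scope.span.set_tag("user_id", "42")
    with tracer.start_active_span("charge-card") as child:
        child.span.log_kv({"event": "payment-authorized"})

tracer.close()
```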
A picture is worth 1,000 traces
Spiros Xanthos, CEO and Founder, Omnition
Constance Caramanolis, Software Engineer, Omnition
A single trace can reveal many things: network latencies, time spent in databases, a service spinning idly, and more. However, finding the trace that demonstrates a problem in a large distributed application is very challenging. Spiros and Constance convey the impact of aggregate analysis of distributed traces and highlight its applications beyond performance troubleshooting. They demonstrate that by looking at traces in aggregate, we can eliminate the need to state and validate hypotheses, allowing answers to naturally emerge.
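As a toy illustration of that idea, the sketch below aggregates span durations by service and operation and reports tail latencies, so an anomalous operation surfaces on its own rather than through trace-by-trace hypothesis testing. The span data and field layout are hypothetical, not Omnition's format.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical flattened spans: (service, operation, duration in ms),
# as they might be exported from a tracing backend.
spans = [
    ("checkout", "charge-card", 12.0),
    ("checkout", "charge-card", 430.0),
    ("checkout", "charge-card", 15.0),
    ("inventory", "reserve", 8.0),
    ("inventory", "reserve", 9.0),
]

by_op = defaultdict(list)
for service, operation, duration_ms in spans:
    by_op[(service, operation)].append(duration_ms)

for (service, operation), durations in sorted(by_op.items()):
    cuts = quantiles(durations, n=100)  # needs at least two samples
    print(f"{service}/{operation}: n={len(durations)} "
          f"p50={cuts[49]:.1f}ms p99={cuts[98]:.1f}ms")
```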
Service efficiency at Instagram scale
Dave Marchevsky, Software Engineer, Facebook
Pranav Thulasiram Bhat, Software Engineer, Instagram
Instagram is growing quickly, with more developers adding more features for more users. Dave and Pranav provide an overview of the profiling framework used to understand the production performance of Instagram’s web server. They share how data is processed for regression detection and general efficiency work, and walk through previous iterations of this system to understand changes and improvements.
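At its core, regression detection of this kind can be as simple as diffing aggregated profile data between releases. The sketch below is a hypothetical illustration of that step, not Instagram's actual pipeline: it compares per-function CPU totals from sampled profiles of two builds and flags outsized growth.

```python
# Hypothetical per-function CPU seconds, aggregated from sampled
# production profiles of two releases.
baseline = {"render_feed": 120.0, "fetch_media": 80.0, "auth_check": 15.0}
candidate = {"render_feed": 121.5, "fetch_media": 112.0, "auth_check": 15.1}

THRESHOLD = 0.10  # flag anything that grew more than 10%

for func, before in sorted(baseline.items()):
    after = candidate.get(func, 0.0)
    growth = (after - before) / before
    if growth > THRESHOLD:
        print(f"possible regression: {func} +{growth:.0%}")
```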
Monarch, Google’s planet-scale monitoring infrastructure
George Talbot, Staff Software Engineer, Google
George discusses Google’s planetwide monitoring system, Monarch, highlighting the challenges and solutions the company has encountered. He conveys how implementation and design decisions affect scaling, including the effect of process concurrency on overall distributed-system properties, the value of pushing queries down to the data, and the importance of controlling query fanout.
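Those last two ideas are easy to miniaturize. In the hypothetical scatter-gather sketch below, each leaf evaluates the filter and partially aggregates locally (the query is pushed down to the data), and a bounded worker pool caps how many leaf queries are in flight at once (fanout control). None of the names reflect Monarch's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical leaf shards, each holding a slice of the time series.
SHARDS = [f"leaf-{i}" for i in range(64)]

def query_leaf(shard: str, predicate: str) -> dict:
    # Stand-in for an RPC: the leaf applies the filter and aggregates
    # locally, returning only a small summary instead of raw points.
    return {"shard": shard, "matching_points": 1_000}

# A bounded pool keeps at most 8 leaf queries in flight, rather than
# fanning out to all 64 shards simultaneously.
with ThreadPoolExecutor(max_workers=8) as pool:
    partials = list(pool.map(lambda s: query_leaf(s, "cpu > 0.9"), SHARDS))

print("total:", sum(p["matching_points"] for p in partials))
```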
Observability, it’s bigger than production
Gordon Radlein, Director of Engineering, Etsy
Discussions about observability and its uses are often rooted in the context of production services. Gordon expands the conversation around observability, exploring opportunities for organizational improvement that exist outside of production. He demonstrates that on-call health, capacity planning, client code, cloud costs, HR workflows, and even query patterns are all important aspects of a business and that they are bursting with information and opportunities when organizations are given the ability to analyze them.
Scribe observability: Monitoring a message bus at scale
Dino Wernli, Software Engineer, Facebook
Cristina Opriceana, Production Engineer, Facebook
Scribe is a flexible data transport system used widely across Facebook. Monitoring and operating a system at this scale is challenging, requiring dedicated systems whose sole goal is solving observability for Scribe. Dino and Cristina dive into two systems used to monitor Scribe’s vitals and shed light on the design trade-offs involved in building them.
Scaling observability data at Datadog
Jason Moiron, Software Engineer, Datadog
As systems scale, the data collected to understand their behavior grows in volume and complexity. Observability systems that empower insight face conflicting demands. For example, writes must be high throughput to absorb the immense volume, but queries must be low latency to be suitable for reactive operations. Furthermore, while query traffic is repetitive in aggregate, the important queries, those testing hypotheses in the crucible of an outage, are unique and unpredictable. Jason examines the difficulty of meeting these requirements under ever-growing load and the architectural trade-offs employed to support accelerating growth.
Scuba: Real-time monitoring and log analytics at scale
Harani Mukkala, Software Engineer, Facebook
Stavros Harizopoulos, Software Engineer, Facebook
Scuba is Facebook’s platform for real-time ingestion, processing, storage, and querying of structured logs from the entire fleet of machines. Scuba makes data available for querying in less than a minute and uses a massive fanout architecture to return query results in less than a second. Harani and Stavros discuss the monitoring, debugging, and ad hoc analytics use cases enabled by Scuba’s log-everything approach. They go into further detail about the system’s architecture and the trade-offs they’ve had to make to scale the platform. They conclude with plans for the future and the central role they envision Scuba playing in Facebook’s observability infrastructure.
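To make the log-everything model concrete, the sketch below shows the general shape of a structured log row (a flat map of columns) and an ad hoc filter-and-group query over such rows. The columns and query are hypothetical; in the real system, the query would fan out across many leaf servers and merge partial results.

```python
import time

# Hypothetical structured log rows: flat maps of columns, one per event.
rows = [
    {"time": int(time.time()), "server": "web123",
     "endpoint": "/feed", "latency_ms": 87, "error": None},
    {"time": int(time.time()), "server": "web456",
     "endpoint": "/feed", "latency_ms": 31, "error": None},
    {"time": int(time.time()), "server": "web123",
     "endpoint": "/upload", "latency_ms": 420, "error": "timeout"},
]

# An ad hoc question asked during debugging: where are the slow requests?
slow_by_endpoint = {}
for row in rows:
    if row["latency_ms"] > 50:
        key = row["endpoint"]
        slow_by_endpoint[key] = slow_by_endpoint.get(key, 0) + 1

print(slow_by_endpoint)  # {'/feed': 1, '/upload': 1}
```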
Developing meaningful SLIs for fun and profit
Alex Hidalgo, Senior SRE, Squarespace
Developing meaningful SLIs is not an easy task, but your error budgets and your SLOs are useful only if they’re informed by good SLIs. In this talk, Alex takes a deep dive into how to develop SLIs that actually reflect the journeys of your users. He starts by describing an example web service that should look familiar, identifies the low-hanging-fruit metrics you might be tempted to use, discusses their limitations, and concludes with concrete examples of what useful SLIs for such a system might actually look like.
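As a flavor of where such a talk typically lands, here is a minimal sketch of a user-centric SLI computed as the proportion of valid requests that were good. The 300 ms threshold, the request data, and the 99.9 percent SLO are all illustrative assumptions, not figures from the talk.

```python
# Hypothetical request log: (HTTP status, latency in ms).
requests = [(200, 120), (200, 310), (500, 90), (200, 80), (200, 250)]

# SLI: share of requests that were both successful and fast, which tracks
# what a user actually experienced rather than, say, raw CPU utilization.
good = sum(1 for status, latency_ms in requests
           if 200 <= status < 300 and latency_ms < 300)
sli = good / len(requests)

SLO = 0.999  # illustrative target
print(f"SLI = {sli:.3f} ({'meets' if sli >= SLO else 'misses'} the {SLO:.1%} SLO)")
```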