Running systems that serve millions (or even billions) of people can present unprecedented, complex engineering challenges. Last year, we launched the Systems @Scale conference to bring together engineers from various companies to discuss those challenges. At this year’s event, attendees gathered to hear speakers from Facebook, LinkedIn, Uber, and other companies discuss innovative solutions for large-scale information systems.
If you missed the event, you can view recordings of the presentations below. If you are interested in future events, visit the @Scale website or follow the @Scale Facebook page.
Benjamin Reed, Assistant Professor, San Jose State University
Ben discusses the genesis of Apache ZooKeeper and the lessons learned along the way. What began as a simple idea to create a coordination core that could be used by many different applications at Yahoo! has grown into an Apache open source project used by many companies and projects. Ben talks about how the ideas behind ZooKeeper were conceived, how it was initially received, and the team’s early experiences rolling it out to the world.
Apache Hive: From MapReduce to enterprise-grade big data warehousing
Jesus Camacho Rodriguez, Principal Software Engineer, Cloudera
In this talk, Jesus describes the innovations along Apache Hive’s journey from batch tool to full-fledged enterprise SQL data warehousing system. In particular, he shows how the community expanded the utility of the system by adding row-level transactional capabilities required for data modifications in star schema databases, introducing optimization techniques that are useful to handle today’s view hierarchies and big data operations, implementing the runtime improvements necessary to bring query latency and concurrency into the realm of interactive operation, and laying the groundwork for using Apache Hive as a relational front end to multiple storage and data systems. All these enhancements were introduced without ever compromising on the original characteristics that made the system popular.
Delos: Storage for the Facebook control plane
Mahesh Balakrishnan, Software Engineer, Facebook
Jason Flinn, Visiting Professor, Facebook, and Professor, University of Michigan
Delos is a new low-dependency storage system for control plane applications at the bottom of the Facebook stack. It supports a rich API (transactions, secondary indices, and range queries) and quorum-style guarantees on availability and durability. Delos is not just a storage system; it’s also an extensible platform for building replicated systems. Delos can be easily extended to support new APIs (e.g., queues or namespaces). Delos can also support entirely different ordering subsystems and switch between them on the fly to obtain performance and reliability trade-offs. As a result, new use cases with various APIs or performance requirements can be satisfied via new implementations of specific layers rather than by a wholesale rewrite of the system.
Delos: Simple, flexible storage for the Facebook control plane
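To make the layering idea concrete, here is a minimal sketch of an API layer that talks only to a logical log, so the ordering layer underneath can be swapped without touching the API. It is an illustration under stated assumptions, not Delos code: `ListLog`, `VirtualLog`, and `KVStore` are hypothetical names, and everything runs in a single process with no replication.

```python
class ListLog:
    """Hypothetical ordering layer: an append-only in-memory list."""

    def __init__(self):
        self.entries = []

    def append(self, entry):
        self.entries.append(entry)


class VirtualLog:
    """Presents one logical log while the underlying log can be swapped."""

    def __init__(self, log):
        self.sealed = []     # earlier logs, now read-only
        self.current = log   # log that receives new appends

    def append(self, entry):
        self.current.append(entry)

    def switch_to(self, new_log):
        # Seal the current log and direct future appends to the new one.
        self.sealed.append(self.current)
        self.current = new_log

    def read_all(self):
        # Replay sealed logs in order, then the current one.
        for log in self.sealed + [self.current]:
            yield from log.entries


class KVStore:
    """API layer: state is rebuilt purely by replaying the logical log."""

    def __init__(self, vlog):
        self.vlog = vlog

    def put(self, key, value):
        self.vlog.append(("put", key, value))

    def snapshot(self):
        state = {}
        for op, key, value in self.vlog.read_all():
            if op == "put":
                state[key] = value
        return state


vlog = VirtualLog(ListLog())
kv = KVStore(vlog)
kv.put("a", 1)
vlog.switch_to(ListLog())   # swap the ordering layer; the API layer is unchanged
kv.put("a", 2)
kv.put("b", 3)
```

In this toy version, a different API (a queue, for example) would be another class that replays the same logical log, which is the sense in which new use cases need new layers rather than a rewrite.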
Accordion: Better memory organization for LSM key-value stores
Eshcar Hillel, Senior Research Scientist, Verizon Media
Log-structured merge (LSM) stores have emerged as the technology of choice for building scalable write-intensive key-value storage systems. Though inherent to the LSM design, frequent compactions are a major pain point because they slow down data store operations, primarily writes, and increase disk wear. Another performance bottleneck in today’s state-of-the-art LSM stores, in particular ones that use managed languages like Java, is the fragmented memory layout of their dynamic memory store. In this talk, Eshcar shows that these pain points may be mitigated via better organization of the memory store. She also presents Accordion, an algorithm that addresses these problems by reapplying the LSM design principles to memory management. Accordion is implemented in the production code of Apache HBase, where it was extensively evaluated. Eshcar demonstrates Accordion’s double-digit performance gains versus the baseline HBase implementation and discusses some unexpected lessons learned in the process.
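The idea of applying LSM principles inside the memory store can be sketched roughly as follows. This is a toy illustration, not the HBase implementation: the class name, the two size limits, and the use of Python dicts and sorted lists are all assumptions. A small mutable segment absorbs writes, is periodically frozen into a flat sorted array, and the frozen arrays are merged in memory so that stale versions are dropped before anything would reach disk.

```python
from bisect import bisect_left


class InMemoryStore:
    """Toy memory store: one mutable segment plus immutable flat segments."""

    def __init__(self, active_limit=2, pipeline_limit=2):
        self.active = {}       # mutable segment: key -> latest value
        self.pipeline = []     # immutable sorted (key, value) lists, newest first
        self.active_limit = active_limit
        self.pipeline_limit = pipeline_limit

    def put(self, key, value):
        self.active[key] = value
        if len(self.active) >= self.active_limit:
            self._in_memory_flush()

    def _in_memory_flush(self):
        # Freeze the active segment into a compact sorted array.
        self.pipeline.insert(0, sorted(self.active.items()))
        self.active = {}
        if len(self.pipeline) > self.pipeline_limit:
            self._in_memory_compact()

    def _in_memory_compact(self):
        # Merge immutable segments, oldest first, so newer versions win.
        merged = {}
        for segment in reversed(self.pipeline):
            merged.update(segment)
        self.pipeline = [sorted(merged.items())]

    def get(self, key):
        if key in self.active:
            return self.active[key]
        for segment in self.pipeline:           # newest segment first
            i = bisect_left(segment, (key,))    # binary search on the key
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None


store = InMemoryStore()
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5), ("d", 6)]:
    store.put(key, value)
# After the third in-memory flush the three segments are merged into one,
# and the overwritten versions of "a" and "b" are gone.
```

The flat sorted arrays stand in for the less fragmented memory layout the abstract mentions; a real store would also bound total memory and eventually flush to disk, which this sketch omits.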
Observability infra, Uber and Facebook
Yuri Shkuro, Software Engineer, Uber
Michael Bevilacqua-Linn, Software Engineer, Facebook
Distributed tracing systems are a tried-and-true tool for understanding systems at scale, dating back over a decade to early research systems like X-Trace and Magpie, and popularized in industry by Google’s Dapper. Both Uber and Facebook operate large-scale distributed tracing systems, but each has a different focus. Uber’s Jaeger is used primarily as an observability tool, which gives engineers insight into failures in their microservices architecture, while Facebook has largely used its tracing system, Canopy, to get a detailed view of its web and mobile apps, including the creation of aggregate data sets with a built-in trace-processing system. In this talk, Yuri and Michael walk through Canopy’s built-in trace processing, as well as Uber’s use of traces for more automated root cause analyses of distributed failures.
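As a rough illustration of how traces can support root cause analysis, the sketch below treats a trace as a list of spans linked by parent IDs and flags the failed spans that have no failed child, on the assumption that errors propagate upward from the deepest failure. The `Span` fields and the heuristic are assumptions made for this example; they are not the Jaeger or Canopy data model or the method described in the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]   # None for the root span
    service: str
    error: bool = False


def likely_root_causes(spans):
    """Return services of failed spans that have no failed child span."""
    # IDs of spans that have at least one failed child.
    failed_parents = {s.parent_id for s in spans if s.error and s.parent_id}
    return [s.service for s in spans
            if s.error and s.span_id not in failed_parents]


trace = [
    Span("1", None, "api", error=True),
    Span("2", "1", "orders", error=True),
    Span("3", "2", "db", error=True),
    Span("4", "1", "auth"),
]
print(likely_root_causes(trace))   # prints ['db']
```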
Continuous deployment at Facebook scale
Boris Grubic, Software Engineer, Facebook
Fangfei Zhou, Software Engineer, Facebook
Continuous deployment is an important requirement for moving fast, given the scale at which Facebook operates. This presentation describes how Facebook solves different aspects of the problem and how all the components are connected to provide an efficient developer experience. Once a developer commits a change, our infrastructure automatically builds the binary, tests it in various ways, and safely deploys it across the fleet. Enabling this workflow for thousands of microservices involves a delicate balance of trade-offs, and so the presentation also calls out the design considerations that guided the evolution of the system over the years and what continuous deployment means for the future.
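One common way to deploy safely across a fleet is a staged rollout, sketched below. The talk does not specify Facebook’s mechanism, so this is an assumption: the function name, the stage fractions, and the `is_healthy` callback are all hypothetical. The new binary goes to a growing fraction of hosts, and the rollout halts at the first stage where a freshly updated host fails its health check.

```python
def staged_rollout(hosts, is_healthy, stages=(0.01, 0.1, 1.0)):
    """Deploy to growing fractions of the fleet; stop at the first bad stage.

    Returns (hosts that received the new binary, whether the rollout finished).
    """
    deployed = 0
    for fraction in stages:
        target = max(1, int(len(hosts) * fraction))
        batch = hosts[deployed:target]      # hosts newly updated in this stage
        deployed = max(deployed, target)
        if not all(is_healthy(h) for h in batch):
            return hosts[:deployed], False  # halt; remaining hosts keep the old binary
    return hosts[:deployed], True


fleet = [f"h{i}" for i in range(200)]
# Hypothetical health check: host h5 misbehaves on the new binary.
updated, finished = staged_rollout(fleet, lambda h: h != "h5")
# The failure surfaces in the 10% stage, so only 20 of 200 hosts are affected.
```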
Elaine Arbaugh, Senior Software Engineer, Affirm
As infrastructure grows, it’s critical to have observability into system performance and reliability in order to identify any current issues or potential future bottlenecks. In this talk, Elaine discusses how Affirm’s custom metrics, monitoring, and alerting systems work; how we’ve scaled them as our traffic and engineering teams have grown rapidly; and examples of scaling-related issues we’ve identified with them. She also discusses the instrumentation we’ve added around SQL queries, which has helped identify several issues that were causing excessive load on our servers and MySQL databases, as well as the tooling we’ve added to help devs optimize their queries. Elaine goes into detail about specific database and machine-level issues Affirm has faced, and how detection, diagnosis, escalation, and resolution were handled.
Enabling next-generation models for PYMK @Scale
Peter Chng, Senior Software Engineer, LinkedIn
Gaojie Liu, Staff Software Engineer, LinkedIn
The People You May Know (PYMK) recommendation service helps LinkedIn’s members identify other members they might want to connect with and is the major driver for growing LinkedIn’s social network. The principal challenge in developing a service like PYMK is dealing with the sheer scale of computation needed to make precise recommendations with high recall. This talk presents the challenges LinkedIn faced when bringing its next generation of models for PYMK to production. PYMK relies on Venice, a key-value store, for accessing derived data online in order to generate recommendations. However, the increasing amount of data that had to be processed in real time with our next-generation models required us to collaborate and codesign our systems with the Venice team to generate recommendations in a timely and agile manner while still being resource efficient. Peter and Gaojie describe this journey to LinkedIn’s current solution with an emphasis on how the Venice architecture evolved to support computation at scale, the lessons learned, and the plan to tackle the scalability challenges for the next phase of growth.
Scaling cluster management at Facebook with Tupperware
Kenny Yu, Software Engineer, Facebook
Tupperware is Facebook’s cluster management system and container platform, and it has been running in production since 2011. Today, Tupperware manages millions of containers, and almost all backend services at Facebook are deployed through Tupperware. In this talk, Kenny explores the challenges we encountered over the past eight years as we evolved our system to scale to our global fleet. He discusses scalability challenges we faced, stateful services and how we support them, our approach for opportunistic compute, and the challenges we are tackling next.
Efficient, reliable cluster management at scale with Tupperware
Preemption in Nomad — a greedy algorithm that scales
Nick Ethier, Software Engineer, HashiCorp
Michael Lange, Software Engineer, HashiCorp
Cluster orchestrators manage and monitor workloads that run in large fleets (tens of thousands) of shared compute resources. This talk is a technical deep dive into the challenge of keeping business-critical applications running by implementing preemption. The talk covers the challenges of implementing preemption for heterogeneous workloads, the algorithm HashiCorp designed, and how it is used. Nick and Michael conclude with remaining challenges and future work.
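A greedy preemption pass can be sketched in a few lines. This is a generic illustration and not the algorithm HashiCorp designed: it considers a single resource (CPU) on a single node, and the `Alloc` type and `pick_preemptions` function are hypothetical. Lower-priority allocations are evicted in order, lowest priority first and largest first within a priority, until the incoming job fits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alloc:
    name: str
    priority: int
    cpu: int


def pick_preemptions(allocs, free_cpu, needed_cpu, job_priority):
    """Return allocations to evict so the job fits, or None if it cannot fit."""
    if free_cpu >= needed_cpu:
        return []                       # fits without preempting anything
    # Only strictly lower-priority work may be evicted.
    candidates = sorted(
        (a for a in allocs if a.priority < job_priority),
        key=lambda a: (a.priority, -a.cpu),
    )
    chosen, reclaimed = [], free_cpu
    for alloc in candidates:
        chosen.append(alloc)
        reclaimed += alloc.cpu
        if reclaimed >= needed_cpu:
            return chosen
    return None                         # even evicting every candidate is not enough


node = [
    Alloc("batch-a", 20, 500),
    Alloc("batch-b", 20, 300),
    Alloc("web", 70, 400),
    Alloc("cron", 10, 200),
]
evict = pick_preemptions(node, free_cpu=100, needed_cpu=700, job_priority=50)
print([a.name for a in evict])   # prints ['cron', 'batch-a']
```

A greedy pass like this is fast and simple but not optimal: it may evict more capacity than strictly necessary, which is the usual trade-off when the scheduler has to decide quickly across many nodes.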
Disaster recovery at Facebook scale
Shruti Padmanabha, Research Scientist, Facebook
Justin Meza, Research Scientist, Facebook
Facebook operates dozens of data centers globally, each of which serves thousands of interdependent microservices to provide seamless experiences to billions of users across the family of Facebook products. At this scale, seemingly rare occurrences, from hurricanes looming over a data center to lightning striking a switchboard, have threatened the site’s health. These events cause large-scale machine failures at the scope of a data center or significant portions of it, which cannot be addressed by traditional fault-tolerance mechanisms designed for individual machine failures. Handling these failures requires us to develop solutions across the stack, from placing hardware and spare capacity across fault domains to being able to shift traffic smoothly away from affected fault domains to rearchitecting large-scale distributed systems in a fault-domain-aware manner. In this talk, Shruti and Justin describe principles Facebook follows for designing reliable software, tools we built to mitigate and respond to failures, and our continuous testing and validation process.
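The interplay between traffic shifting and spare capacity can be illustrated with a small calculation. The sketch below is an assumption-laden toy, not Facebook’s tooling: the `drain` function is hypothetical, traffic and capacity are expressed as fractions of total site traffic, and the failed domain’s share is redistributed in proportion to each remaining domain’s current share.

```python
def drain(weights, capacity, failed):
    """Shift a failed domain's traffic to the others, proportionally.

    Returns the new traffic shares and the domains pushed past capacity.
    """
    remaining = {d: w for d, w in weights.items() if d != failed}
    total = sum(remaining.values())
    if total == 0:
        raise ValueError("no healthy fault domain left to absorb traffic")
    shifted = {d: w / total for d, w in remaining.items()}
    overloaded = sorted(d for d, w in shifted.items() if w > capacity[d])
    return shifted, overloaded


weights = {"dc1": 0.5, "dc2": 0.25, "dc3": 0.25}    # current traffic shares
capacity = {"dc1": 0.6, "dc2": 0.6, "dc3": 0.4}     # maximum share each can serve
shifted, overloaded = drain(weights, capacity, failed="dc1")
# shifted is {'dc2': 0.5, 'dc3': 0.5}; dc3 lacks the spare capacity for its
# new share, which is the kind of gap capacity planning has to close in advance.
```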