What the research is:
A scalable service management platform for Facebook’s stream processing services. Turbine is designed to bridge the gap between Facebook’s stream processing requirements and the capabilities of existing general-purpose cluster management frameworks such as Tupperware. In production for several years now, Turbine has enabled a boom in stream processing at Facebook.
Turbine is currently deployed on clusters spanning tens of thousands of machines, managing several thousand streaming pipelines that process hundreds of gigabytes of data per second in real time. Our production experience has validated Turbine’s effectiveness in balancing workload fluctuations across clusters, predictively handling unplanned load spikes, and completing high-scale updates consistently and efficiently within minutes.
How it works:
Turbine features a fast and scalable task scheduler; an efficient predictive autoscaler; and an update mechanism that provides fault tolerance, atomicity, consistency, isolation, and durability.
The scheduling mechanism provisions and manages stream processing tasks using two-level placement: it first maps stream processing tasks onto shards using a hashing mechanism and then places those shards onto containers using Facebook’s general-purpose shard manager. Shards are periodically load balanced, and Turbine provides mechanisms to safely reschedule stream processing tasks when that happens. Turbine also performs failure handling to ensure that failures do not corrupt data, lose data, or duplicate processing.
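To make the first level of placement concrete, here is a minimal Python sketch of the task-to-shard hashing step. The shard count, task IDs, and function names are illustrative assumptions rather than Turbine’s actual API; the second level (shards onto containers) would be delegated to the shard manager.

```python
import hashlib
from collections import defaultdict

# Illustrative shard count; real deployments choose this per cluster.
NUM_SHARDS = 1024

def task_to_shard(task_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Deterministically map a stream processing task to a shard.

    Hypothetical stand-in for Turbine's hashing step: any stable hash
    works, as long as every scheduler instance computes the same mapping.
    """
    digest = hashlib.md5(task_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def group_tasks_by_shard(task_ids):
    """First placement level: tasks -> shards. The second level
    (shards -> containers) would be delegated to the shard manager."""
    shards = defaultdict(list)
    for task_id in task_ids:
        shards[task_to_shard(task_id)].append(task_id)
    return dict(shards)

# Tasks land on stable shards no matter which scheduler instance
# computes the placement, so rescheduling stays deterministic.
placement = group_tasks_by_shard(
    ["pipelineA-task0", "pipelineA-task1", "pipelineB-task0"]
)
```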
Turbine’s autoscaler provides proactive and preactive mechanisms for automatically adjusting resource allocation along multiple dimensions (e.g., CPU, memory, disk). The autoscaler estimates the resources needed for a given stream processing job and then scales the number of stream processing tasks, as well as the resources allocated per task, up or down to meet service-level objectives (SLOs). Based on the effect of these scaling decisions (as well as historical workload patterns), the autoscaler can also iteratively adjust its original resource estimates.
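As a rough illustration of this estimation loop, the sketch below derives a task count from an input-rate estimate and then folds observed usage back into the estimate. The metric names, headroom factor, and smoothing rule are assumptions for illustration, not values from the paper.

```python
import math

def plan_task_count(input_rate_eps: float,
                    per_task_throughput_eps: float,
                    headroom: float = 1.2) -> int:
    """Estimate how many tasks a job needs to keep up with its input.

    input_rate_eps and per_task_throughput_eps are hypothetical metrics
    (events per second); headroom pads the estimate so a predicted spike
    does not immediately breach the SLO.
    """
    return max(1, math.ceil(input_rate_eps * headroom / per_task_throughput_eps))

def refine_estimate(previous: float, observed: float, alpha: float = 0.5) -> float:
    """Fold observed usage back into the prior estimate, mimicking how the
    effect of past scaling decisions iteratively corrects the original guess."""
    return (1 - alpha) * previous + alpha * observed

# Example: a job ingesting ~50k events/s where one task sustains ~4k events/s
# is scaled to 15 tasks (12.5 tasks of raw demand plus 20% headroom).
tasks = plan_task_count(input_rate_eps=50_000, per_task_throughput_eps=4_000)
```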
Additionally, Turbine provides an atomic, consistent, isolated, durable, and fault-tolerant application update mechanism. This capability is important because multiple actors (a provisioner service, the autoscaler, a human operator) may concurrently update the same stream processing job, and the system must ensure that these updates are isolated and consistent. Turbine uses a hierarchical job configuration structure and merges updates from multiple actors based on which actor takes precedence. By separating planned updates from actual updates, Turbine provides atomic, durable, and fault-tolerant updates. A state sync service synchronizes the expected and running job configurations, rolling back and retrying if an update fails.
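The sketch below illustrates both ideas under stated assumptions: merging per-actor configuration layers by a fixed precedence order, and one sync round that drives the running configuration toward the planned one, rolling back on failure. The actor names, precedence order, and API are hypothetical, not Turbine’s actual interfaces.

```python
# Later entries in the precedence list win; a human operator's overrides
# are assumed here to take precedence over automated actors.
PRECEDENCE = ["provisioner", "autoscaler", "operator"]

def merge_config(layers: dict) -> dict:
    """Merge per-actor configuration layers into one effective job config.

    Each actor writes only its own layer; conflicts are resolved by the
    fixed precedence order rather than by last-writer-wins races.
    """
    merged = {}
    for actor in PRECEDENCE:
        merged.update(layers.get(actor, {}))
    return merged

def sync_once(expected: dict, running: dict, apply_update) -> dict:
    """One round of state synchronization: drive the running config toward
    the expected (planned) config, rolling back on failure so a partial
    update is never left visible. A later round retries."""
    if expected == running:
        return running
    snapshot = dict(running)
    try:
        apply_update(expected)
        return expected
    except Exception:
        apply_update(snapshot)  # roll back to the pre-update snapshot
        return snapshot

layers = {
    "provisioner": {"tasks": 10, "memory_gb": 4},
    "autoscaler":  {"tasks": 15},      # scale-out decision
    "operator":    {"memory_gb": 8},   # human override takes precedence
}
expected = merge_config(layers)  # {'tasks': 15, 'memory_gb': 8}
```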
Why it matters:
The past decade has seen rapid growth of large-scale distributed stream processing in the industry. Similar trends are occurring at Facebook, with many use cases adopting distributed stream processing for search effectiveness signals, recommendation-related activities, low-latency analysis of site content interaction, and other applications. Running these applications in real time at Facebook scale requires meeting strict SLOs. For instance, many stream processing applications at Facebook require a 90-second end-to-end latency guarantee.
A management platform like Turbine is beneficial in maintaining the strict SLOs demanded by these use cases. Turbine is a good example of how to engineer a solution with loosely coupled microservices responsible for answering what to run (job management), where to run (task management), and how to run (resource management). This work yields a highly scalable and highly resilient management infrastructure capable of supporting a large number of streaming pipelines processing a large amount of data with minimal human oversight.
We hope that the lessons we’ve learned while developing and testing Turbine will help other stream processing frameworks in the industry and motivate further research to continue advancing the state of the art in this area.
Read the full paper:
Turbine: Facebook’s service management platform for stream processing
We’d like to thank Yuan Mei, Luwei Cheng, Vanish Talwar, Michael Y. Levin, Gabriela Jacques-Silva, Nikhil Simha, Anirban Banerjee, Brian Smith, Tim Williamson, Serhat Yilmaz, Weitao Chen, and Guoqiang Jerry Chen for their work on Turbine.