Recently, we hosted Data @Scale, an invitation-only technical conference for engineers working on large-scale storage systems and analytics. Facebook engineering manager Seth Silverman and software engineer Laney Zamore kicked things off at the State Room in downtown Boston.
The event brought together engineering leaders from Akamai, Datadog, DataXu, Facebook, Google, HubSpot, InfluxData, OM1, Starburst, TCB Analytics, and Wayfair to discuss the challenges associated with building and operating data systems that serve millions or even billions of people. More than 300 attendees gathered to hear talks on topics such as data deletion, protecting patient privacy, machine learning on Kubernetes, and building data pipelines at scale.
Summaries and videos of some of the presentations from the event are below. If you’re interested in joining the next event, visit the @Scale website or join the @Scale community.
Protecting Patient Privacy While Using Real-World Evidence
Philip Wickline, Chief Technology Officer at OM1
The growing availability of health care data is changing the face of clinical research and the practice of medicine. Philip explores the challenges that OM1 faces in protecting the privacy of patients while working with massive amounts of patient data. He provides an overview of the ways patient data is represented and the solutions OM1 employs for working with large-scale health care data sets while still preserving patient privacy.
Balancing Flexibility and Control with Database Deployments
Tanya Cashorali, Chief Executive Officer at TCB Analytics
Tanya walks through several big data environments and the attributes that set them apart. She examines the positive and negative effects of organizational constraints on technology choices, and how those constraints drive decisions about on-premises vs. cloud-based infrastructure, database technologies, and deployment strategies. She concludes with recommendations for striking a balance between flexibility and control.
Lessons and Observations Scaling a Timeseries Database
Ryan Betts, Director of Platform Engineering at InfluxData
Ryan presents several lessons learned while scaling InfluxDB across a large number of deployments — from single-server open source instances to highly available, high-throughput clusters. He walks through a number of failure conditions that informed subsequent design choices, discusses trade-offs between monolithic and service-oriented database implementations, and closes with personal experiences from implementing multiple query-processing systems.
Leveraging Sampling to Reduce Data Warehouse Resource Consumption
Gabriela Jacques Da Silva, Software Engineer at Facebook
Donghui Zhang, Software Engineer at Facebook
The volume of data processed by Facebook’s analytics workload has been rapidly increasing, resulting in greater compute and storage demands. Gabriela and Donghui share their work using sampling as a technique to offset that demand while still providing good approximate query results. They also discuss the challenges this poses for approximate computation, such as the need to account for uncertainty propagation when calculating aggregated metrics. Finally, they show the resulting savings in both compute and storage.
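As a rough illustration of the idea (not code from the talk), the Python sketch below estimates a SUM from a uniform sample, scales it by the inverse sampling rate, and carries a simple standard-error bound alongside the aggregate; the sampling rate and error formula are assumptions chosen for the example.

```python
import math
import random

def approx_sum(values, sampling_rate=0.01, seed=42):
    """Estimate sum(values) from a uniform random sample.

    Each sampled value is weighted by 1 / sampling_rate, and a rough
    standard error is propagated so that aggregated metrics built on top
    of the estimate can report uncertainty alongside the answer.
    """
    rng = random.Random(seed)
    sample = [v for v in values if rng.random() < sampling_rate]
    estimate = sum(sample) / sampling_rate
    # Variance of the weighted estimator under independent sampling:
    # each sampled value contributes v^2 * (1 - p) / p^2.
    variance = sum(v * v for v in sample) * (1 - sampling_rate) / sampling_rate ** 2
    return estimate, math.sqrt(variance)

values = list(range(1, 1_000_001))
estimate, stderr = approx_sum(values)
print(f"exact={sum(values):,}  estimate={estimate:,.0f}  stderr={stderr:,.0f}")
```

At a 1 percent sampling rate the estimate lands within roughly one percent of the exact sum, which is the trade-off the talk describes: far less data scanned in exchange for a bounded, reportable error.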
Voting with Witnesses the Apache Cassandra Way
Ariel Weisberg, PMC Member at Apache Cassandra
NOTE: This session has no recording.
Ariel explores the trade-offs and benefits of introducing Transient Replication, an adaptation of Witness Replicas, into Apache Cassandra. He starts with an overview of existing replication techniques and explains how Transient Replication and an optimization called Cheap Quorums enable up to 50% disk space and compute savings under non-failure conditions. He concludes with a look at mitigation techniques that allow Transient Replication to perform well even under failure conditions.
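To make the intuition concrete, here is a deliberately simplified Python sketch of the Cheap Quorums idea, assuming a toy replica model rather than Apache Cassandra’s actual write path: data lands only on full replicas while they are healthy, and a transient replica accepts a write only when the full replicas cannot reach quorum on their own.

```python
class Replica:
    """Toy in-memory replica that can be marked down to simulate failure."""
    def __init__(self, name, up=True):
        self.name, self.up, self.data = name, up, {}

    def write(self, key, value):
        if not self.up:
            return False
        self.data[key] = value
        return True


def cheap_quorum_write(key, value, full_replicas, transient_replicas, quorum):
    """Write to full replicas first; touch transient replicas only on failure."""
    acks = sum(r.write(key, value) for r in full_replicas)
    # Under non-failure conditions the loop below never runs, so transient
    # replicas store nothing; that is where the space savings come from.
    for replica in transient_replicas:
        if acks >= quorum:
            break
        acks += replica.write(key, value)
    return acks >= quorum


# One full replica is down, so the transient replica steps in to reach quorum.
fulls = [Replica("full-1"), Replica("full-2", up=False)]
transients = [Replica("transient-1")]
print(cheap_quorum_write("k", "v", fulls, transients, quorum=2))  # True
```

The repair step, in which a transient replica later hands its data back to a recovered full replica, is omitted here; the sketch only shows why healthy clusters avoid paying the full replication cost.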
Deleting Data @ Scale
Ben Strahs, Software Engineer, Privacy & Data Use at Facebook
Deletion is critical to helping people control their data. It has unique technical challenges at scale, including managing deletion across distributed systems and building in mechanisms to confirm completeness and accuracy. In this talk, Ben shares Facebook’s Deletion Framework, a system Facebook built to automatically detect gaps, ensure completeness, and make sure the correct data is deleted.
Scaling Data Plumbing at Wayfair
Ben Clark, Chief Architect at Wayfair
Ben describes how Wayfair transformed its data plumbing from a hodgepodge of underengineered components into a scalable infrastructure capable of handling the superlinear growth of its Kafka clusters and destination data stores. He outlines the factors that led to the decision to rewrite components and finishes with an overview of Tremor, a traffic shaper and router that is replacing Logstash.
Kubeflow: Portable Machine Learning on Kubernetes
Michelle Casbon, Senior Engineer at Google
Michelle presents Kubeflow, a framework on Kubernetes that provides a single, unified way to run common processes such as model training, evaluation, and serving, along with monitoring, logging, and other operational tooling. She discusses three challenges in developing ML applications: scalability, portability, and composability. She wraps up with a demo of a simple use case running locally, on premises, on a cloud platform, and on specialized GPU hardware.
How DataXu Built a Cloud-Native Warehouse
Suchi Raman, Director, Product Development at DataXu
Suchi shares the lessons learned during DataXu’s multiyear journey to the cloud. She starts with an overview of the company’s previous on-premises analytics infrastructure and how it evolved into the current “cloud native” warehouse architecture built on AWS Glue Data Catalog, Athena (Presto-as-a-service), and Lambda. She wraps up with a discussion of how DataXu leveraged AWS spot instances to significantly reduce the company’s daily operational costs.
Migrating Elasticsearch Instances at Scale
Patrick Dignan, Technical Lead at HubSpot
Patrick walks through the evolution of the Elasticsearch infrastructure at HubSpot. He starts by describing the challenge HubSpot faced in tightening the security of its data systems to meet GDPR requirements. From there, he explores the techniques the data infrastructure team used to seamlessly migrate teams’ indices from the old cluster to the new secure cluster. By moving to a centrally managed ingestion pipeline, HubSpot significantly reduced cluster migration time and, as a result, improved testability, maintainability, and reliability.
Presto: Pursuit of Performance
Andrii Rosa, Software Engineer at Facebook
Matt Fuller, VP of Engineering at Starburst
Andrii and Matt present the Presto Cost-Based Optimizer (CBO), recently introduced by Starburst, along with a case study on integrating the Presto CBO at Facebook scale. Matt provides a brief overview of the Presto architecture, followed by a discussion of how joins work in Presto and how the CBO can improve join efficiency through automatic join type selection, automatic left/right side selection, and join reordering based on selectivity estimates and cost. In the second half of the talk, Andrii explores a new mechanism in Presto that computes statistics seamlessly and efficiently, ensuring that all Presto-generated data is ready for the CBO without extra manual steps. They close with future work on enhancing the CBO and statistics collection in Presto.
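As a hedged illustration of the kind of decision a cost-based optimizer makes (this is not Presto code), the sketch below uses estimated row counts and filter selectivities to pick the build side of a hash join; the table statistics and cost model are invented for the example.

```python
def estimated_rows(table_rows, filter_selectivity):
    """Estimate rows surviving a filter from table stats and a selectivity guess."""
    return table_rows * filter_selectivity

def choose_build_side(left_rows, right_rows):
    """A distributed hash join builds an in-memory table on one side and streams
    the other; building on the side with fewer estimated rows keeps memory low."""
    return "left" if left_rows <= right_rows else "right"

# Example: a huge but heavily filtered fact table vs. a smaller dimension table.
orders = estimated_rows(table_rows=10_000_000_000, filter_selectivity=0.001)  # ~10M rows
customers = estimated_rows(table_rows=50_000_000, filter_selectivity=1.0)     # 50M rows
print(choose_build_side(orders, customers))  # "left": the filtered side is smaller
```

Without statistics, an optimizer would have to assume the raw table sizes and build on the wrong side; with them, the same query plan can flip sides, broadcast, or reorder joins automatically, which is the behavior the talk describes.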
Palisade: Overload Protection for Data Analytics and Storage Systems
Aniruddha Bohra, Principal Architect at Akamai
Aniruddha presents two systems — Palisade and Akamill — that were built at Akamai to address the challenges of safely collecting large volumes of data at low latency for data processing and analytics. He starts with an overview of Palisade, an overload protection system for reporting and analytics data processing infrastructures. He then discusses Akamill, a flexible stream processing system used by Palisade that provides buffering, connection pooling, and transformations to support near real-time data collection for applications. He concludes with a view into how the two systems combine to smooth out spikes, using thinning and throttling.
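For readers unfamiliar with the terms, the following generic Python sketch (not Akamai’s implementation) shows what thinning and throttling look like in their simplest form: sampling away a fraction of events, then rate-limiting the remainder with a token bucket.

```python
import random
import time

def thin(events, keep_fraction=0.5, rng=random.Random(0)):
    """Thinning: probabilistically drop events to cut volume during a spike."""
    return [e for e in events if rng.random() < keep_fraction]

class Throttle:
    """Throttling: token bucket that forwards at most `rate` events per second."""
    def __init__(self, rate, burst):
        self.rate, self.capacity = rate, burst
        self.tokens, self.last = float(burst), time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# Thin a burst of 10,000 events, then forward only what the throttle allows.
survivors = thin(list(range(10_000)), keep_fraction=0.1)
throttle = Throttle(rate=1_000, burst=100)
forwarded = [e for e in survivors if throttle.allow()]
print(len(survivors), len(forwarded))
```

The two knobs serve different goals: thinning trades accuracy for volume, while throttling preserves every accepted event but spreads delivery out over time, which is how spikes get smoothed before they reach the analytics and storage systems.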
Building Highly Reliable Data Pipelines at Datadog
Jeremy Karn, Staff Data Engineer at Datadog
Jeremy walks through the best practices Datadog uses to ensure that trillions of data points are processed every day. He starts by describing the technology stack and the tools used for authoring and running Datadog’s pipelines. He then talks about the use of ephemeral clusters, the benefits of job isolation, and the ability to scale clusters based on job needs. Finally, he discusses the mechanisms used to recover quickly from unavoidable issues such as hardware failures, increased load, upstream delays, and bad code changes.