Recently, we hosted our first-ever Systems @Scale conference. Held at Facebook’s Menlo Park campus, the event brought engineers from various companies to discuss the challenge of managing large-scale information systems serving millions or even billions of people.
More than 300 attendees gathered to hear Facebook VP of Engineering Jay Parikh’s keynote about how we’ve gotten faster while growing larger, followed by a day of stellar talks by experts from Amazon, Facebook, Google, Lyft, Oath, and Shopify. Open discussions included the building blocks for managing stateful applications, novel approaches to scaling systems, and lessons learned from software rollouts.
To watch the presentations, view the videos below. If you’re interested in joining the next event, visit the @Scale website or join the @Scale community.
Keynote
Jay Parikh, Vice President of Engineering, Facebook
Facebook now updates its core software at least 10 times more often than it did 10 years ago, and those updates happen faster now, despite huge growth in the number of servers, engineers, and users. Jay explains how Facebook manages to keep building and adding features to a product that makes an impact on more than a billion people daily — without disrupting operations.
Geo-Replication in Amazon DynamoDB Global Tables
Doug Terry, Senior Principal Technologist, Amazon AWS
DynamoDB is a NoSQL cloud database in which tables can scale from a few items to hundreds of terabytes. A recently launched feature adds support for global tables that are replicated in AWS regions throughout the world. Doug talks about the challenges of designing this fully managed global service while maintaining DynamoDB’s key properties of elastic scale, high availability, and predictable performance.
Holding It in @Scale: The systemd-tails on Containers, Composable Services, and Runtime Fun Times
Madelaine Boyd + Lindsay Salisbury, Software Engineer + Production Engineer, Facebook
Madelaine and Lindsay talk about Facebook’s deployment of images, give the dish on btrfs, and explain how systemd works as a low-cognitive-overhead runtime for containers: it walks like a namespace, talks like a host, and manages processes and resources like a runtime. They also explain why composable, self-contained services offer advantages over inheritance and layering models.
Kubernetes Application Migrations: How Shopify Moves Stateful Applications Between Clusters and Regions
Ian Quick, Production Engineering Lead, Shopify
Maintenance and resiliency require that applications move between clusters and regions, and eventually to cloud providers. Coordinating failovers across in-cluster and out-of-cluster resources requires careful orchestration and trade-offs. Ian shares how Shopify manages high availability and maintenance through low-downtime migrations for stateful applications.
Resolving Outages Faster with Better Debugging Strategies
Liz Fong-Jones, Staff Site Reliability Engineer, Google
Engineers spend a lot of time building dashboards to improve monitoring, but whenever they are paged, they still spend a lot of time figuring out what’s going on and how to fix it. Liz explains why building more dashboards isn’t the solution — using dynamic query evaluation and integrating tracing is.
Scaling Data Distribution at Facebook Using LAD
Rounak Tibrewal + Ali Zaveri, Software Engineers, Facebook
Location-Aware Distribution (LAD) is a new data distribution framework built and deployed at Facebook. Rounak and Ali describe the bottlenecks that a prior data distribution system faced, present the design for LAD, which leverages peer-to-peer transfers, and share lessons learned from launching it in production.
Lyft’s Envoy: Embracing a Service Mesh
Matt Klein, Software Engineer, Lyft
Over the past several years, Lyft faced considerable operational difficulties with its initial microservice deployment, primarily rooted in networking and observability, and migrated to a sophisticated service mesh powered by Envoy. Matt explains why Lyft developed Envoy, focusing primarily on the operational agility that the burgeoning service mesh paradigm provides, with a particular look at microservice networking observability. He also covers future directions for Envoy at Lyft.
Rebuilding the Inbox: The Corporate and Cultural Adventures of Modernizing Yahoo Mail
Jeff Bonforte, Senior Vice President of Communication Products, Data, and Research, Oath
Changing very large systems in production can be quite a challenge. Sometimes you need to radically modernize and innovate, but at the same time serve your users, release incremental features to remain competitive, and keep the business running. You might have the best team of engineers and significant resources, but making changes will still be difficult. In this talk, Jeff shares lessons learned from modernizing the entire Yahoo Mail stack.
Web@FB: Releasing Facebook.com Quickly, Continuously, and Reliably
Anca Agape, Software Engineer, Facebook
Facebook’s Web Service, composed of thousands of servers running millions of lines of Hack code, is one of the largest monolithic services in existence. Anca presents some aspects of managing the web tier (the servers and software that power Facebook.com) and giving thousands of Facebook engineers a safe way to move quickly at scale. She details how Facebook pushes code out many times per day and discusses the lessons learned after a year of continuous release. She also explains the deployment of configuration changes on the web tier and the automation that empowers developers to make changes quickly yet safely. Finally, she dives into how Facebook.com is monitored to sustain high reliability in the face of rapid code, configuration, and environmental changes.
Compute-as-a-Service: Moving Toward a Global Shared Fleet
Ben Christensen, Software Engineer, Facebook
As Facebook has added more and more services, the operational complexity and the amount of human intervention required to support them have increased. Ben explains how Facebook is working to evolve the systems that support these services to the next order of magnitude. He shares a vision for a global shared fleet of servers to remove the friction involved in service management.