Networking solutions are critical for building applications and services that serve billions of people around the world. Building and operating such large-scale networks often presents complex engineering challenges. At Networking @Scale 2019, attendees gathered to hear engineers from Amazon, Cloudflare, Facebook, Google, Microsoft, and Netflix discuss these challenges.
This year’s conference focused on a theme of reliable networking at scale. Speakers talked about various systems and processes they have developed to improve overall network reliability or to detect network outages quickly. They also shared stories about specific network outages, how they were immediately handled, and some general lessons for improving failure resiliency and availability. Networking @Scale 2019 also included an Inclusion @Scale session, sponsored by Powerful Women in Tech, in which attendees were encouraged to become allies and to help make workplaces more inclusive.
If you missed the event, you can view recordings of the presentations below. If you are interested in future events, visit the @Scale website, follow the @Scale Facebook page, or join the Networking @Scale attendees Facebook group.
Secure reliability: Tales from mysterious platforms
Jade Auer, Production Engineer, Facebook
Jose Leitao, Network Operations Engineer, Facebook
Infrastructure is a crucial component of maintaining system availability. Jade and Jose examine the role security plays in the reliability of infrastructure, accounting for issues that have not been considered by design engineers and cannot be solved by software. They cover the concept of reliability design and explore the goals and trade-offs in architecting design systems. In doing so, they touch on topics and examples around incident response and analysis. Drawing from an exploration of Facebook’s optical platforms, they further convey factors that can hamper the reliability of design alongside services provided. Jade and Jose conclude their presentation with key takeaways and recommendations that participants can take back to their respective organizations.
Failing last and least: Design principles for network availability
Amin Vahdat, Vice President, Google
The network is among the most critical components of any computing infrastructure. It is an enabler for modern distributed systems architecture with a trend toward ever-increasing functionality and offloads moving into the network. As such, it must continually be expanded and reconfigured to deploy compute and storage infrastructure. Most important, the network must deliver the highest levels of reliability. Drawing from his experience building some of the largest networks at Google, Amin discusses the importance of network reliability, the leading causes of failure, and the design principles key to delivering necessary levels of reliability. Amin charts a course toward the potential for common community infrastructure as crucial to advancing network reliability.
BGP++ deployment and outages
Jingyi Yang, Software Engineer, Facebook
With Facebook’s increasing user base (around 2.5 billion monthly active users), its data center network is growing fast and FBOSS switches are being provisioned daily, resulting in an urgent need for a scalable and reliable routing solution. With existing open source BGP solutions, users have to deal with many unused networking features and the complexity of managing them. Jingyi discusses Facebook’s routing agent BGP++, the motivations and trade-offs for building and scaling a new in-house BGP agent, and how the team deployed the agent at scale. In particular, she looks at challenges faced during deployment and efforts to expand testing and push tooling and performance.
Preventing network changes from becoming network outages
Dave Maltz, Distinguished Engineer, Microsoft
Network changes must be executed without compromising production traffic, so every change needs to be thoroughly tested before it is implemented. However, at cloud scale, there is no separate full-scale network that can be used to test these changes. Dave describes how Microsoft built the Open Network Emulator (ONE) system to be bug-compatible with the software running on its production switches, and how the emulator was integrated into various processes to validate and deploy all changes. He also shows concrete examples of the benefits of using ONE to run through major network migrations in emulation, ahead of the actual event.
What we have learned from bootstrapping 1.1.1.1
Marek Vavrusa, Systems Engineer, Cloudflare
When Cloudflare launched its public recursive DNS service in April 2018, its engineers had no idea how useful it would be. Marek describes what the team learned from the launch, covering the problems encountered in bootstrapping the service in the context of its evolving architecture. He also conveys what should have been done differently, in light of what is known after more than a year of operating the service.
Operating Facebook’s SD-WAN network
Shuqiang Zhang, Software Engineer, Facebook
Palak Mehta, Network Infrastructure Engineer, Facebook
Facebook’s centrally controlled wide-area network connects data centers serving billions of users. Its design faced several challenges, resulting in a holistic, centrally controlled network solution called Express Backbone (EBB). Shuqiang and Palak focus on operational challenges and solutions discovered on EBB. They describe high-level designs and recent updates, then dive into reliability challenges and their achievement of operational reliability with software, specifically synchronizing sharding and replication with the network design. They capture how they solved challenges specific to software-defined networks, shedding light on how a multidisciplinary Facebook team of production engineers and software engineers operates Facebook’s software-defined network.
Getting a taste of your network
Sergey Fedorov, Senior Software Engineer, Netflix
Sergey describes a client-side network measurement system called Probnik, and discusses how it can be used to improve performance, reliability, and control of client-server network interactions.
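The general idea of client-side measurement can be sketched in a few lines of Python: the client times a small probe against several candidate endpoints and reports which one responded fastest. This is only an illustrative sketch and does not reflect Probnik's actual API or design; the names `measure` and `fastest`, the `probe` callable, and the target labels are all hypothetical.

```python
import statistics
import time
from typing import Callable, Dict, Iterable, Optional


def measure(probe: Callable[[str], None],
            targets: Iterable[str],
            samples: int = 3) -> Dict[str, float]:
    """Return the median probe duration in seconds for each reachable target.

    `probe` is a hypothetical callable that performs one request against a
    target and raises an exception on failure.
    """
    results: Dict[str, float] = {}
    for target in targets:
        timings = []
        for _ in range(samples):
            start = time.perf_counter()
            try:
                probe(target)
            except Exception:
                continue  # a failed attempt contributes no timing
            timings.append(time.perf_counter() - start)
        if timings:  # targets that never succeed are left out entirely
            results[target] = statistics.median(timings)
    return results


def fastest(results: Dict[str, float]) -> Optional[str]:
    """Pick the target with the lowest median duration, if any succeeded."""
    return min(results, key=results.get) if results else None
```

In practice the probe would issue a real network request, and the measurements would be reported to a backend rather than used only locally; both are left out here to keep the sketch self-contained.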
Enforcing encryption @scale
Mingtao Yang, Software Engineer, Facebook
Ajanthan Asogamoorthy, Software Engineer, Facebook
Facebook runs a global infrastructure that supports thousands of services, with many new ones spinning up daily. Protecting network traffic is taken very seriously, and engineers must have a sustainable way to enforce security policies transparently and globally. One requirement is that all traffic that crosses “unsafe” network links must be encrypted with TLS 1.2 or above using secure modern ciphers and robust key management. Mingtao and Ajanthan describe the infrastructure they built for enforcing the “encrypt all” policy on the end hosts, as well as alternatives and trade-offs, including how they use BPF programs. Additionally, they discuss Transparent TLS (TTLS), a solution that they’ve built for services that could not enable TLS natively or could not easily upgrade to a newer version of TLS.
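The “TLS 1.2 or above with modern ciphers” requirement can be illustrated for a single client using Python's standard `ssl` module. This is a minimal sketch and not the enforcement infrastructure described in the talk, which operates on end hosts across a fleet; the function name and the specific cipher string are assumptions chosen for illustration.

```python
import ssl


def make_strict_client_context() -> ssl.SSLContext:
    """Build a client-side TLS context that refuses anything below TLS 1.2."""
    # create_default_context() enables certificate and hostname verification.
    ctx = ssl.create_default_context()
    # Reject handshakes that would negotiate TLS 1.1 or older.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # Assumed policy: for TLS 1.2, allow only forward-secret AEAD suites.
    # (TLS 1.3 suites are configured separately by OpenSSL and stay enabled.)
    ctx.set_ciphers("ECDHE+AESGCM:ECDHE+CHACHA20")
    return ctx
```

A context built this way can be passed to `ssl.SSLContext.wrap_socket` or to HTTP clients that accept an `SSLContext`, so that a connection to a peer offering only older protocol versions fails at the handshake.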
Safe: How AWS prevents and recovers from operational events
Colm MacCarthaigh, Senior Principal Engineer, Amazon
Security and operational excellence are top priorities at AWS. Drawing from 25 years of experience operating Amazon.com and AWS, engineers have refined procedures and techniques for reducing the incidence, duration, and severity of operational events. Colm shares many lessons that apply before, during, and after an event, and dives into their SAFE protocol for incident response and their Correction of Error process for rigorously analyzing events to ensure lessons are learned and errors are not repeated.