This week we held our third annual @Scale conference in San Jose, where engineers from hundreds of companies gathered to openly discuss the challenges of building apps and systems that scale to millions or billions of people and to collaborate on the development of new solutions.
The lineup this year featured speakers from Airbnb, Amazon, Databricks, Dropbox, Facebook, Flipkart, GE, GitHub, Google, Instagram, LinkedIn, Microsoft, Netflix, NVIDIA, Oculus, PayPal, Pinterest, Quip, Slack, Spotify, Square, Uber, and YouTube. These companies alone host nearly 2,800 open source projects that are collectively followed by 1.5 million engineers around the world — and that's just a fraction of the companies that are part of the @Scale community — demonstrating the reach and impact of the technologies presented over the course of the day.
Attendees heard talks across four tracks: data, dev tools and ops, mobile, and a new hot topics track focused on innovative technologies such as video and machine learning. Below are some highlights from this year's event.
Facebook's head of engineering and infrastructure, Jay Parikh, kicked off the day with a keynote about the importance of resiliency when building at scale. After Hurricane Sandy took down internet services along the East Coast, Facebook began to evaluate its own ability to keep operating in a similar circumstance. The result was “Project Storm,” a company-wide initiative to work through massive drills simulating what would happen if a data center were to experience a widespread outage. Parikh talked about the planning and tooling that not only made these drills successful but also helped bolster day-to-day operations.
Parikh also sat down with Ping Li from Accel to talk about the open adoption software movement and how the venture capital industry is embracing it.
Next up was Himagiri Mukkamala, who discussed the challenges and the opportunities that arose as GE shifted from an industrial company to a digital one. As the head of engineering on Predix, GE's cloud-based operating system for industrial applications, he thinks about how to bridge software and sensors with the machines the company is traditionally known for manufacturing. GE assets generate 1 exabyte of data per day, and the system combines this data with machine learning to predict when or how equipment such as a jet engine might fail earlier than a scheduled maintenance or replacement, allowing industrial partners to improve and extend the lifespan of their machinery.
Zstandard: Real-time data compression at scale
In a standing-room-only talk in the data track, Facebook's Yann Collet revealed Zstandard 1.0, a new open source data compression algorithm that both compresses and decompresses faster and produces smaller compressed output than current algorithms.
Amazon Aurora: An under-the-hood view of a cloud-scale relational database service
Amazon Aurora is a unique MySQL-compatible relational database offering on AWS. Debanjan Saha shared some of the innovations behind Amazon Aurora—including its self-healing, fault-tolerant, and scale-out architecture—that have helped it achieve up to 5x higher performance versus MySQL, as well as improved availability and durability over traditional offerings.
Dev tools & ops
Blazing fast: Scaling iOS at Uber
In the dev tools track, Alan Zeino and Nick Cobb from Uber discussed how they ship new code every week with the goal of allowing only one minute of downtime per week worldwide. Along with tooling and automation to catch regressions, the company focused its efforts on migrating all code into a monolithic repository using the Buck build tool. They also announced a set of open source tools including Buck Swift support, Buck dynamic framework support, HTTP Build Cache for Buck, Ohana, and the Rides SDK.
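To put the one-minute-per-week goal in perspective, the arithmetic can be sketched in a few lines. This is only an illustration of the downtime budget mentioned in the talk; the function names are hypothetical and not part of any Uber tooling.

```python
# Illustrative arithmetic only: converts a weekly downtime budget into an
# availability figure and back. Function names are hypothetical.

MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080 minutes in a week


def availability(downtime_minutes_per_week: float) -> float:
    """Fraction of the week the service is up, given minutes of downtime."""
    return 1.0 - downtime_minutes_per_week / MINUTES_PER_WEEK


def downtime_budget(target_availability: float) -> float:
    """Minutes of downtime per week allowed by a target availability."""
    return (1.0 - target_availability) * MINUTES_PER_WEEK


# One minute of downtime per week is roughly "four nines" of availability.
print(f"{availability(1) * 100:.3f}%")  # 99.990%
```

Seen the other way around, a 99.99 percent availability target leaves a budget of about one minute per week.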
Experimentation at scale & replicated RocksDB at Pinterest
Chunyan Wang and Bo Liu talked about how Pinterest scaled from a 50-person startup and grew its experimentation culture, relying on A/B experiments to make product decisions. As their service became more real-time, they built stateful online services on replicated RocksDB. They shared details about the RocksDB replicator's design, implementation, and performance, and hope to open-source the system soon.
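For readers new to A/B experiments, one common way to split users into groups is deterministic hashing, so the same person always sees the same variant without any stored state. The sketch below is a generic illustration under that assumption, not a description of Pinterest's system; `assign_variant`, the experiment name, and the variant labels are all hypothetical.

```python
import hashlib
from typing import Sequence


def assign_variant(
    user_id: str,
    experiment: str,
    variants: Sequence[str] = ("control", "treatment"),
) -> str:
    """Deterministically map a user to a variant of one experiment.

    Hashing the experiment name together with the user ID keeps a user's
    assignment stable within an experiment while keeping assignments
    independent across different experiments.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an unsigned integer bucket.
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


# Hypothetical usage: the same inputs always yield the same variant.
print(assign_variant("user-42", "new_home_feed"))
```

Because the assignment is a pure function of its inputs, any service can recompute it on the fly, which matters once experiments run across many machines.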
Flipkart: Building with a mobile-first approach in India
Flipkart was the first Indian app to reach 50 million downloads in India. Amar Nagaram shared some of the challenges the company faced while building for the unique mobile ecosystem in India, and covered the technology investments the company made in its mobile-first approach, including providing offline support for its mobile web-browsing sites and using Project Proteus for server-defined layouts for its mobile app.
HTTP2 Push: Lower latencies around the world
One of the features of the new HTTP/2 protocol is “server push,” where a server pushes resources automatically without waiting for a client request. Saral Shodhan and Ranjeeth Dasineni discussed how Facebook leveraged the technology to push photos from the server for a News Feed story and reduce round-trip times by at least 15 percent on a range of mobile networks. They also explored how the company's new client/server interaction model could improve streaming for Live video, and left attendees with practical advice on how to adopt HTTP2 Push in their own networks.
Creating and scaling Spotify's Discover Weekly playlist
Spotify curates and delivers personalized playlists to its 100 million active users every week. Edward Newett discussed how a team of 12 engineers created Discover Weekly, using implicit matrix factorization and deep learning to select tracks, and the technical challenges they faced having to update 100 million playlists on time every week.
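Matrix factorization represents each listener and each track as a short vector of latent factors, and a track is a good candidate when its vector lines up with the listener's. The sketch below shows only that final scoring step, assuming the factors have already been learned; the vectors, track names, and the `recommend` helper are made up for illustration and do not reflect Spotify's actual pipeline.

```python
from typing import Dict, List, Set

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def recommend(
    user_vec: Vector,
    track_vecs: Dict[str, Vector],
    already_played: Set[str],
    k: int,
) -> List[str]:
    """Return the k unplayed tracks whose factors best match the user's."""
    scored = [
        (dot(user_vec, vec), track)
        for track, vec in track_vecs.items()
        if track not in already_played
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [track for _, track in scored[:k]]


# Made-up two-dimensional factors, purely for illustration.
user = [0.9, 0.1]
tracks = {
    "track_a": [1.0, 0.0],
    "track_b": [0.0, 1.0],
    "track_c": [0.8, 0.3],
    "track_d": [0.2, 0.9],
}
print(recommend(user, tracks, already_played={"track_a"}, k=2))
```

In this toy setup the listener's strongest match, `track_a`, is excluded because it has already been played, which is the point of a discovery playlist.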
Dropbox Infinite: A different kind of distributed file system
The amount of content people create has grown at a rate with which desktop storage cannot keep up. Dropbox's Ben Newhouse presented Project Infinite, a file system that enables people to see all the content across their entire team on their own computer while syncing files only on demand. To maintain a high quality bar while deploying the file system across hundreds of millions of devices, the company adopted a framework called SANDD (Sleep at Night Driven Development) that continuously tests and monitors for vulnerabilities such as kernel panics, deadlocks, and data loss.
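The idea of showing every file while syncing only on demand can be pictured as a placeholder that knows a file's metadata up front and downloads its bytes on first read. The following is a toy sketch of that general pattern, not Dropbox's implementation; `PlaceholderFile` and the `fetch` callback are hypothetical names.

```python
from typing import Callable, Optional


class PlaceholderFile:
    """A file entry that is visible immediately but fetched lazily."""

    def __init__(self, path: str, size: int, fetch: Callable[[str], bytes]):
        self.path = path
        self.size = size          # metadata is known without downloading
        self._fetch = fetch       # hypothetical callback that pulls remote bytes
        self._content: Optional[bytes] = None

    @property
    def is_local(self) -> bool:
        """True once the file's bytes have been downloaded."""
        return self._content is not None

    def read(self) -> bytes:
        """Download the content on first access, then serve it locally."""
        if self._content is None:
            self._content = self._fetch(self.path)
        return self._content


# Hypothetical usage with an in-memory stand-in for remote storage.
remote_store = {"/team/report.pdf": b"quarterly numbers"}
entry = PlaceholderFile("/team/report.pdf", 17, remote_store.__getitem__)
print(entry.is_local)   # False: listed, but nothing downloaded yet
entry.read()
print(entry.is_local)   # True: content was fetched on first read
```

A real file system has to do this inside the operating system's file APIs, which is where failure modes such as kernel panics and deadlocks come from.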
GPUs and deep learning deployments at scale
Deep learning models are becoming increasingly complex and have such high computation loads that it's hard to produce real-time training results. Robert Ober from NVIDIA explored the main architectural components of GPUs that make them useful for deep learning and capable of high-throughput prediction — as high as tens of exaflops per second.
In the spirit of openness embodied by the @Scale community, the day ended with a panel of engineers from Facebook, Microsoft, and Netflix, moderated by GitHub, on best practices for running high-quality open source programs. The companies shared how they support their programs internally and leverage community knowledge to improve their projects, as well as the challenges they face in making sure their open source products and tools are beneficial for both the business and the community.
These are just a few examples of the challenges and solutions shared by the @Scale community this year. More session videos are available here. Please join the @Scale community or visit the @Scale page to stay up to date on future events.