Building applications and services that scale data across thousands or even tens of thousands of machines to serve millions of users presents a complex set of engineering challenges. This week, hundreds of engineers gathered in Seattle for the Data@Scale event to discuss and collaborate on better solutions for scaling data storage and processing.
The event featured speakers from leading technology companies, including Facebook, Microsoft, Dropbox, Twitter, Tableau, Dato, and Backblaze. Distributed data storage, query, visualization, search, and machine learning were among the topics covered.
Check out the videos from the event below. If you’re interested in joining the next event, please reach out to the @Scale team on the @Scale Facebook page.
After the introductions, Eric Hwang, software engineer at Facebook, kicked off the presentations with a talk on Presto, an open source distributed SQL query engine optimized for ad-hoc analysis at interactive speed. Hwang stressed the importance of creating a reliable engine for end users. Presto was designed and written from the ground up for running interactive analytic queries against big data sources.
The second speaker was Pawel Terlecki, engineering manager at Tableau, who discussed visualization-oriented data processing. Rapid increases in data volumes and in the complexity of applied analytical tasks pose a big challenge for visualization solutions. In this talk, he addressed the architecture of the key components that contribute to the cost of generating a visualization, such as Tableau Data Engine and Data Server, as well as recent performance improvements in Tableau 9. Terlecki also talked about making experiences highly interactive to keep users engaged.
Sundaram Narayanan, engineer at Twitter, presented on LogLens, a service from Twitter that provides indexing, search, and visualization of service logs in real time. LogLens improves the search experience for users by making it easier to find patterns in logs generated by services running on hundreds of machines. Narayanan discussed why this was created, how it was built, and the key challenges he came across in building the system.
Samir Goel of Dropbox talked about Firefly, an index and search system designed to handle the more than one billion adds and edits users make to Dropbox every day. Dropbox’s engineers had found it challenging to organize the service’s tremendous number of files. Goel discussed the design, complexities, and motivation behind Firefly, which is private, scalable, and fast.
Yucheng Low, one of the co-founders of Dato/GraphLab, gave a live demo of Précis, the scaling and resource utilization system that is key to Dato’s technology stack. Précis was created for resource efficiency and to maintain high performance with a small number of machines.
Dharma Shukla, a founder of the DocumentDB project, spoke after the lunch break. Azure DocumentDB, now available to Azure developers, is Microsoft’s cloud-based NoSQL database for storing and serving application data as JSON documents at internet scale. Shukla gave an overview of the DocumentDB system along with details of the indexing subsystem, including document representation, query language support, and the index implementation methods based on lock-free and log-structured technology, as well as early production experiences.
Muthu Annamalai from Facebook spoke about ZippyDB, a highly reliable key/value memory cache service used by many Facebook services. Annamalai covered the architecture of ZippyDB and the additional services built on top of it, which extend it beyond a traditional key-value store and make it well suited to the varied needs of Facebook’s applications. He also discussed ZippyDB’s ability to give application services tight control over data placement.
Microsoft’s Naresh Sundaram presented on Outlook Service, a key pillar of the Office 365 suite, and discussed new ways of thinking about and architecting cloud-scale services. Sundaram talked about some of the unconventional design decisions that were made, as well as what he learned from the journey.
Facebook software engineer Kestutis Patiejunas presented on Facebook’s two multi-exabyte cold storage systems that he helped build. The first uses hard drives; the other system, which he is currently working on, uses robotic loaders and Blu-ray discs. In his presentation, Patiejunas talked about building the systems, optimizing each for price and durability, and how the team challenged itself to revisit the entire stack in order to come up with a new solution.
Aaron Ogus, a partner development manager for the durable storage systems on which all of Windows Azure Storage is built, presented on the current state and evolution of storage media types, including HDD, flash, and archival media, and the best ways to use them for storage. He also talked about the power, space, and rough infrastructure costs required to deploy a zettabyte of storage in each storage class, as well as the requirements and what the future might hold for technology improvements.
Backblaze’s Brian Beach closed out the presentations for the day. He discussed Backblaze Vaults, the company’s new distributed storage system, which spreads files across multiple servers and keeps them available even if some servers are down. Beach also discussed the hardware used in the system, the Backblaze Storage Pod, an open-sourced design that holds 45 disk drives in a 4U enclosure, and explained how the system is durable, scalable, and performant while improving availability and operability.