Systems @Scale Tel Aviv is an invitation-only technical conference for engineers who build and maintain large-scale systems. Approximately 300 engineers from 70 companies gathered for the third Systems @Scale conference of the year, held in Tel Aviv. Speakers from Aleph VC, Facebook, Forter, Outbrain, and Singular Labs discussed scaling tools to support growing teams, building large-scale code repositories, tracking test quality, designing systems, and managing data centers.
If you missed the event, you can view recordings of the presentations below. If you are interested in future events, visit the @Scale website or join the @Scale community.
Scaling Facebook’s data center infrastructure
Joel Kjellgren, Site Ops Director, Facebook
Running a service such as Facebook requires a highly reliable, scalable, and efficient data center infrastructure. Joel explores technological innovations that push the boundaries of physical infrastructure, allowing Facebook to scale to serve and connect billions of people around the world.
Kill the mutants — ’cause it’s about time to test your tests
Yonatan Maman, Vice President, Outbrain
Unit tests are part of our day-to-day work, creating a need to measure unit test quality. Tests are supposed to prove the correctness of the code, even providing free regression testing when coupled with CI. The big question remains: What is the quality of these tests? Yonatan explores mutation testing and how it adapts the Chaos Monkey methodology to the world of unit tests. This entails injecting bugs into code to see whether the test suite catches the newly introduced bug. Creating mutations in the tested code and validating that the tests identify and kill those mutations helps measure the suite's quality. Yonatan acknowledges that mutation testing is not a new idea but says it was previously considered too theoretical and mainly applied in an academic context. Now that CPUs are faster and tools are better, mutation testing is gaining recognition as a practical technique for measuring test quality.
Managing trade-offs for data prefetching
Michal Trudler, Software Engineer, Facebook
The ability to prefetch data is an important lever in improving Facebook Lite responsiveness. It gives the perception of instant data availability by serving content from a local cache. However, excessive prefetching can lead to wasted data and performance regressions if the prefetched content is never used. Michal explores the technical challenges faced when serving cached content to people using Facebook Lite, and demonstrates how Facebook balances data usage and resources while maximizing prefetching.
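One way to frame the trade-off Michal describes is as a feedback loop: prefetch speculatively only while prefetched items are actually being used. The sketch below is purely illustrative; the class, the hit-rate threshold, and all names are assumptions, not Facebook Lite internals.

```python
class PrefetchCache:
    """Toy cache that throttles prefetching when the observed hit rate drops."""

    def __init__(self, min_hit_rate=0.5):
        self.cache = {}            # key -> value, populated ahead of use
        self.prefetched = set()    # keys fetched speculatively
        self.hits = 0              # prefetched items that were later read
        self.wasted_evictions = 0  # prefetched items evicted unread
        self.min_hit_rate = min_hit_rate

    def should_prefetch(self):
        total = self.hits + self.wasted_evictions
        # Prefetch freely until enough evidence accumulates, then gate on hit rate.
        return total < 10 or self.hits / total >= self.min_hit_rate

    def prefetch(self, key, fetch):
        if self.should_prefetch() and key not in self.cache:
            self.cache[key] = fetch(key)
            self.prefetched.add(key)

    def get(self, key, fetch):
        if key in self.cache:
            if key in self.prefetched:
                self.prefetched.discard(key)
                self.hits += 1       # the speculative fetch paid off
            return self.cache[key]
        return fetch(key)            # miss: fetch on demand

    def evict(self, key):
        if self.cache.pop(key, None) is not None and key in self.prefetched:
            self.prefetched.discard(key)
            self.wasted_evictions += 1   # data was fetched but never used
```

A production system would weigh network type, battery, and storage as well, but the shape of the decision is the same: spend bandwidth speculatively only while the speculation is paying off.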
The world changed. Did our designs?
Avishai Ish-Shalom, Engineer in Residence, Aleph VC
When we build systems, our design and trade-offs reflect the different scales of the system, the speed of disks, and the latency of the network. They reflect the constraints and abilities of underlying technological systems. As technology advances, some of these assumptions have become invalid. Avishai conveys how changes in hardware technologies affect the design rationale of various systems. He highlights the importance of understanding and rethinking the design rationale, exploring new designs stemming from the new rationale.
The journey for a new ORM in Go
Ariel Mashraki, Software Engineer, Facebook
Over the course of the last year, Go became the main programming language for developing services in Facebook Connectivity, some of which contain complicated data models with varying types and relationships. At Facebook, engineers think about their data model in graph concepts, and they've had positive experiences with this model internally. The lack of a proper graph-based ORM for Go led engineers to write one and open-source it. Ariel shares his experience taking this concept from idea to implementation, diving into some of the challenges and technical decisions the team confronted in the process.
Memory analysis at scale
Erez Alon, Engineering Manager, Facebook
Facebook runs huge Java services; this applies both to the size of a single process and to the scale of its server fleet. Facebook Lite is one of these dominant Java services and serves hundreds of millions of users every month. The architecture of Facebook Lite is unique, offloading a client's typical work to the server and causing it to evolve into a memory-bound service. This architecture provides clear advantages to Facebook Lite users and developers, while also making it difficult for service owners to keep the service healthy and safe from memory regressions.
For instance, even a memory regression of 1 percent has significant stability and computational-cost implications for the production system and must be detected and blocked as soon as possible. Erez covers the evolution of Facebook Lite, starting from a time when memory regressions made it vulnerable, through building a scalable and advanced memory analysis infrastructure, to providing high-granularity memory visibility to developers and ultimately enabling them to drive the service to optimal efficiency and secure memory wins.
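The "detect and block a 1 percent regression" gate can be sketched as a simple per-release comparison. This is an assumption-laden toy, not Facebook's infrastructure: it averages heap samples from a candidate build against the current release and flags growth beyond a threshold.

```python
from statistics import mean

def memory_regression(baseline_mb, candidate_mb, threshold=0.01):
    """Return (relative growth, whether it breaches the 1 percent threshold).

    baseline_mb / candidate_mb: per-host heap samples in MB (illustrative).
    """
    growth = (mean(candidate_mb) - mean(baseline_mb)) / mean(baseline_mb)
    return growth, growth > threshold

# Candidate build uses ~2 percent more memory -> the push would be blocked.
growth, blocked = memory_regression([512, 515, 510], [522, 525, 521])
print(blocked)  # → True
```

At fleet scale the real problem is far harder -- noisy samples, many binaries, and attributing the growth to a code change -- which is what the high-granularity memory analysis infrastructure Erez describes is for.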
The challenge to align data points at scale
Ron Konigsberg, Chief Architect, Singular
Singular combines data pulled periodically from 2,500-plus sources with streamed data received in real time. Joining these data sets creates a few unique challenges. Among these are frequent changes in the periodically pulled data, which affect real-time data retroactively. Another challenge is the out-of-order arrival of periodic and real-time data, which must nonetheless be aligned and matched. Ron shares some of the tricks Singular has used to keep the data aligned at scale, including separating frequently and infrequently changed data to streamline alignment, detecting changes in the data using consistent hashing, and storing data to efficiently apply changes with bz2 inline-block edit optimization.
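The change-detection idea can be illustrated with plain content hashing: hash each pulled record and re-align only the keys whose hash differs from the previous pull. This is a simplified stand-in for the consistent-hashing scheme Ron describes, and the record shapes and key names are invented for the example.

```python
import hashlib
import json

def record_hash(record):
    # Canonical JSON keeps the hash stable across key ordering.
    blob = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def changed_keys(previous_hashes, pulled_records):
    """Return the keys whose content differs from the last pull,
    updating the stored hashes in place."""
    changed = []
    for key, record in pulled_records.items():
        h = record_hash(record)
        if previous_hashes.get(key) != h:
            changed.append(key)
            previous_hashes[key] = h
    return changed

seen = {}
pull1 = {"campaign:1": {"spend": 10.0}, "campaign:2": {"spend": 7.5}}
changed_keys(seen, pull1)            # first pull: everything is new
pull2 = {"campaign:1": {"spend": 12.0}, "campaign:2": {"spend": 7.5}}
print(changed_keys(seen, pull2))     # → ['campaign:1']
```

Only `campaign:1` needs retroactive re-alignment against the real-time stream; the unchanged rows are skipped entirely, which is what makes the approach cheap at 2,500-plus sources.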
Monorepos: Moving fast in a huge repository
Durham Goode, Software Engineer, Facebook
Keeping all of your code in a single repository has huge benefits but comes with equally huge obstacles. Durham discusses the challenges Facebook has faced within its codebase, and how it’s radically extending its source control system in order to enable the entire ecosystem of developer tools to remain fast in the face of growth at scale. He introduces the concept of a monorepo, providing a rough idea of Facebook’s repository scale, and the problems it causes in development. He concludes with a few source control innovations they’ve made to tackle these challenges.
Operating low-latency fraud prevention systems at scale
Re’em Bensimhon, Principal Engineer, Forter
Forter engineers are on a mission to build the foundation for a more credible internet by blocking fraudsters and abusers on e-commerce platforms. To achieve that, they need to make millions of high-risk, low-latency decisions per day while processing billions of events. Re'em shares how they do all of this with a very lean and mean R&D team, explains how they've had to invent many solutions from the ground up, and shares the insights they've gained along the way.
Detection & alerting at Facebook: Detecting significant metric movements at scale
Ben Southgate, Software Engineer, Facebook
Monitoring metrics for any significant movements is key to detecting problems with systems and products. Ben provides an overview of Facebook’s detection and alerting framework. He covers the scale in the number of time series they monitor, the different detection algorithms they offer (rule-based and ML-based), and the ability to auto-slice data along multiple dimensions to identify deeper issues. Deriving signal without being inundated with noise is important at Facebook scale, so they’ve built tools to empower teams to maintain a high signal-to-noise ratio. To cater to future scale needs, they’re currently focused on automatic monitoring: proactively logging and monitoring the right metrics for different artifacts, proactively analyzing any flagged events, and hopefully predicting potential critical incidents.
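The rule-based end of the spectrum Ben mentions can be sketched as a trailing-window outlier test: flag a reading that deviates from recent history by more than k standard deviations. The window size and threshold here are illustrative parameters, not Facebook's actual configuration.

```python
from statistics import mean, stdev

def significant_movement(history, value, k=3.0, window=20):
    """Flag `value` if it sits more than k standard deviations from the
    trailing window of `history`. A toy rule-based detector."""
    recent = history[-window:]
    if len(recent) < 2:
        return False              # not enough data to estimate spread
    mu, sigma = mean(recent), stdev(recent)
    if sigma == 0:
        return value != mu        # flat series: any change is a movement
    return abs(value - mu) > k * sigma

history = [100, 101, 99, 100, 102, 98, 100, 101, 99, 100]
print(significant_movement(history, 100.5))  # within normal spread → False
print(significant_movement(history, 140))    # spike → True
```

ML-based detectors replace the fixed window-and-threshold rule with learned seasonality and trend models, and the auto-slicing Ben describes reruns a detector like this per dimension value to localize which slice moved.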