The types of content people share on News Feed have evolved from text, to photo, to video, to rich immersive experiences like 360 and Live. As these experiences become more sophisticated, it gets more challenging to render items on screen using the traditional Android UI framework, even on today’s powerful high-end devices. On News Feed, not being able to render items quickly enough can result in awkward jitters and pauses instead of the smooth scrolling experience we want to provide.
By taking advantage of the unique capabilities offered by two Facebook open source projects — Litho and Infer — we were able to design a platform that ensures each story renders quickly and reliably regardless of its content, and provides these assurances by default so engineers can focus on shipping great products without performance concerns.
Litho is a declarative Android UI framework that Facebook announced and open-sourced earlier this year. Among its other benefits, Litho moves the heavy computation required to render UIs onto separate threads — an architecture known as multithreaded rendering. We decided to move News Feed to this architecture to more efficiently deliver a variety of rich, immersive story formats in a smooth scrolling feed. Multithreaded rendering is technically challenging and usually reserved for complex rendering like 3D games. As a result, one of the challenges we faced was to ensure that the work being split between multiple threads was being executed error-free.
Infer, Facebook’s open source static analyzer, was already developing a new capability that could automatically check for the complex class of bugs that could potentially result from using a multithreaded programming model. We realized that by joining forces, the teams could help each other solve their respective technical challenges. And it worked — the News Feed migration to Litho helped Infer focus on a specific set of problems to tackle, and the new capabilities that resulted provided the assurances we needed to support multithreaded rendering on News Feed. To date, Infer’s new thread-safety analyzer has identified hundreds of issues that have been addressed by our engineers before reaching production, helping to ensure a good experience for the people who use our Android app.
Taken together, this solution not only improved the performance of News Feed — opening the door to even richer front-end user experiences — but proved capable of successfully and reliably executing multithreaded rendering at scale, which has remained elusive on Android until now.
Supporting background layout with Litho
Most Android devices refresh their screens at roughly 60 frames per second. Smooth scrolling requires the entire computation for a single frame of the UI to complete in less than 16.7 milliseconds. If the computation takes too long, the scroll animation skips frames and the smooth scrolling experience is interrupted.
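The 16.7 ms figure follows directly from the refresh rate: 1000 ms divided by 60 frames. A minimal sketch of that arithmetic is below; the class and method names are hypothetical, and the skipped-frame count is a simplified model that assumes a late frame misses one refresh for every extra budget interval it consumes.

```java
// Hypothetical helper, not part of Litho or the Android framework.
final class FrameBudget {
    /** Milliseconds available to produce one frame at the given refresh rate. */
    static double frameBudgetMillis(double framesPerSecond) {
        return 1000.0 / framesPerSecond;
    }

    /** Refresh intervals missed when computing one frame takes workMillis (simplified model). */
    static int skippedFrames(double workMillis, double framesPerSecond) {
        double budget = frameBudgetMillis(framesPerSecond);
        // Work that fits inside one budget interval skips nothing; each
        // additional interval consumed counts as one missed refresh.
        return Math.max(0, (int) Math.ceil(workMillis / budget) - 1);
    }
}
```

Under this model, a frame that needs 40 ms of UI-thread work at 60 frames per second misses two refreshes.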
Android scrolling surfaces are typically implemented with the RecyclerView widget, which requests new content to be displayed on screen. The application generates the requested content by transforming raw server data and local device state into interactive visual elements. The entire transformation process must happen synchronously on the UI thread during a single frame; this includes inflating a view hierarchy (unless it’s recycled), binding data to the views, measuring the views’ positions, and finally uploading the drawing instructions to the GPU. This is a lot of work to do in 16.7ms! Given News Feed’s visual complexity, it can be difficult to perform all of these functions on the UI thread without skipping frames.
Litho brings smooth scrolling back to News Feed by performing the heavy pieces of the rendering computation before frame-time on a background thread separate from the Android UI toolkit, a process we call background layout. Then at frame-time, only a minimal amount of work needs to happen synchronously on the UI thread. Such an asynchronous, multithreaded rendering architecture has historically been an elusive feat of UI engineering, but is now possible in Java with Litho’s immutable and unidirectional data flow model. In contrast, multithreading optimizations cannot be performed within the confines of the Android UI toolkit; in fact, doing so is explicitly discouraged by the Android documentation.
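The general shape of background layout can be sketched in plain Java. This is an illustration of the idea only, not Litho's API: the class names, the 10-pixels-per-character and 20-pixels-per-line figures, and the single-thread executor standing in for the UI thread are all assumptions. The point it shows is that an immutable input and an immutable result can cross threads without locking, leaving only a small step for the UI thread.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class BackgroundLayoutSketch {
    /** Immutable input: safe to hand to another thread. */
    static final class StoryProps {
        final String title;
        final int widthPx;
        StoryProps(String title, int widthPx) { this.title = title; this.widthPx = widthPx; }
    }

    /** Immutable output of the expensive measure/layout step. */
    static final class LayoutResult {
        final String title;
        final int widthPx;
        final int heightPx;
        LayoutResult(String title, int widthPx, int heightPx) {
            this.title = title; this.widthPx = widthPx; this.heightPx = heightPx;
        }
    }

    /** Stand-in for the heavy computation; a pure function of its input. */
    static LayoutResult computeLayout(StoryProps props) {
        int charsPerLine = Math.max(1, props.widthPx / 10);  // assumed 10 px per character
        int lines = (props.title.length() + charsPerLine - 1) / charsPerLine;
        return new LayoutResult(props.title, props.widthPx, Math.max(1, lines) * 20);  // assumed 20 px per line
    }

    /** Computes layout off the "UI thread", then does only the cheap final step on it. */
    static String render(StoryProps props) {
        ExecutorService layoutThread = Executors.newSingleThreadExecutor();
        ExecutorService uiThread = Executors.newSingleThreadExecutor();  // stand-in for the real UI thread
        try {
            return CompletableFuture
                .supplyAsync(() -> computeLayout(props), layoutThread)
                .thenApplyAsync(
                    l -> "draw " + l.title + " at " + l.widthPx + "x" + l.heightPx, uiThread)
                .join();
        } finally {
            layoutThread.shutdown();
            uiThread.shutdown();
        }
    }
}
```

Because `computeLayout` reads only its argument and returns a fresh immutable object, nothing in this sketch needs synchronization. The multithreading bugs discussed later arise when code on the layout path touches shared mutable state instead.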
Background layout in News Feed
When creating a new Litho surface from scratch, careful adherence to its functional programming paradigm and state management APIs will prevent most multithreading issues, making background layout straightforward to enable. However, News Feed is an existing surface, originally coded with the overall assumption of single-threadedness. Since we only just moved News Feed to Litho, some code still does not follow the functional paradigm, which leaves the app potentially vulnerable to multithreading bugs.
In order to enable asynchronous rendering in News Feed we were faced with two main challenges. First, we had to ensure that all potentially vulnerable code in the codebase was made thread-safe, or else we risked shipping race conditions and degrading the user experience. Second, adding threading to the rendering pipeline increases the complexity of the code, so we also wanted to enable engineers to work in the codebase without having to worry about introducing regressions.
The News Feed codebase contains Litho hierarchies responsible for rendering all the different types of stories. Below is a simplified model of one of these hierarchies, where the parent story UI component has children for a title, content, and feedback; the title has children for text and network image.
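The simplified hierarchy described here can also be written down as a small tree. This is a sketch only: `ComponentNode` and the node names are hypothetical labels for the components in the description, not actual Litho classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of a component tree; not a Litho type.
final class ComponentNode {
    final String name;
    final List<ComponentNode> children;

    ComponentNode(String name, ComponentNode... children) {
        this.name = name;
        // Defensive copy so the tree stays immutable after construction.
        this.children = Collections.unmodifiableList(new ArrayList<>(Arrays.asList(children)));
    }

    /** Total number of nodes in this subtree, including this node. */
    int size() {
        int total = 1;
        for (ComponentNode child : children) {
            total += child.size();
        }
        return total;
    }

    /** The story hierarchy from the text: title (text, network image), content, feedback. */
    static ComponentNode simplifiedStory() {
        return new ComponentNode("Story",
            new ComponentNode("Title",
                new ComponentNode("Text"),
                new ComponentNode("NetworkImage")),
            new ComponentNode("Content"),
            new ComponentNode("Feedback"));
    }
}
```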
In reality the hierarchy is much larger as there are thousands of Litho classes, delegates, and dependencies that contribute to the rendering process. At this scale, there is too much code to manually evaluate every method involved for thread safety — and that’s assuming that we are even capable of manually checking for such issues. The source of a race condition or deadlock, the two most common multithreading bugs, is often harder to discover than other types of bugs — even when you know one is present in the application.
Imagine you receive a stack trace caused by a race condition, which occurs when multiple threads happen to interleave in a way that causes incorrect behavior. It’s unlikely that you will be able to reproduce that condition locally, which prevents you from tracking down the error’s source. Sometimes there is no stack trace at all, because the race condition causes incorrect behavior but doesn’t crash the app. That’s the case with the picture below, where a race condition caused the background resource for the “Add Friend” button to be lamentably superseded by angry faces in an employee pre-release of the app.
Concurrency analysis with Infer
Fortunately, Infer, Facebook’s open source static analyzer, simplified the challenges of enabling asynchronous rendering. Infer can quickly detect various types of bugs in a codebase of any size before the code hits production. At Facebook, it runs automatically on every code change, checking for bugs such as null pointer exceptions and resource leaks. If it finds any issues, it will auto-comment on the line of the code change that seems unsafe, enabling engineers to fix it quickly. This helps to reduce the crash rate and prevent shipping a buggy application binary to our users.
When the background layout project on News Feed began, the Infer team was already developing the new thread-safety analyzer. This was challenging: effective static analysis of concurrent code is a longstanding open problem that remains an active area of research, and the difficulties are even more acute when the analysis needs to be applied at scale. By teaming up during the migration to Litho and background layout, the development of Infer’s thread-safety capability narrowed its focus to the unique challenges of the migration team, whose feedback in turn allowed the analysis development effort to start small and make steady incremental improvements on what had previously seemed like an intractable problem.
At the core of the thread-safety analysis is the counterintuitive idea that the exact thread(s) on which a particular piece of code runs does not matter. Instead, Infer simply assumes that any non-private method on a class marked @ThreadSafe could potentially run on any thread unless told otherwise. This frees the analyzer to focus on the somewhat easier problem of precisely tracking accesses that might race. Infer reports a race when it detects two accesses that might touch the same memory where at least one access is a write.
Consider an example in which the public method makeDinner() delegates to the private method boilWater(). Although there are no calls to new Thread().run(), Infer assumes that makeDinner() might be called concurrently on two different threads and then traces the call chain through boilWater(), where it detects that mTemperature could have conflicting writes. Infer will generate an error report for mTemperature, which can be fixed by guarding the write with a lock (e.g., by making boilWater() synchronized).
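A minimal sketch of the makeDinner() example, reconstructed from the description above. The method and field names come from the text; the class names `Dinner` and `SafeDinner`, the value written, and the `temperature()` accessor are assumptions. The `ThreadSafe` annotation declared here is a local stand-in so the sketch compiles on its own; in a real project you would use the annotation that ships with Infer.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Local stand-in for Infer's annotation, declared only to keep this sketch self-contained.
@Retention(RetentionPolicy.CLASS)
@Target(ElementType.TYPE)
@interface ThreadSafe {}

@ThreadSafe
class Dinner {
    private int mTemperature;

    // Non-private method on a @ThreadSafe class: assumed callable from any thread.
    public void makeDinner() {
        boilWater();
    }

    // Unsynchronized write: two concurrent makeDinner() calls can both reach
    // this line, which is the conflicting write the analysis would flag.
    private void boilWater() {
        mTemperature = 100;
    }
}

@ThreadSafe
class SafeDinner {
    private int mTemperature;

    public void makeDinner() {
        boilWater();
    }

    // The fix described in the text: the write happens while holding the object's lock.
    private synchronized void boilWater() {
        mTemperature = 100;
    }

    // Hypothetical accessor; it reads under the same lock so it does not race with the write.
    public synchronized int temperature() {
        return mTemperature;
    }
}
```

Note that neither class creates a thread. The report on `Dinner` would come purely from the @ThreadSafe assumption that `makeDinner()` may be entered from two threads at once.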
You can imagine how complex this becomes when the call chain goes through lots of different objects, methods, loops, and conditional branches. Part of what makes Infer amazing is that it effectively maintains a precise picture of potentially racy program behavior at a large scale.
For every method in the codebase, the thread-safety analysis summarizes the heap cells accessed by the method, including the cells accessed transitively by callees. It also includes the stack traces to each access, the locks held at the time of the access, and information about which accesses are safe if the parameters passed by the caller are “owned” (e.g., if the caller has a unique pointer to the parameter that it passes). The analysis then aggregates the results for all procedures to determine which accesses may conflict, and it reports write/write races and read/write races based on this information.
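The aggregation step can be pictured with a toy model. This is not Infer's implementation: the `Access` record, the use of field names as heap cells, the single `owned` flag, and the rule "two accesses may race if they share no lock" are all simplifying assumptions chosen to mirror the summary contents listed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of summary aggregation; all names here are hypothetical.
final class RaceSketch {
    /** One heap access recorded in a method summary. */
    static final class Access {
        final String method;          // method whose summary contains the access
        final String field;           // heap cell, modelled as a field name
        final boolean isWrite;
        final boolean owned;          // true if the accessed object is uniquely owned by the caller
        final Set<String> locksHeld;  // locks held at the time of the access

        Access(String method, String field, boolean isWrite, boolean owned, String... locksHeld) {
            this.method = method;
            this.field = field;
            this.isWrite = isWrite;
            this.owned = owned;
            this.locksHeld = new HashSet<>(Arrays.asList(locksHeld));
        }
    }

    /** Two accesses may race if they touch the same cell, one writes, and no common lock guards them. */
    static boolean mayRace(Access a, Access b) {
        if (!a.field.equals(b.field)) return false;
        if (!a.isWrite && !b.isWrite) return false;   // read/read is never a race
        if (a.owned || b.owned) return false;         // owned objects are not shared
        return Collections.disjoint(a.locksHeld, b.locksHeld);
    }

    /** Checks every pair, including an access against itself (the same method on two threads). */
    static List<String> report(List<Access> accesses) {
        List<String> races = new ArrayList<>();
        for (int i = 0; i < accesses.size(); i++) {
            for (int j = i; j < accesses.size(); j++) {
                Access a = accesses.get(i);
                Access b = accesses.get(j);
                if (mayRace(a, b)) {
                    String kind = (a.isWrite && b.isWrite) ? "write/write" : "read/write";
                    races.add(kind + " race on " + a.field + ": " + a.method + " vs " + b.method);
                }
            }
        }
        return races;
    }
}
```

In this model, an unlocked write to mTemperature in boilWater() is reported as racing with itself, matching the makeDinner() example, while the same write recorded with a held lock produces no report.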
We ran Infer on the complete set of Litho classes in our codebase. This enabled us to fix lots of concurrency issues in both the Litho classes and in the classes outside of the UI that they refer to. We also configured Infer to run continuously on all future code changes that touch these classes. In the 6 months that the concurrency checks have been running, we’ve found many complex bugs that were then fixed before the code was committed to our repo.
Without Infer, multithreading in News Feed would not have been tenable. What’s more, Infer works with any Java/Android code, which will allow all sorts of scalable applications to achieve the efficiency of multithreading in a safe way.
Lessons learned
When we released the initial test for background layout in News Feed, we measured a minor decrease in scroll performance (calculated by sampling the quantity and magnitude of skipped frames in production). We realized that increasing the number of threads for rendering was by itself not a magic solution. With further investigation we noticed different causes of contention between the background layout thread and the UI thread (contention is when a thread has to wait for a resource to become available) that were slowing down rendering.
With some fixes and optimizations, we ultimately enabled background layout successfully in News Feed while maintaining a neutral crash rate and improving scroll performance. We’re just getting started and believe there are still some remaining sources of contention to resolve and additional opportunities for better task scheduling.
We have also seen how effective static analysis can be a game changer in dealing with concurrency, one of the more challenging areas of programming. Infer is effective at finding a specific kind of concurrency issue, data races, quickly and without reporting too many false positives. This success opens up new possibilities for doing even more, and we are planning to extend Infer’s capabilities in a number of directions, such as covering new bug types and improving precision, to provide even more help to programmers writing concurrent code.
Both Litho and Infer are open source, including Infer’s new thread-safety check, and we hope that our learnings will help more Android developers enable multithreaded rendering in a safe, scalable way.