Move faster, wait less: Improving code review time at Meta

Code reviews are one of the most important parts of the software development process.
At Meta we’ve recognized the need to make code reviews as fast as possible without sacrificing quality.
We’re sharing several tools and steps we’ve taken at Meta to reduce the time waiting for code reviews.

When done well, code reviews can catch bugs, teach best practices, and ensure high code quality. At Meta we call an individual set of changes made to the codebase a “diff.” While we like to move fast at Meta, every diff must be reviewed, without exception. But, as the Code Review team, we also understand that when reviews take longer, people get less done.

We’ve studied several metrics to learn more about code review bottlenecks that lead to unhappy developers and used that knowledge to build features that help speed up the code review process without sacrificing review quality. We’ve found a correlation between slow diff review times (P75) and engineer dissatisfaction. Our tools to surface diffs to the right reviewers at key moments in the code review lifecycle have significantly improved the diff review experience.

What makes a diff review feel slow?

To answer this question we started by looking at our data. We track a metric that we call “Time In Review,” which is a measure of how long a diff is waiting on review across all of its individual review cycles. We only account for the time when the diff is waiting on reviewer action.

Time In Review is calculated as the sum of the time spent in blue sections.
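A minimal sketch of that calculation, assuming a hypothetical event log in which each state change of a diff is recorded with a timestamp. The state names and the `time_in_review` helper are illustrative assumptions, not Meta's actual schema or tooling.

```python
from datetime import datetime, timedelta

# Hypothetical event model (an assumption for illustration): each entry is
# (timestamp, state), where state says who the diff is waiting on from then on.
def time_in_review(events, closed_at):
    """Sum only the intervals in which the diff was waiting on reviewer action."""
    total = timedelta(0)
    # Pair each event with the next one; the last interval ends when the diff closes.
    boundaries = events[1:] + [(closed_at, None)]
    for (start, state), (end, _) in zip(events, boundaries):
        if state == "waiting_on_reviewer":
            total += end - start
    return total

events = [
    (datetime(2021, 3, 1, 9, 0), "waiting_on_reviewer"),   # diff published
    (datetime(2021, 3, 1, 12, 0), "waiting_on_author"),    # changes requested
    (datetime(2021, 3, 1, 15, 0), "waiting_on_reviewer"),  # author updated the diff
]
closed_at = datetime(2021, 3, 1, 16, 30)                   # diff accepted

result = time_in_review(events, closed_at)
```

In this made-up example the diff was open for seven and a half hours, but only the two reviewer-side intervals (three hours, then an hour and a half) count toward Time In Review.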

What we discovered surprised us. When we looked at the data in early 2021, our median (P50) time in review for a diff was only a few hours, which we felt was pretty good. However, looking at P75 (the 75th percentile, which marks where the slowest 25 percent of reviews begin) we saw diff review time increase by as much as a day.
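To make the gap between P50 and P75 concrete, here is a small sketch using the nearest-rank percentile definition. The sample hours are made up for illustration and are not Meta's data; the `percentile` helper is likewise an assumption, and other percentile definitions interpolate slightly differently.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

# Made-up hours in review for eight diffs (illustrative only).
hours_in_review = [1, 2, 2, 3, 3, 20, 26, 30]

p50 = percentile(hours_in_review, 50)
p75 = percentile(hours_in_review, 75)
```

With this sample the median looks healthy while P75 is several times larger, which is the kind of pattern a median alone would hide.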

We analyzed the correlation between Time In Review and user satisfaction (as measured by a company-wide survey). The results were clear: The longer someone’s slowest 25 percent of diffs take to review, the less satisfied they are with their code review process. We now had our north star metric: P75 Time In Review.

Driving down Time In Review would not only make people more satisfied with their code review process, it would also increase the productivity of every engineer at Meta. When diffs spend less time in review, our engineers spend significantly less time waiting on reviews, which makes them more productive as well as more satisfied with the overall review process.

Balancing speed with quality

However, simply optimizing for the speed of review could lead to negative side effects, like encouraging rubber-stamp reviewing. We needed a guardrail metric to protect against negative unintended consequences. We settled on “Eyeball Time” – the total amount of time reviewers spent looking at a diff. An increase in rubber-stamping would lead to a decrease in Eyeball Time.
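One way to picture how a goal metric and a guardrail metric work together is the sketch below. It assumes per-diff samples for a control and a treatment group, compares simple means, and uses an arbitrary 5 percent tolerance for the guardrail. All of these are illustrative assumptions: the post tracks P75 for the goal metric, and a real experiment would use proper statistical tests rather than raw differences.

```python
from statistics import mean

def relative_change(control, treatment):
    """Relative change of the treatment mean versus the control mean."""
    base = mean(control)
    return (mean(treatment) - base) / base

def evaluate(control, treatment, max_eyeball_drop=0.05):
    """Check the goal metric (lower is better) against the guardrail (must not drop much)."""
    goal = relative_change(control["time_in_review_hours"],
                           treatment["time_in_review_hours"])
    guardrail = relative_change(control["eyeball_seconds"],
                                treatment["eyeball_seconds"])
    return {
        "goal_improved": goal < 0,                      # reviews got faster
        "guardrail_ok": guardrail >= -max_eyeball_drop  # reviewers still read the diff
    }

# Made-up samples (illustrative only).
control = {"time_in_review_hours": [10, 12], "eyeball_seconds": [100, 120]}
careful = {"time_in_review_hours": [8, 10], "eyeball_seconds": [98, 118]}
rubber_stamp = {"time_in_review_hours": [8, 10], "eyeball_seconds": [50, 60]}

careful_result = evaluate(control, careful)
rubber_stamp_result = evaluate(control, rubber_stamp)
```

Both hypothetical treatments speed up review, but only the first keeps Eyeball Time roughly level; the second is the rubber-stamping pattern the guardrail is meant to catch.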

Now we have established our goal metric, Time In Review, and our guardrail metric, Eyeball Time. What comes next?

Build, experiment, and iterate

Nearly every product team at Meta uses experimental and data-driven processes to release and iterate on features. However, this process is still very new to internal tools teams like ours. We’ve had to overcome a number of challenges (sample size, randomization, network effects) that product teams do not face. We address these challenges with new data foundations for running network experiments and with techniques that reduce variance and increase sample size. This extra effort is worth it: by laying the foundation of an experiment, we can later prove the impact and the effectiveness of the features we’re building.

The experimental process: The selection of goal and guardrail metrics is driven by the hypothesis we hold for the feature. We built the foundations to easily choose different experiment units to randomize treatment, including randomization by user clusters.
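Randomizing by user cluster can be sketched as follows, assuming each user maps to a cluster (for example, a group of people who frequently review each other's diffs) and that the cluster, not the individual, is hashed into an experiment arm. The cluster mapping, the experiment name, and the function names are hypothetical; the post does not describe how the real assignment is implemented.

```python
import hashlib

def assign_arm(unit_id, experiment, treatment_fraction=0.5):
    """Deterministically map an experiment unit to an arm by hashing its id."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "treatment" if bucket < treatment_fraction else "control"

# Hypothetical user-to-cluster mapping (illustrative only).
user_cluster = {"alice": "cluster_a", "bob": "cluster_a", "carol": "cluster_b"}

def assign_user(user, experiment):
    """Randomize at the cluster level so collaborators share the same arm."""
    return assign_arm(user_cluster[user], experiment)

alice_arm = assign_user("alice", "example_experiment")
bob_arm = assign_user("bob", "example_experiment")
```

Because the hash input is the cluster id, people in the same cluster always land in the same arm, which limits the network effects that arise when an author and a reviewer see different versions of a feature.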

Next reviewable diff

The inspiration for this feature came from an unlikely place — video streaming services. It’s easy to binge watch shows on certain streaming services because of how seamless the transition is from one episode to another. What if we could do that for code reviews? By queueing up diffs we could encourage a diff review flow state, allowing reviewers to make the most of their time and mental energy.

And so Next Reviewable Diff was born. We use machine learning to identify a diff that the current reviewer is highly likely to want to review. Then we surface that diff to the reviewer after they finish their current code review. We make it easy for reviewers to cycle through possible next diffs and to quickly remove themselves as a reviewer if a diff is not relevant to them.
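The selection step can be pictured with the sketch below. The `predicted_interest` scores stand in for the output of a machine learning model, and the diff ids and function name are hypothetical; this is not a description of the actual model or its features.

```python
def next_reviewable_diff(candidates, predicted_interest, dismissed=frozenset()):
    """Return the candidate the reviewer is most likely to want next, or None."""
    remaining = [d for d in candidates if d not in dismissed]
    return max(remaining, key=lambda d: predicted_interest[d], default=None)

# Made-up candidates and model scores (illustrative only).
candidates = ["D101", "D102", "D103"]
predicted_interest = {"D101": 0.35, "D102": 0.80, "D103": 0.55}

first = next_reviewable_diff(candidates, predicted_interest)
# The reviewer skips D102, so the next-best candidate is offered instead.
second = next_reviewable_diff(candidates, predicted_interest, dismissed={"D102"})
```

Passing the skipped diffs back in as `dismissed` is one simple way to model cycling through the queue.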

After its launch, we found that this feature resulted in a 17 percent overall increase in review actions per day (such as accepting a diff, commenting, etc.) and that engineers who use this flow perform 44 percent more review actions than the average reviewer!

Improving reviewer recommendations

The choice of reviewers that an author selects for a diff is very important. Diff authors want reviewers who will review their code well and quickly, and who are experts in the code their diff touches. Historically, Meta’s reviewer recommender looked at a limited set of data to make recommendations, leading to problems with new files and staleness as engineers changed teams.

We built a new reviewer recommendation system, incorporating work hours awareness and file ownership information. This lets us prioritize reviewers who are available to review a diff and who are more likely to be great reviewers for it. We also rewrote the model that powers these recommendations to support backtesting and automatic retraining.

The result? A 1.5 percent increase in diffs reviewed within 24 hours and an increase in top three recommendation accuracy (how often the actual reviewer is one of the top three suggested) from below 60 percent to nearly 75 percent. As an added bonus, the new model was also 14 times faster (P90 latency)!
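Top three recommendation accuracy, as defined above, can be computed with a few lines of code. The sketch assumes each sample pairs a ranked list of suggested reviewers with the person who actually reviewed the diff; the names and data are made up for illustration.

```python
def top_k_accuracy(samples, k=3):
    """Fraction of samples whose actual reviewer appears in the top k suggestions."""
    hits = sum(1 for recommended, actual in samples if actual in recommended[:k])
    return hits / len(samples)

# Each sample: (ranked suggestions, actual reviewer). Illustrative data only.
samples = [
    (["ana", "ben", "cho", "dev"], "ben"),  # hit: ranked second
    (["ana", "ben", "cho", "dev"], "dev"),  # miss: ranked fourth
    (["eli", "fay", "gus"], "eli"),         # hit: ranked first
    (["eli", "fay", "gus"], "gus"),         # hit: ranked third
]

accuracy = top_k_accuracy(samples)
```

Running the same function over historical diffs is one straightforward form of backtesting: a candidate model's suggestions are scored against who actually ended up reviewing each diff.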

Stale Diff Nudgebot

We know that a small proportion of stale diffs can make engineers unhappy, even if their diffs are reviewed quickly otherwise. Slow reviews have other effects too — the code itself becomes stale, authors have to context switch, and overall productivity drops. To directly address this, we built Nudgebot, which was inspired by research done at Microsoft.

For diffs that are taking an extra long time to review, Nudgebot determines the subset of reviewers that are most likely to review the diff. Then it sends them a chat ping with the appropriate context for the diff along with a set of quick actions that allow recipients to jump right into reviewing.
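The two steps above (find diffs that have waited too long, then pick the reviewers most likely to act) can be sketched as follows. The three-day threshold, the per-reviewer likelihood scores, the cap of two reviewers, and all field names are assumptions made for illustration; the post does not say how Nudgebot decides these.

```python
from datetime import datetime, timedelta

def pick_nudges(diffs, now, stale_after=timedelta(days=3), max_reviewers=2):
    """Map each stale diff id to the reviewers most likely to review it."""
    nudges = {}
    for diff in diffs:
        if now - diff["waiting_since"] < stale_after:
            continue  # not stale yet, so nobody is pinged
        ranked = sorted(diff["reviewer_scores"].items(),
                        key=lambda item: item[1], reverse=True)
        nudges[diff["id"]] = [name for name, _ in ranked[:max_reviewers]]
    return nudges

now = datetime(2021, 6, 10, 12, 0)
diffs = [
    {"id": "D1", "waiting_since": datetime(2021, 6, 5, 12, 0),
     "reviewer_scores": {"alice": 0.9, "bob": 0.2, "carol": 0.6}},
    {"id": "D2", "waiting_since": datetime(2021, 6, 9, 12, 0),
     "reviewer_scores": {"dana": 0.7}},
]

nudges = pick_nudges(diffs, now)
```

Capping the number of reviewers per diff keeps the pings targeted, so a stale diff reaches the people most likely to act rather than everyone on the reviewer list.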

Our experiment with Nudgebot had great results. The average Time In Review for all diffs dropped 7 percent (adjusted to exclude weekends) and the proportion of diffs that waited longer than three days for review dropped 12 percent! The success of this feature was also published separately.

This is what a chat notification about a set of stale diffs looks like to a reviewer, while showing one of the potential interactions of “Remind Me Later.”

What comes next?

Our current and future work is focused on questions like:

What is the right set of people to be reviewing a given diff?
How can we make it easier for reviewers to have the information they need to give a high quality review?
How can we leverage AI and machine learning to improve the code review process?

We’re continually pursuing answers to these questions, and we’re looking forward to finding more ways to streamline developer processes in the future!

Are you interested in building the future of developer productivity? Join us!