Minesweeper automates root cause analysis as a first-line defense against bugs

Root cause analysis (RCA) is an important part of fixing any bug. After all, you can’t solve a problem without getting to the heart of it. But RCA isn’t always simple, especially at a scale like Facebook’s. When billions of people are using an app on a variety of platforms and devices, a single bug can create several different issues on its own and multiple bugs can happen simultaneously.

There was a time when on-call engineers had to spend hours, or even days, manually combing through error reports, looking for patterns to help them debug. Today, they can use Minesweeper — a technique we’ve developed for automating RCA that identifies the causes of bugs based on their symptoms.

Minesweeper-based RCA is completely automated and scalable, and it’s grounded in formal statistical concepts. Our own evaluations of Minesweeper using real-world bug reports from Facebook’s apps have proven that it can perform RCA for tens of thousands of reports in minutes and can identify the root cause of bugs with 85 percent accuracy.

Since it was developed, Minesweeper has become Facebook’s first line of defense against bugs and has helped us prevent potentially wide-scale disruptions from affecting people on our apps.

How does Minesweeper work?

Anytime someone reports a bug through the Facebook app, the error reporting system typically captures a chronological trace of actions (or “events”) that person performed in the app prior to encountering the bug. The idea is to get a snapshot of what might have caused the error to happen, such as the example below:

Minesweeper scans these traces to look for distinctive patterns that could point to the cause of a bug. Traces that contain the bug (the test group) are compared with traces that do not (the control group). Minesweeper finds patterns of events that are statistically distinctive to the test group as opposed to the control group. These patterns are likely to be correlated to the bug and can thereby point toward its root cause.

Here’s an example. Suppose (hypothetically) that 10 people are using the Facebook app. Five of these people report a problem, and there are eight possible events tracked in the app (a, b, … through h). We have 10 traces in total on these events, five in the test group (T) from people who encountered the bug, and five in the control group (C) from the remaining people who did not.

When Minesweeper is applied to these traces, it begins by extracting sequential patterns in T and C. A sequential pattern is simply a chronological sequence of events that happened, but not necessarily one after another (i.e., there might be other events in between that were not significant).

We could visualize the data like this:

Events: { a, b, …, h }

Test group T	Control group C
t₁: a → b → c → d	t₆: a → b → d
t₂: a → b → c	t₇: a → c → d
t₃: b → c	t₈: a → c
t₄: e → f → g → h	t₉: f → g
t₅: e → g	t₁₀: e → f → h

An example of a pattern that the system can extract here is a → c. For each pattern, it computes the number of traces in which it appears in each group (T and C). In the example above, this pattern appears twice in T (t₁and t₂), and so its support in T is 2. In this case, its support in C is also 2 (t₇and t₈). Note that the pattern space is combinatorial in nature, and therefore it is crucial to employ algorithms that can search this space efficiently without an exponential blowup.

Once all patterns are extracted along with their supports in T and C, the system performs statistical isolation. For each pattern, P, it computes precision and recall using its support:

Informally, precision describes how accurate P is in detecting whether a given trace is in the test group rather than the control group, and recall describes how much of the test group P can cover. For example, the pattern b has a precision of 0.75 because it occurs in four traces in total, three of which are in the test group (i.e., it is 75 percent specific to the test group). It also has a recall of 0.6 because it occurs in three out of five traces in the test group (i.e., it covers 60 percent of the test group).

The harmonic mean of the two, 0.67, is the F1-score. The system computes the F1-score of all patterns in this manner and returns the list of all patterns ranked by F1-score, as shown in the table below.

In this example, the result shows that the pattern b → c is the highest ranked. In a real setting, an engineer debugging the reports can infer that events b and c, occurring in that order, are suspicious and deserve close inspection.

Automating RCA at scale

To make Minesweeper work at Facebook’s scale and complexity, we borrowed an idea from the data mining community — sequential pattern mining — to help track the order that events appear in traces.

At Facebook, it’s not unusual to encounter tens of thousands of traces related to a bug. To handle this, we leveraged the PrefixSpan algorithm, which is well known for being highly efficient at sequential pattern mining. To rank patterns according to their distinctiveness to the test group, thereby being useful for RCA, we utilized the aforementioned statistical approach based on the precision and recall of patterns. Finally, to make Minesweeper practical from a usability standpoint, we addressed several human-centric challenges, such as avoiding “redundant” patterns that are similar in explaining the root cause.

The figure below gives a high-level overview of the automated RCA architecture.

The impact of Minesweeper

Minesweeper has played an important role in helping Facebook’s engineers analyze and diagnose regressions — sudden spikes in a group of crash or bug reports — providing insights in minutes that once could have taken days to gather. With the complexity of Facebook’s apps and the product release cycle, it is common for multiple regressions to take place simultaneously, especially after a new version is released. But thanks to Minesweeper, engineers working on user-facing regressions can analyze them promptly, making it easier than ever for them to reduce and mitigate disruptions to Facebook’s services.

More technical information is available in our paper “Scalable statistical root cause analysis on app telemetry.”