Mitigating the effects of silent data corruption at scale

What the research is: 

Silent data corruption, or data errors that go undetected by the larger system, is a widespread problem for large-scale infrastructure systems. This type of corruption can propagate across the stack and manifest as application-level problems. It can also result in data loss and can take months to debug and resolve. This work describes best practices for detecting and remediating silent data corruptions across a fleet of hundreds of thousands of machines.

In our paper, we examine common defect types observed in CPUs, using a real-world example of silent data corruption within a data center application that led to missing rows in a database. We map out a high-level debug flow to determine the root cause and triage the missing data.

We determine that reducing silent data corruption requires not only hardware resiliency and production detection mechanisms, but also robust fault-tolerant software architectures.

How it works: 

Silent errors can occur in any of the functions performed within a data center CPU. We describe one example in detail to illustrate the debug methodology and our approach to tackling these errors across our large fleet. In a large-scale infrastructure, files are typically compressed when they are not being read and decompressed when a request is made to read the file. Millions of these operations are performed every day. In this example, we focus mainly on the decompression path.

Before decompression is performed, the file size is checked to confirm that it is greater than 0; a valid compressed file with contents would have a nonzero size. In our example, a file with a nonzero size was provided as input to the decompression algorithm, yet when the file size was computed, the computation returned 0. Because the computed size was 0, the file was not written into the decompressed output database.
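As a rough illustration, here is a minimal, self-contained Scala sketch of that gating logic. The post does not specify the compression format or the pipeline's interfaces, so GZIP and the helper names (computeDecompressedSize, writeToOutputDatabase, decompressIfNonEmpty) below are illustrative assumptions, not the production code.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

object DecompressSketch {
  // Hypothetical size computation standing in for the pipeline's real one;
  // on the defective core, this kind of computation returned 0 for a nonzero file.
  def computeDecompressedSize(compressed: Array[Byte]): Long = {
    val in = new GZIPInputStream(new ByteArrayInputStream(compressed))
    Iterator.continually(in.read()).takeWhile(_ != -1).length.toLong
  }

  def decompress(compressed: Array[Byte]): Array[Byte] = {
    val in = new GZIPInputStream(new ByteArrayInputStream(compressed))
    Iterator.continually(in.read()).takeWhile(_ != -1).map(_.toByte).toArray
  }

  // Stand-in for writing the decompressed contents to the output database.
  def writeToOutputDatabase(contents: Array[Byte]): Unit =
    println(s"wrote ${contents.length} bytes")

  def decompressIfNonEmpty(compressed: Array[Byte]): Unit = {
    val size = computeDecompressedSize(compressed)
    if (size > 0) {
      // Normal path: the file has contents, so decompress and store it.
      writeToOutputDatabase(decompress(compressed))
    }
    // If the size computation silently returns 0 for a nonzero file,
    // this branch is skipped and the file never reaches the database.
  }

  def main(args: Array[String]): Unit = {
    val out = new ByteArrayOutputStream()
    val gz = new GZIPOutputStream(out)
    gz.write("hello".getBytes("UTF-8"))
    gz.close()
    decompressIfNonEmpty(out.toByteArray)
  }
}
```

On a healthy machine this prints the decompressed byte count; if the size computation silently returned 0, the write would simply be skipped with no error raised, which matches the missing-rows symptom described above.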

In these seemingly random scenarios, decompression was skipped even though the file size was nonzero, so the database that relied on the actual contents of the file ended up with missing files. Files with blank contents and/or incorrect sizes then propagate to the application. An application that keeps a list of key-value store mappings for compressed files immediately observes that some compressed files are no longer recoverable. This chain of dependencies causes the application to fail, and eventually the querying infrastructure reports data loss after decompression. The complexity is magnified because the failure occurs only occasionally when engineers schedule the same workload on a cluster of machines.

Example of silent data corruption

Detecting and reproducing this scenario in a large-scale environment is very complex. In this case, the reproducer at the multi-machine querying infrastructure level was reduced to a single-machine workload. From the single-machine workload, we identified that the failures were truly sporadic in nature. The workload was multithreaded, and upon single-threading it, the failure was no longer sporadic but consistent for a certain subset of data values on one particular core of the machine. The sporadic behavior associated with multithreading was eliminated, but the sporadic behavior associated with the data values persisted. After a few iterations, it became obvious that the computation of

Int (1.1^53) = 0

using the math.pow function in Scala, will always produce a result of 0 on Core 59 of the CPU. However, if the computation is changed to

Int (1.1^52) = 142

the result is accurate.
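A minimal Scala reproducer along these lines (the exact internal script is not published, so this is only a sketch) would be:

```scala
object PowReproducer {
  def main(args: Array[String]): Unit = {
    // On a healthy core, 1.1^53 is roughly 156.2, so the Int result should be 156.
    // On the defective Core 59, this computation returned 0.
    val suspect = math.pow(1.1, 53).toInt
    // The neighboring exponent computed correctly, even on the defective core.
    val control = math.pow(1.1, 52).toInt // expected 142
    println(s"Int(1.1^53) = $suspect, Int(1.1^52) = $control")
  }
}
```

Pinning such a reproducer to a specific core (for example, with an OS-level affinity tool) is what localizes the failure to Core 59; the sketch itself only exercises the computation.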

Diagram documenting the root-cause flow for silent data corruption

The diagram above documents the root-cause flow. The corruption also affects calculations whose correct results are nonzero. For example, the following incorrect computations were performed on the machine identified as defective (a small self-check sketch using these values follows the examples). The corruption affected both positive and negative powers for specific data values; in some cases the result was nonzero when it should have been zero, and incorrect values were obtained with varying degrees of precision.

Example errors:

Int [(1.1)^3] = 0, expected = 1

Int [(1.1)^107] = 32809, expected = 26854

Int [(1.1)^-3] = 1, expected = 0
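These known-bad inputs could be folded into a simple self-check along the following lines; the exponents and expected values are taken from the examples above, while the structure and reporting are illustrative assumptions rather than the production reproducer script:

```scala
object PowSelfCheck {
  // (base, exponent, expected Int result) drawn from the values quoted above.
  val cases: Seq[(Double, Int, Int)] = Seq(
    (1.1, 52, 142),    // computed correctly even on the defective core
    (1.1, 3, 1),       // defective core returned 0
    (1.1, 107, 26854), // defective core returned 32809
    (1.1, -3, 0)       // defective core returned 1
  )

  def main(args: Array[String]): Unit = {
    val failures = cases.filter { case (base, exp, expected) =>
      math.pow(base, exp).toInt != expected
    }
    if (failures.isEmpty) {
      println("PASS: all power computations match expected values")
    } else {
      failures.foreach { case (base, exp, expected) =>
        println(s"FAIL: Int($base^$exp) = ${math.pow(base, exp).toInt}, expected $expected")
      }
      sys.exit(1) // a nonzero exit lets a fleet scanner flag the machine
    }
  }
}
```

Run periodically across a fleet, a check like this is cheap enough to flag machines whose cores silently miscompute these values.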

For an application, this results in decompressed files that are incorrect in size and truncated without an end-of-file (EoF) terminator. This leads to dangling file nodes, missing data, and no traceability of the corruption within the application. The intrinsic dependency on the core, as well as on the data inputs, makes these types of problems computationally hard to detect and root-cause without a targeted reproducer. This is especially challenging when there are hundreds of thousands of machines performing a few million computations every second.

After we integrated the reproducer script into our detection mechanisms, additional machines were flagged for failing it. Multiple software and hardware resilience mechanisms were put in place as a result of these investigations.

Why it matters: 

Silent data corruptions are becoming a more common phenomenon in data centers than previously observed. In our paper, we present an example that illustrates one of the many scenarios that can be encountered in dealing with data-dependent, elusive, and hard-to-debug errors. Multiple strategies of detection and mitigation bring additional complexity to large-scale infrastructure. A better understanding of these corruptions helps us increase the fault tolerance and resilience of our software architecture. Together, these strategies help us build the next generation of infrastructure computing to be more reliable.

Read the full paper:

Silent data corruptions at scale
