What the research is:
The Ribbon filter is a new data structure that is more space-efficient than the popular Bloom filters that are widely used for optimizing data retrieval. One of the ways that Bloom, and now Ribbon, filters solve real engineering problems is by providing smooth configurability unmatched by other filters. Bloom filters work by overapproximating a set of keys associated with some data resource. With a Bloom filter, almost all negative queries to that resource can be skipped (filtered) because the Bloom filter rejects almost all query keys not associated with the resource.
With proper data layout and design, the Ribbon filter is the first Bloom alternative to match the near-continuous, hazard-free, accuracy-versus-space trade-off provided by Bloom filters.
Here, near-continuous means efficiently utilizing any amount of memory to represent any number of keys, so that wasted memory such as in internal fragmentation can be minimized to zero. The typical hazard to the accuracy-versus-space trade-off is bit alignment, where some data sizes (e.g., 4, 8, or 16 bits per key) are faster to access than others. Like Bloom, our data layout for Ribbon does not suffer this hazard in access times so it is more freely configurable. And Ribbon filters add new freedom of configurability to the space-versus-time trade-off.
Building on some prior lines of research, the Ribbon filter combines a simplified, faster, and more flexible construction algorithm; a data layout optimized for filter queries; and near-continuous configurability to make a practical alternative to static (immutable) Bloom filters.
While well-engineered Bloom filters are extremely fast, they use roughly 50 percent more space (overhead) than the information-theoretic lower bound for filters on arbitrary keys. When Bloom filters cannot meet an application’s space efficiency targets, Ribbon filter variants dominate in space-versus-time trade-offs with near continuous configurability and space overhead as low as 1 percent or less. Ribbon filters have O(1) query times and save roughly 1/3 of memory compared with Bloom filters.
How it works:
Like some related immutable structures used for perfect hashing and maps, Ribbon filters are constructed by solving a linear system given by hash functions applied to a set of keys. Each row in the linear system expresses that querying as some key, which involves XOR-ing the values at some set of array indices, must yield a prescribed value to indicate it is “in” the set of keys.
Despite using Boolean — GF(2) — arithmetic, the approach to solving this logical system of equations is to use Gaussian elimination, which fundamentally means subtracting equations from one another until you can isolate variables (unknowns). If a solution exists, this approach will find it.
The name Ribbon has dual meanings. First, we use a linear system from Dietzfelbinger and Walzer whose sorted coefficient matrix resembles a physical ribbon, or a wavy approximation of a band matrix. Gaussian elimination is fundamentally more efficient on this system because it is already close to a reduced form.
Ribbon also stands for Rapid Incremental Boolean Banding ON the fly, which is the name of our fast and flexible new Gaussian solver. Through an approach resembling insertion into a linear-probed hash table, Ribbon does Gaussian elimination on the fly. This saves time and space in construction because row reductions can be done in registers rather than in memory, and because the reduced form of the ribbon coefficient matrix — a band matrix — is more space-efficient than explicitly representing the ribbon form. On-the-fly construction also unlocks a solution to the core challenge of the Ribbon approach: scaling its space efficiency to very large numbers of keys.
Why it matters:
At Facebook’s scale, we expect Ribbon filters to save several percent of RAM resources, with a tiny increase in CPU usage for some major storage systems. However, we do not implement efficiency gains at all engineering costs, so it’s also important to have a user-friendly data structure. This issue stalled implementation of other Bloom alternatives offering some space savings.
The Ribbon filter opens these new trade-offs without introducing notable discontinuities or hazards in the configuration space. In other words, there is some complexity to make Ribbon filters general and highly configurable, but these details can be hidden behind a relatively simple API. You have essentially free choice over any three of the four core performance dimensions — number of keys added to the set, memory usage, CPU efficiency, and accuracy — and the accuracy is automatically well optimized.