Cassandra – A structured storage system on a P2P Network

When I joined Facebook I was eagerly looking forward to a new challenge. Fortunately, Facebook cannot be accused of a lack of challenging assignments. Prashant Malik,a colleague in Facebook from the Search team, was thinking about how to solve the Inbox Search problem. This challenge is about storing reverse indices of Facebook messages that Facebook users send and receive while communicating with their friends on the Facebook network. The amount of data to be stored, the rate of growth of the data and the requirement to serve it within strict SLAs made it very apparent that a new storage solution was absolutely essential. The solution needed to scale incrementally and in a cost effective fashion. Traditional data storage solutions just wouldn’t fit the bill. The aim was to design a solution that not only solved the Inbox Search problem but also provided a system as a storage infrastructure for many problems of the same nature. Hence was born Cassandra. To keep up with Facebook tradition, Prashant and I started the implementation of Cassandra about a year ago in one of our Hackthons.

Cassandra is a distributed storage system for managing structured data that is designed to scale to a very large size across many commodity servers, with no single point of failure. Reliability at massive scale is a very big challenge. Outages in the service can have significant negative impact. Hence Cassandra aims to run on top of an infrastructure of hundreds of nodes (possibly spread across different datacenters). At this scale, small and large components fail continuously; the way Cassandra manages the persistent state in the face of these failures drives the reliability and scalability of the software systems relying on this service. Cassandra has achieved several goals – scalability, high performance, high availability and applicability. In many ways Cassandra resembles a database and shares many design and implementation strategies with databases. Cassandra does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format. The rest of the material talks about the data model and the distributed properties, provided by the system.

Data Model

Every row is identified by a unique key. The key is a string and there is no limit on its size.
An instance of Cassandra has one table which is made up of one or more column families as defined by the user.
The number of column families and the name of each of the above must be fixed at the time the cluster is started. There is no limitation the number of column families but it is expected that there would be a few of these.
Each column family can contain one of two structures: supercolumns or columns. Both of these are dynamically created and there is no limit on the number of these that can be stored in a column family.
Columns are constructs that have a name, a value and a user-defined timestamp associated with them. The number of columns that can be contained in a column family is very large. Columns could be of variable number per key. For instance key K1 could have 1024 columns/super columns while key K2 could have 64 columns/super columns.
“Supercolumns” are a construct that have a name, and an infinite number of columns assosciated with them. The number of “Supercolumns” associated with any column family could be infinite and of a variable number per key. They exhibit the same characteristics as columns.

Distribution, Replication and Fault Tolerance

Data is distributed across the nodes in the cluster using Consistent Hashing based and on an Order Preserving Hash function. We use an Order Preserving Hash so that we could perform range scans over the data for analysis at some later point.
Cluster membership is maintained via Gossip style membership algorithm. Failures of nodes within the cluster are monitored using an Accrual Style Failure Detector.
High availability is achieved using replication and we actively replicate data across data centers. Since eventual consistency is the mantra of the system reads execute on the closest replica and data is repaired in the background for increased read throughput.
System exhibits incremental scalability properties which can be achieved as easily as dropping nodes and having them automatically bootstrapped with data.

First deployment of Cassandra system within Facebook was for the Inbox search system. The system currently stores TB’s of indexes across a cluster of 600+ cores and 120+ TB of disk space. Performance of the system has been well within our SLA requirements and more applications are in the pipeline to use the Cassandra system as their storage engine. A beta version of Cassandra has been open sourced and can be found here. Systems of this nature are never really done. One starts by building certain core features that are absolutely required, achieves to get them right and proceeds to build the more fancy features. We have certain core features that needed to be built and have a lot more planned in the road ahead. If one is interested in solving hard distributed systems problems which affects the lives of millions of our users, feel free to contact us at Facebook