This article will introduce you to Apache Cassandra by talking about its characteristics, components and use cases.
What is Apache Cassandra?
SQL databases (such as Oracle, SQL Server, MySQL etc.) have evolved in terms of features, maturity and stability over last few decades. These databases have been instrumental in success of many applications dealing with structured data such as ERP Data, census records, library catalogues etc.
However, in recent years, companies are utilizing unstructued data (images, audio, web logs, emails etc.) to derive and enhance their businesses due to advent of social networking and reduced cost of computing resources such as CPU and memory. This unstructred data does not conform to any pre-defined schema and dynamic in nature. Additionally, this data is so huge (in Terabytes or Petabytes) and therfore very expensive to be processed on server machines, as it requires costly, dedicated and sophisticated hardware.
This is where NoSQL databases come to rescue and make it possible to manage this huge unstructured data. NoSQL databases have capability to store the data on multiple machines and response to the client queries in few miliseconds. Since these databases use multiple machines for storing the data, there is no need for costly and sophisticated hardware. This helps companies manager their data in a cost effective manner by letting them use commodity hardware.
NoSQL databases broadly come in following four flavors:
- Key-value databases - These databases allow data to be saved into key-value parameters. However, key and value could be of any type such as emails, text, image bytes etc. These databases are quite useful for building caches. Some of the examples of these databases are Aerospike and Redis stores.
- Column-family databases - These databases manage data in rows and columns like SQL databases. However, unlike SQL, rows in different databases can have different columns. E.g. In a table, row1 could have 2 columns while row2 could have 3 columns. This makes it easy to have different schema for different rows. Some of examples of these databases are Apache Cassandra and Apache HBase.
- Document databases - These databases store data in form of documents often in form of JSON documents. This provides greater flexibility as it is possible to have completely different structures in different documents of a collection (table). Some of examples of these databases are MongoDB and Couchbase database.
- Graph databases - These databases are useful to model the relationships between entities are handy for storing people relationships and network graphs. One of such database is Neo4j.
Apache Cassandra is an open source column-family NoSQL database that helps us achieve high performance coupled with high availability and scalability. We will be discussing its characteristics and components in following sections in detail.
Characteristics of Apache Cassandra
Here are some of the characterisitics of Apache Cassnadra that make it worthy to be utilized for managing huge structured and unstructured data:
- Column-Family NoSQL Database - Apache Cassandra manages data in rows and columns like SQL databases. However, unlike SQL, rows in different databases can have different columns. E.g. In a table, row1 could have 2 columns while row2 could have 3 columns. This makes it easy to have different schema for different rows.
- Decentralized - Apache Cassandra, unlike SQL databases, is not centralized into one machine. It instead utilizies concept of distributed storage by distributing the data on to multiple machines.
- Linear Scalability - One of the biggest factors to choose a NoSQL database is linear scalability. Databases with linear scalability can handle more data in proportion to number of machines added to its cluster. E.g. If it can handle 200,000 requests per second with 2 machines cluster, it will be able to handle 300,000 requests per second by adding one more machine to cluster. This behaviour makes scalability predictable and utilizes hardware resources efficiently. Apache Cassandra happens to provide linear scalability and hence best choice in terms of scalability.
- Sharding - This characteristics is similar to decentralized concept and talks about dividing data into different machines. These separate machines hosting part of data are called Shards. This characteristic makes Apache Cassandra ideal to handle huge amount of data as we can theoritcally handle any amount of data by adding more machines to our cluster.
- Fault Tolerant - Fault tolerance is another important characteristic as it prevents data loss in case of machine failures. Apache Cassandra uses the concept of replication in order to be fault-tolerant. It creates multiple copies of data and put these into differnet machines in different data centres as possible. This requires more storage but avoids data loss even in cases where a whole data centre has gone down.
- Durability - Apache Cassandra also provides data durability as it persists all the data in a structure called Commit Log on file system before actually flushing data to tables. It means that it does not loose data on restart of machines or whole cluster.
- High Performance - Apache Cassandra provides amazing performance due to its append-only sequential I/O data structure, It does not require to update the records by seeking on disk as disk seeking times are quite high. It instead flushes the batch of updates from memory to disk as immutable records.
- Configurable Replication Mode and Placement Strategy - Apache Cassandra allows to configure total number of data replicas in a cluster along with Replica placement strategy. Replica placement strategy helps to decide whether Replicas should be put in same data centre or in different data centers.
- Proprietary Query Language - Apache Cassandra also comes with its own proprietray query language called Cassandra Query Language (CQL). It's syntax is quite similar to SQL and is easily accessible from Cassandra Shell. This characteristic is quite useful for people who are used to working with SQL.
Components of Apache Cassandra
After characteristics, let's now look at various components of Apache Cassandra in order to understand its working in better way:
- Node/Coordinator - This refers to a physical machine running Apache Cassandra instance. All the data is stored on these node machines. These nodes also act as coordinator as these pass on the client requirements and state information received from previous node to next node.
- Data Center - This refers to physical place where computer and their storage systems are placed. It generally includes redundant or backup power supplies. In Cassandra context, a data center will host a collection of nodes containing the data.
- Cluster and Ring Topology - Cluster is a collection of multple nodes hosted in one or multiple data centers. Apache Cassandra cluster follows Ring topology as all nodes are connected to two other nodes forming a Ring as shown in above diagram.
- Virtual Nodes (Vnodes) - Virtual nodes is a recent innovation by Apache Cassandra in order to leverage heterogenous hardware and minimize movement of data when nodes are added and removed from cluster. A Cassandra Node normally has multiple Vnodes and each of these Vnodes is responsible for managing the data for a particular partition key value. For example, one machine with 2 Cores and 4 GB could have 2 virtual nodes while other machine with 4 Cores and 8 GB could have 4 virtual nodes. This will allow second machine to manager double data than the first one and hence helps utilize the full potential of hardware resources.
- Commit Log - This is a temporary log maintained by Cassandra to store all the writes to tables. This ensures that there is no data loss if a machine is restarted due to some unexpected scenarion. Cassandra keeps the updates in memory (memtables) and flushes these to actual tables periodically as immutable records. Once this in-memory table data is flushed, corresponding data from commit log is also removed.
- Table and SSTable - Table is simply an ordered collection of columns for each row. Row typically consists of columns along with a primary key. Table is more of a logical structure that client works with. On the other hand, SSTable (Sorted String Table) represents data file of immutable records to which Apache Cassandra flushes in-memory tables periodically. SSTables are maintained for each of Cassandra tables.
- Partitioner - A partitioner is responsible for assigning the data to particular nodes. It also takes care of assinging other machines for replication of data. Each row of data is uniquely identified by a primary key, that may be same as partition key. A partitioner is basically a hash function that dervies the tokens from partition key of a row. Partitioner uses these tokens to identify nodes as each of the nodes are responsible for managing the data for few token. Nodes manage the data of multiple tokens by assigning token to their virtual nodes. In order to configure this, we need to set num_tokens value while setting up partitioner in cluster. There are following three types of partitioners in Cassandra:
- Murmur3Partitioner - This partitioner uses Murmur hash function to derive token from partition key of a row. This partitioner uniformly distributes data across all the nodes of a cluster. This is also default partitioner and works well for almost all the scenarios.
- RandomPartitioner - This partitioner uses MD5 hash function but is comparatively slower than Murmur hash. Same as Murmur3 partitioner, this partitioner uniformly distributes data across all the nodes of a cluster.
- ByteOrderedPartitioner - This partitioner makes it possible to keep an ordered distribution of data lexically by key bytes. However, this is difficult to manage and also does not evenly load balance and dstribute data uniformly in case of multiple tables.
- Gossip - This is a protocol used by nodes to communicate rack, data center and state information. As the name suggests, it works like Gossip wherein each node passes this information to next node, eventually making this information available to all nodes in cluster.
- Snitch - Snitch is a mechanism for Cassandra nodes to define which racks and data centers these belong to. This information is quite useful for replication strategy to place relicas in different racks and data centers. Snitch needs to be configured at the time of creation of cluster. If not configured, SimpleSnitch is used that does not recognise racks and data centers. However, in production, GossipingPropertyFileSnitch is recommended that picks up node's data center and rack information from property file and passes this information to all the machines in cluster using Gossip protocol.
Use Cases of Apache Cassandra
Apache Cassandra can be used for any use case that demands massive scalability, an always on architecture, high performance, strong security, and ease of management, to name a few. Here are some of high level domain use cases of Apache Cassandra:
- Product Catalog Management
- Playlist Management
- Recommendation Engine Data Management
- Fraud Detection
- Messaging Data Management
- IOT / Sensor Data Management
Thank you for reading through the tutorial. In case of any feedback/questions/concerns, you can communicate same to us through your comments and we shall get back to you as soon as possible.