Cassandra lowers the barriers to big data

Rick Grehan | March 25, 2014
Apache Cassandra is a free, open source NoSQL database designed to manage very large data sets (think petabytes) across large clusters of commodity servers. Among many distinguishing features, Cassandra excels at scaling writes as well as reads, and its "master-less" architecture makes creating and expanding clusters relatively straightforward. For organizations seeking a data store that can support rapid and massive growth, Cassandra should be high on the list of options to consider.

Because Cassandra is distributed, a cluster's members require a mechanism for discovering one another and communicating state information. This is where Cassandra's Gossip protocol comes in. As you might suspect, Gossip gets its name from the human activity of passing information throughout a group via apparently random, person-to-person conversations.

Certain nodes in a cluster are designated as "seed" nodes. Each second, a timer on a Cassandra node fires, initiating communication with two or three randomly selected nodes in the cluster, one of which must be a seed node. Consequently, seed nodes will tend to have the most up-to-date view of a cluster. (When a new node is added to a cluster, it first contacts a seed node.)
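The selection step described above can be sketched in a few lines of Python. This is only an illustration of the article's description (one seed plus random peers, up to three targets per round), not Cassandra's actual gossip code; the function name, the `fanout` parameter, and the node labels are all hypothetical.

```python
import random

def pick_gossip_targets(me, nodes, seeds, rng=random, fanout=3):
    """Choose peers for one gossip round: one seed (if any) plus random others."""
    peers = [n for n in nodes if n != me]          # never gossip with ourselves
    if not peers:
        return []
    targets = []
    seed_peers = [n for n in peers if n in seeds]
    if seed_peers:
        # Ensure at least one seed node is contacted each round.
        targets.append(rng.choice(seed_peers))
    others = [n for n in peers if n not in targets]
    extra = min(fanout - len(targets), len(others))
    targets.extend(rng.sample(others, extra))      # fill up with random peers
    return targets

# One round for node "n3" in a five-node cluster with two seeds.
print(pick_gossip_targets("n3", ["n1", "n2", "n3", "n4", "n5"], {"n1", "n2"}))
```

In a real node this function would be called from a timer that fires once per second; the timer is left out here to keep the sketch short.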

Cassandra works to keep Gossip communication efficient. Each node maintains two kinds of state. HeartBeatState tracks the node's version number, which is incremented whenever information on the node changes, and how many times the node has been restarted. ApplicationState tracks the operational state of the node (such as the current load). Nodes exchange digests of HeartBeatState information with one another. If differences are found, the nodes then exchange digests of ApplicationState info, and ultimately the ApplicationState data itself. In addition, the Gossip algorithm first seeks to resolve the differences that are "farther apart" (in terms of version numbers), since those are likely to represent the widest inconsistencies.
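To make the digest exchange concrete, here is a rough Python sketch of the comparison step. It assumes a digest is simply a (generation, version) pair per endpoint, where "generation" stands for the restart counter; the class and function names are hypothetical, and the real protocol's message format is not reproduced here.

```python
from dataclasses import dataclass, field

@dataclass
class EndpointState:
    generation: int   # changes when the node restarts (HeartBeatState)
    version: int      # incremented when information on the node changes
    app_state: dict = field(default_factory=dict)  # e.g. {"LOAD": "1.2GB"}

def make_digests(view):
    """Summarize a node's view as small (generation, version) pairs."""
    return {ep: (s.generation, s.version) for ep, s in view.items()}

def diff_digests(local_view, remote_digests):
    """Return (need_from_remote, send_to_remote), widest version gap first."""
    local = make_digests(local_view)
    need, send = [], []
    for ep in set(local) | set(remote_digests):
        mine = local.get(ep, (0, 0))
        theirs = remote_digests.get(ep, (0, 0))
        if mine == theirs:
            continue  # digests match, so no full state needs to be exchanged
        # A different generation is treated as the largest possible gap.
        gap = abs(theirs[1] - mine[1]) if mine[0] == theirs[0] else float("inf")
        (need if theirs > mine else send).append((gap, ep))
    order = lambda pairs: [ep for _, ep in sorted(pairs, key=lambda p: (-p[0], p[1]))]
    return order(need), order(send)

local = {"a": EndpointState(1, 5), "b": EndpointState(1, 10),
         "c": EndpointState(2, 1), "f": EndpointState(1, 2)}
remote = {"a": (1, 9), "b": (1, 10), "c": (1, 50), "f": (1, 30)}
print(diff_digests(local, remote))
```

Only the endpoints whose digests differ would go on to exchange ApplicationState data, which is what keeps the routine traffic small.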

Working with Cassandra

RDBMS users familiar with SQL should feel right at home with CQL, the Cassandra Query Language, which can be executed from the Python-based Cassandra shell utility (cqlsh) or through any of several client drivers. Client drivers are available from websites like Planet Cassandra, where you'll find CQL-enabled drivers for Java, C#, Node.js, PHP, and others.

In the past, drivers communicated with a Cassandra cluster using a Thrift API. (Thrift is a framework for creating what amounts to language-independent remote procedure calls between client and server.) Cassandra's Thrift API is now considered a legacy feature, as the CQL specification defines not only the CQL language, but an on-the-wire communication protocol as well.

CQL's syntax resembles its relational cousin's. It has SELECT, INSERT, UPDATE, and DELETE statements, and these are accompanied by FROM and WHERE clauses. CQL's data types are also what you would expect: integers, floats and doubles, blobs, and more. Of course, there are differences. For one, CQL has no JOIN operation. And when you write a FROM clause, you specify column families, although as of the latest version of CQL the term "table" is used in place of "column family." CQL also lets you specify the desired consistency level for any operation, but its real benefit is that it is a data management language that relational programmers can grasp quickly, and that it is independent of any specific programming API.
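For a feel of the syntax, the Python sketch below holds a handful of CQL statements and passes them to a session object. The keyspace and table names ("demo", "users") are made up for illustration, and the RecordingSession class is a hypothetical stand-in so the example runs without a cluster; a real client driver session (for instance the one in the DataStax Python driver) exposes a similar execute method.

```python
# Hypothetical keyspace and table; only the CQL keywords come from the article.
CQL_STATEMENTS = [
    "CREATE TABLE IF NOT EXISTS demo.users ("
    "user_id int PRIMARY KEY, name text, score double, avatar blob)",
    "INSERT INTO demo.users (user_id, name, score) VALUES (1, 'Ada', 9.5)",
    "SELECT name, score FROM demo.users WHERE user_id = 1",
    "UPDATE demo.users SET score = 9.8 WHERE user_id = 1",
    "DELETE FROM demo.users WHERE user_id = 1",
]

def run_all(session, statements=CQL_STATEMENTS):
    """Execute each statement through any object exposing execute(str)."""
    return [session.execute(s) for s in statements]

class RecordingSession:
    """Stand-in for a driver session; it only records what it is given."""
    def __init__(self):
        self.seen = []

    def execute(self, statement):
        self.seen.append(statement)
        return statement.split()[0]   # echo the leading keyword

session = RecordingSession()
print(run_all(session))
```

Note that each statement targets a single table. With no JOIN available, related data is typically stored together in one table rather than combined at query time.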

