HBase's data model
HBase organizes data differently from traditional relational databases, supporting a four-dimensional data model in which each "cell" is represented by four coordinates:
- Row key: Each row has a unique row key that is represented internally by a byte array, but does not have any formal data type.
- Column family: The data contained in a row is partitioned into column families; each row has the same set of column families, but within a given column family, different rows do not need to contain the same set of column qualifiers. You can think of column families as being similar to tables in a relational database.
- Column qualifier: These are similar to columns in a relational database.
- Version: Each column can have a configurable number of versions. If you request the data contained in a column without specifying a version, you receive the latest version, but you can request older versions by specifying a version number (by default, HBase uses a timestamp as the version).
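To make the four coordinates concrete, here is a minimal in-memory sketch in Python. It is not the HBase client API: the `TinyTable` class, its `put` and `get` methods, and the sample values are hypothetical names made up for illustration, and versions are plain integers rather than timestamps.

```python
from typing import Dict, Optional, Tuple

# Hypothetical stand-in for an HBase table (not the real client API).
# Each cell is addressed by (row key, column family, column qualifier, version).
CellKey = Tuple[bytes, str, str, int]


class TinyTable:
    def __init__(self) -> None:
        self._cells: Dict[CellKey, bytes] = {}

    def put(self, row: bytes, family: str, qualifier: str,
            version: int, value: bytes) -> None:
        """Store one value at the given four coordinates."""
        self._cells[(row, family, qualifier, version)] = value

    def get(self, row: bytes, family: str, qualifier: str,
            version: Optional[int] = None) -> Optional[bytes]:
        """Return the requested version, or the latest one if none is given."""
        versions = [v for (r, f, q, v) in self._cells
                    if (r, f, q) == (row, family, qualifier)]
        if not versions:
            return None
        if version is None:
            version = max(versions)  # no version requested: use the newest
        return self._cells.get((row, family, qualifier, version))


table = TinyTable()
table.put(b"row-1", "cf", "col", 1, b"old value")
table.put(b"row-1", "cf", "col", 2, b"new value")

latest = table.get(b"row-1", "cf", "col")            # b"new value"
older = table.get(b"row-1", "cf", "col", version=1)  # b"old value"
```

Row keys are bytes here to mirror the untyped byte array described above; families and qualifiers are strings only to keep the sketch readable.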
Figure 1 shows how these four coordinates are related.
Figure 1. HBase data model
The model in Figure 1 shows that a row is composed of a row key and an arbitrary number of column families. To continue the relational analogy, each row key is associated with a collection of "rows in tables," each of which has its own columns. While each table (column family) must exist, the columns within it may differ from row to row. Each column family has a set of columns, and each column has a set of versions that map to the actual data in the row.
If we were modeling a person, the row key might be the person's social security number (to uniquely identify them), and we might have column families like address, employment, education, and so forth. Inside the address column family we might have street, city, state, and zip code columns, and each version might correspond to where the person lived at any given time. The latest version might list the city "Los Angeles," while the previous version might list "New York." You can see this example model in Figure 2.
Figure 2. Person model in HBase
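The person model can also be pictured as nested maps: row key, then column family, then column qualifier, then version. The Python sketch below is only an illustration of that shape. The `people` dictionary, the `latest` helper, the social security numbers, and all values other than the two cities are made up.

```python
# Hypothetical nested-map view of the person model:
# row key -> column family -> column qualifier -> {version: value}
people = {
    "123-45-6789": {  # made-up social security number used as the row key
        "address": {
            "city": {1: "New York", 2: "Los Angeles"},
            "state": {1: "NY", 2: "CA"},
        },
        "employment": {"employer": {1: "Example Corp"}},
        "education": {},
    },
    "987-65-4321": {  # same column families, different qualifiers
        "address": {"zip": {1: "10001"}},
        "employment": {},
        "education": {"degree": {1: "BSc"}},
    },
}


def latest(table, row_key, family, qualifier):
    """Return the value with the highest version number for one column."""
    versions = table[row_key][family][qualifier]
    return versions[max(versions)]


current_city = latest(people, "123-45-6789", "address", "city")  # "Los Angeles"
previous_city = people["123-45-6789"]["address"]["city"][1]      # "New York"
```

Both rows carry the same three column families, but the qualifiers inside each family differ, which is the flexibility described for Figure 1.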
In sum, HBase is a column-oriented database that represents data in a four-dimensional model. It is built on top of the Hadoop Distributed File System (HDFS), which partitions data across potentially thousands of commodity machines. Developers using HBase can access data directly by accessing a row key, by scanning across a range of row keys, or by using batch processing via MapReduce.
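The first two access patterns, direct lookup by row key and a scan across a range of row keys, can be sketched as follows. The sketch assumes rows are kept sorted by the bytes of their row keys and that a scan includes the start key but excludes the stop key. The `SortedRows` class and the sample keys are hypothetical and are not part of any HBase API.

```python
import bisect
from typing import Dict, Iterator, List, Optional, Tuple


class SortedRows:
    """Hypothetical in-memory table whose row keys are kept in byte order."""

    def __init__(self) -> None:
        self._keys: List[bytes] = []        # sorted row keys
        self._rows: Dict[bytes, dict] = {}  # row key -> row data

    def put(self, key: bytes, row: dict) -> None:
        if key not in self._rows:
            bisect.insort(self._keys, key)  # keep the key list sorted
        self._rows[key] = row

    def get(self, key: bytes) -> Optional[dict]:
        """Direct access by row key."""
        return self._rows.get(key)

    def scan(self, start: bytes, stop: bytes) -> Iterator[Tuple[bytes, dict]]:
        """Yield rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


rows = SortedRows()
for name in (b"user-100", b"user-002", b"user-010", b"user-001"):
    rows.put(name, {"name": name.decode()})

one = rows.get(b"user-010")
scanned = [key for key, _ in rows.scan(b"user-001", b"user-010")]
# scanned == [b"user-001", b"user-002"]
```

Because the keys are compared as bytes, the ordering is lexicographic, so zero-padding numeric parts of a key (as in `user-002`) keeps related rows adjacent in a scan.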
You may or may not be familiar with the famous (to geeks) Big Data White Papers. Published by Google Research between 2003 and 2006, these white papers presented the research for three pillars of the Hadoop ecosystem as we know it:
- Google File System (GFS): The Hadoop Distributed File System (HDFS) is an open source implementation of the GFS and defines how data is distributed across a cluster of commodity machines.
- MapReduce: A functional programming paradigm for analyzing data that is distributed across an HDFS cluster.
- Bigtable: A distributed storage system for managing structured data that is designed to scale to very large sizes -- petabytes of data across thousands of commodity machines. HBase is an open source implementation of Bigtable.