Apache Phoenix is a relatively new open source Java project that provides a JDBC driver and SQL access to Hadoop's NoSQL database: HBase. It was created as an internal project at Salesforce, open sourced on GitHub, and became a top-level Apache project in May 2014. If you have strong SQL programming skills and would like to be able to use them with a powerful NoSQL database, Phoenix could be exactly what you're looking for!
This installment of Open source Java projects introduces Java developers to Apache Phoenix. Since Phoenix runs on top of HBase, we'll start with an overview of HBase and how it differs from relational databases. You'll learn how Phoenix bridges the gap between SQL and NoSQL, and how it's optimized to interact efficiently with HBase. With those basics out of the way, we'll spend the remainder of the article learning how to work with Phoenix. You'll set up and integrate HBase and Phoenix, create a Java application that connects to HBase through Phoenix, and then create your first table, insert data, and run a few queries on it.
Four types of NoSQL data store
It is interesting (and somewhat ironic) that NoSQL data stores are categorized by a feature that they lack, namely SQL. NoSQL data stores come in four general flavors:
- Key/value stores map a specific key to a value, which may be a document, an array, or a simple type. Examples of key/value stores include Memcached, Redis, and Riak.
- Document stores manage documents, which are usually schema-less structures, like JSON, that can be of arbitrary complexity. Most document stores provide support for primary indexes as well as secondary indexes and complex queries. Examples of document stores include MongoDB and Couchbase.
- Graph databases focus primarily on the relationships between objects: data is stored both in nodes and in the relationships between those nodes. An example of a graph database is Neo4j.
- Column-oriented databases store data grouped by column rather than by row. HBase is a column-oriented database, and so is Cassandra.
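To make the last distinction more concrete, here is a minimal sketch of the same records laid out row by row and then pivoted into columns. It is only an illustration of the idea, not how HBase or Cassandra store data on disk; the class `ColumnLayoutSketch`, its `row` and `toColumns` helpers, and the sample `name` and `age` fields are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: contrasts a row-oriented layout with a column-oriented one.
class ColumnLayoutSketch {

    // Row-oriented: all fields of one record are kept together.
    static Map<String, Object> row(String name, int age) {
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("name", name);
        record.put("age", age);
        return record;
    }

    // Column-oriented: all values of one field are kept together.
    // Assumes every row carries the same set of fields.
    static Map<String, List<Object>> toColumns(List<Map<String, Object>> rows) {
        Map<String, List<Object>> columns = new LinkedHashMap<>();
        for (Map<String, Object> record : rows) {
            for (Map.Entry<String, Object> field : record.entrySet()) {
                columns.computeIfAbsent(field.getKey(), k -> new ArrayList<>())
                       .add(field.getValue());
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = List.of(row("Ann", 31), row("Bob", 45), row("Cy", 28));
        Map<String, List<Object>> columns = toColumns(rows);
        System.out.println("Rows:    " + rows);
        System.out.println("Columns: " + columns);
    }
}
```

With the column layout, reading every `age` value touches one list instead of every record, which is the kind of access pattern a column-oriented design favors.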
HBase: A primer
Apache HBase is a NoSQL database that runs on top of Hadoop as a distributed and scalable big data store. HBase is a column-oriented database that leverages the distributed storage of the Hadoop Distributed File System (HDFS) and the distributed processing of Hadoop's MapReduce programming paradigm. It was designed to host large tables with billions of rows and potentially millions of columns, all running across a cluster of commodity hardware.
In addition to capabilities inherited from Hadoop, HBase is a powerful database in its own right: it combines real-time queries with the speed of a key/value store, offers a robust table-scanning strategy for quickly locating records, and supports batch processing using MapReduce. As such, Apache HBase combines the power and scalability of Hadoop with the ability to query for individual records and execute MapReduce processes.
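The pairing of key/value-style lookups with table scans can be pictured with a small in-memory model. The sketch below is a toy built on `java.util.TreeMap`, not the HBase client API. It assumes that rows are kept sorted by row key and that each row maps column names to values. The class `SortedTableSketch`, its methods, and the sample keys and columns are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of a table whose rows are sorted by row key. Not the HBase API.
class SortedTableSketch {

    // row key -> (column name -> value); TreeMap keeps row keys in sorted order
    private final NavigableMap<String, NavigableMap<String, String>> rows = new TreeMap<>();

    public void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    // Key/value-style point lookup of a single cell; null if absent.
    public String get(String rowKey, String column) {
        Map<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get(column);
    }

    // Range scan: row keys in [startInclusive, stopExclusive), in sorted order.
    public List<String> scan(String startInclusive, String stopExclusive) {
        return new ArrayList<>(rows.subMap(startInclusive, true, stopExclusive, false).keySet());
    }

    public static void main(String[] args) {
        SortedTableSketch table = new SortedTableSketch();
        table.put("user#0002", "info:name", "Linda");
        table.put("user#0001", "info:name", "Steven");
        table.put("order#0001", "info:total", "19.99");

        System.out.println(table.get("user#0001", "info:name"));
        // '$' sorts immediately after '#', so this range covers every "user#" key
        System.out.println(table.scan("user#", "user$"));
    }
}
```

Because the keys are sorted, a scan over a key prefix only walks the matching range rather than the whole table, which is the intuition behind choosing row keys carefully in a store of this kind.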