
Open source Java projects: Apache Spark

Steven Haines | Aug. 26, 2015
High-performance big data analysis with Spark!

The following steps summarize the execution model for distributed Spark applications:

  1. Submit a Spark application using the spark-submit command.
  2. The spark-submit command will launch your driver program's main() method.
  3. The driver program will connect to the cluster manager and request resources on which to launch the executors.
  4. The cluster manager deploys the executor code to those resources and starts the executor processes.
  5. The driver runs and sends tasks to the executors.
  6. The executors run the tasks and send the results back to the driver.
  7. The driver completes its execution, stops the Spark context, and the resources managed by the cluster manager are released.
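Step 1 of the flow above is a single command. The class name, jar, master URL, and resource sizes below are placeholders for your own application, not values from this article:

```shell
# Submit a Spark application: spark-submit launches the driver, which
# then requests executor resources from the cluster manager.
spark-submit \
  --class com.example.WordCount \
  --master spark://master-host:7077 \
  --executor-memory 2G \
  --total-executor-cores 8 \
  wordcount.jar hdfs:///input/words.txt
```

In local development you can instead pass `--master local[*]` to run the driver and executors in a single JVM, with no cluster manager involved.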

Applying Spark to different technologies

We've reviewed the Spark programming model and seen how Spark applications are distributed across a Spark cluster. We'll conclude with a quick look at how Spark can be used to analyze different data sources. Figure 2 shows the logical components that make up the Spark stack.

Figure 2. Spark technology stack. Credit: Steven Haines

At the center of the Spark stack is the Spark Core. The Spark Core contains all the basic functionality of Spark, including task scheduling, memory management, fault recovery, integration with storage systems, and so forth. It also defines the resilient distributed datasets (RDDs) that we saw in our WordCount example and the APIs to interact with RDDs, including all of the transformations and actions we explored in the previous section.
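The transformation/action split at the heart of the Spark Core can be sketched in a few lines of Java. This is a minimal sketch, assuming Spark is on the classpath and running in local mode; the class name and data are invented for illustration:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddSketch {
    public static void main(String[] args) {
        // local[*] runs the Spark Core in-process using all available cores;
        // no cluster manager is needed for experimentation.
        SparkConf conf = new SparkConf().setAppName("RddSketch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

            // Transformation: lazily describes a new RDD; nothing runs yet.
            JavaRDD<Integer> squares = numbers.map(n -> n * n);

            // Action: triggers task scheduling and returns a result to the driver.
            List<Integer> result = squares.collect();
            System.out.println(result); // [1, 4, 9, 16, 25]
        }
    }
}
```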

Four libraries are built on top of the Spark Core that allow you to analyze data from other sources:

  1. Spark SQL: Allows for querying data via SQL as well as the Hive Query Language (HQL), if you are running Hive on top of Hadoop. The important thing about Spark SQL is that it allows you to intermix SQL queries with the programmatic RDD operations and actions that we saw in the WordCount example.
  2. Spark Streaming: Allows you to process live streaming data, such as data that is generated by an application's log file or live feeds from a message queue. Spark Streaming provides an API that is very similar to the core RDD operations and actions.
  3. MLlib: Provides machine learning functionality and algorithms, including classification, regression, clustering, and collaborative filtering.
  4. GraphX: Extends the Spark Core RDD API to add support for manipulating graph data, such as might be found in a social network.
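The intermixing that makes Spark SQL interesting can be sketched as follows. This is a hedged example assuming Spark SQL is on the classpath; `people.json` is an invented placeholder file (one JSON object per line, e.g. `{"name":"Ann","age":34}`), and the `SparkSession` entry point shown is the one from later Spark releases:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SqlSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("SqlSketch")
                .master("local[*]")
                .getOrCreate();

        // Load semi-structured JSON data into a Dataset of rows.
        Dataset<Row> people = spark.read().json("people.json");

        // Register it as a temporary view so it can be queried with SQL...
        people.createOrReplaceTempView("people");
        Dataset<Row> adults = spark.sql("SELECT name FROM people WHERE age >= 18");

        // ...then drop back to the programmatic RDD operations and actions
        // we used in the WordCount example, on the same data.
        long count = adults.javaRDD().count();
        System.out.println(count + " adults");

        spark.stop();
    }
}
```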

So where do you go from here? As we have already seen, you can load data from a local file system or from distributed storage such as HDFS or Amazon S3. Spark provides capabilities to read plain text, JSON, sequence files, protocol buffers, and more. Additionally, Spark lets you read structured data through Spark SQL, and key/value data from data sources such as Cassandra, HBase, and Elasticsearch. In all of these cases the process is the same:

  1. Create one or more RDDs that reference your data sources (files, a SQL database, HDFS files, a key/value store, etc.).
  2. Transform those RDDs into the format that you want, including operating on two RDDs together.
  3. If you are going to execute multiple actions, persist your RDDs.
  4. Execute actions to derive your business value.
  5. If you persisted your RDDs, be sure to unpersist them when you're finished.
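The five steps above can be sketched end to end in Java. This is a minimal sketch, assuming Spark on the classpath and local mode; the log file and its contents are invented for illustration, with a temp file standing in for HDFS, S3, or a key/value store:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class PipelineSketch {
    public static void main(String[] args) throws Exception {
        // A small local file keeps the example self-contained; the same code
        // works against an hdfs:// or s3a:// URI.
        Path file = Files.createTempFile("events", ".log");
        Files.write(file, Arrays.asList("INFO start", "ERROR disk full", "INFO stop"));

        SparkConf conf = new SparkConf().setAppName("PipelineSketch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Step 1: create an RDD that references the data source.
            JavaRDD<String> lines = sc.textFile(file.toString());

            // Step 2: transform it into the format we want.
            JavaRDD<String> errors = lines.filter(line -> line.startsWith("ERROR"));

            // Step 3: persist, because more than one action follows; without
            // this, the filter would be recomputed for each action.
            errors.persist(StorageLevel.MEMORY_ONLY());

            // Step 4: execute actions to derive the business value.
            long howMany = errors.count();
            String first = errors.first();
            System.out.println(howMany + " error(s); first: " + first);

            // Step 5: release the cached partitions.
            errors.unpersist();
        }
    }
}
```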

