
Open source Java projects: Apache Spark

Steven Haines | Aug. 26, 2015
High-performance big data analysis with Spark!

The WordCount application's main method accepts the source text file name from the command line and then invokes the wordCountJava8() method. The class defines two helper methods -- wordCountJava7() and wordCountJava8() -- that perform the same function (counting words), first in Java 7's notation and then in Java 8's.

WordCount in Java 7

The wordCountJava7() method is more explicit, so we'll start there. We first create a SparkConf object that points to our Spark instance, which in this case is "local." This means that we're going to be running Spark locally in our Java process space. In order to start interacting with Spark, we need a SparkContext instance, so we create a new JavaSparkContext that is configured to use our SparkConf. Now we have four steps:
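The setup described above can be sketched as follows; the class and method names here are illustrative, not the article's actual code:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkSetup {
    public static JavaSparkContext createLocalContext() {
        // "local" runs Spark inside this JVM, so no cluster is needed
        SparkConf conf = new SparkConf()
                .setMaster("local")
                .setAppName("WordCount");
        return new JavaSparkContext(conf);
    }

    public static void main(String[] args) {
        JavaSparkContext sc = createLocalContext();
        System.out.println("Connected to master: " + sc.master());
        sc.stop();
    }
}
```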

  1. Load our input data.
  2. Parse our input into words.
  3. Reduce our words into a tuple pair that contains the word and the count of occurrences.
  4. Save our results.

The first step is to use the JavaSparkContext's textFile() method to load our input from the specified file. This method reads the file from either the local file system or from a Hadoop Distributed File System (HDFS) and returns a resilient distributed dataset (RDD) of Strings. An RDD is Spark's core data abstraction: a distributed collection of elements. We perform operations on RDDs in the form of Spark transformations, and ultimately we apply Spark actions to translate an RDD into our desired result set.
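Putting the four steps together, a wordCountJava7() along these lines would work against the Spark 1.x Java API current when this article was written (the input and output paths are parameters for illustration; note that in Spark 2.x, FlatMapFunction's call() returns an Iterator rather than an Iterable):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class WordCountJava7 {
    public static void wordCountJava7(String inputFile, String outputDir) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("WordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // 1. Load the input data from the local file system (or HDFS)
        JavaRDD<String> input = sc.textFile(inputFile);

        // 2. Split each line into words; flatMap flattens the per-line lists
        //    into a single RDD of words
        JavaRDD<String> words = input.flatMap(new FlatMapFunction<String, String>() {
            public Iterable<String> call(String line) {
                return Arrays.asList(line.split(" "));
            }
        });

        // 3. Map each word to a (word, 1) tuple, then sum the counts per word
        JavaPairRDD<String, Integer> counts = words
                .mapToPair(new PairFunction<String, String, Integer>() {
                    public Tuple2<String, Integer> call(String word) {
                        return new Tuple2<String, Integer>(word, 1);
                    }
                })
                .reduceByKey(new Function2<Integer, Integer, Integer>() {
                    public Integer call(Integer a, Integer b) {
                        return a + b;
                    }
                });

        // 4. Save the results; Spark writes one part file per partition
        counts.saveAsTextFile(outputDir);
        sc.stop();
    }
}
```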

In this case, the transformation we want to first apply to the RDD is the flat map transformation. Transformations come in many flavors, but the most common are as follows:

  • map() applies a function to each element in the RDD and returns an RDD of the result.
  • flatMap() is similar to map() in that it applies a function individually to each element in the RDD, but rather than returning a single element, it returns an iterator over its return values. Spark then flattens those iterators into one large result.
  • filter() returns an RDD that contains only those elements that match the specified filter criteria.
  • distinct() returns an RDD with only distinct or unique elements -- it removes any duplicates.
  • union() is executed on an RDD in order to return a new RDD that contains the union (set operation) of it and another RDD. The result contains all elements from both RDDs.
  • intersection() returns an RDD that contains the intersection between two RDDs. The result contains only those elements that are in both RDDs.
  • subtract() removes the elements that are in one RDD from another RDD.
  • cartesian() computes the Cartesian product of two RDDs; use this transformation cautiously, because the result grows multiplicatively with the size of the inputs and can consume a lot of memory!
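The transformations above can be tried out against small in-memory RDDs built with parallelize(); since these are the Java API's function interfaces, they also accept Java 8 lambdas. The values and names below are illustrative:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TransformationDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setMaster("local").setAppName("transformations"));

        JavaRDD<Integer> nums  = sc.parallelize(Arrays.asList(1, 2, 2, 3));
        JavaRDD<Integer> other = sc.parallelize(Arrays.asList(3, 4));

        // map: one output element per input element
        List<Integer> squares = nums.map(x -> x * x).collect();       // [1, 4, 4, 9]

        // flatMap: each element expands to a collection, then all are
        // flattened into one RDD (Spark 1.x signature shown here)
        long expanded = nums.flatMap(x -> Arrays.asList(x, -x)).count();   // 8

        // filter: keep only elements matching the predicate
        List<Integer> evens = nums.filter(x -> x % 2 == 0).collect(); // [2, 2]

        // distinct: drop duplicates
        long unique = nums.distinct().count();                        // 3

        // union / intersection / subtract: set-style operations on two RDDs
        long all      = nums.union(other).count();        // 6 -- union keeps duplicates
        long shared   = nums.intersection(other).count(); // 1 -- just the 3
        long leftOnly = nums.subtract(other).count();     // 3 -- 1, 2, 2

        // cartesian: every pairing across the two RDDs -- 4 * 2 = 8 pairs
        long pairs = nums.cartesian(other).count();

        System.out.println(squares + " " + expanded + " " + evens + " " + unique
                + " " + all + " " + shared + " " + leftOnly + " " + pairs);
        sc.stop();
    }
}
```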

