Subscribe / Unsubscribe Enewsletters | Login | Register

Pencil Banner

Open source Java projects: Apache Spark

Steven Haines | Aug. 26, 2015
High-performance big data analysis with Spark!

The flatMap() transformation in Listing 2 returns an RDD that contains one element for each word, split by a space character. The flatMap() method expects a function that accepts a String and returns an Iterable interface to a collection of Strings.

WordCount in Java 8

In the Java 7 example, we create an anonymous inner class of type FlatMapFunction and override its call() method. The call() method is passed the input String and returns an Iterable reference to the results. The Java 7 example leverages the Arrays class's asList() method to create an Iterable interface to the String[], returned by the String's split() method. In the Java 8 example we use a lambda expression to create the same function without creating the anonymous inner class:


s -> Arrays.asList( s.split( " " ) )

Given an input s, this function splits s into words separated by spaces, and wraps the resultant String[] into an Iterable collection by calling Arrays.asList(). You can see that this is the exact same implementation, but it's much more succinct.

Spark transformations

At this point we have an RDD that contains all of our words. Our next step is to reduce the words RDD into a collection of RDD pairs that map each distinct word to a count of 1, then we'll count the words. The mapToPair() method iterates over every element in the RDD and executes a PairFunction on the element. The PairFunction implements a call() method that accepts an input String (the word from the previous step) and returns a Tuple2 instance.

Tuples in Java

If you're new to functional programming, a Tuple is a group or collection of data -- it can contain two elements or more. Spark is written in Scala, which supports tuples. In the scenario above, Spark creates a Tuple2 class that groups two elements together (in Scala you just create the tuple, but in Java it must be very explicit that it is a tuple that contains two elements). The call() method then returns a tuple that contains the word and the value of 1.

The reduceByKey() method iterates over the JavaPairRDD, finds all distinct keys in the tuples, and executes the provided Function2's call() method against all of the tuple's values. Stated another way, it finds all instances of the same word (such as apple) and then passes each count (each of the 1 values) to the supplied function to count occurrences of the word. Our call() function simply adds the two counts and returns the result.

Note that this is similar to how we would implement a word count in a Hadoop MapReduce application: map each word in the text file to a count of 1 and then reduce the results by adding all of the counts for each unique key.

 

Previous Page  1  2  3  4  5  6  7  8  9  Next Page 

Sign up for CIO Asia eNewsletters.