The flatMap() transformation in Listing 2 returns an RDD that contains one element for each word, split on the space character. The flatMap() method expects a function that accepts a String and returns an Iterable interface to a collection of Strings.
WordCount in Java 8
In the Java 7 example, we create an anonymous inner class of type FlatMapFunction and override its call() method. The call() method is passed the input String and returns an Iterable reference to the results. The Java 7 example leverages the asList() method to create an Iterable interface to the String array returned by the split() method. In the Java 8 example, we use a lambda expression to create the same function without creating an anonymous inner class:
s -> Arrays.asList( s.split( " " ) )
Given an input s, this function splits s into words separated by spaces and wraps the resulting String array in an Iterable collection by calling Arrays.asList(). You can see that this is exactly the same implementation, just much more succinct.
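To make the equivalence concrete outside of Spark, here is a minimal plain-Java sketch of the two styles side by side. It uses java.util.function.Function as a stand-in for Spark's FlatMapFunction (an assumption for illustration; Spark's own interface differs slightly):

```java
import java.util.Arrays;
import java.util.function.Function;

public class SplitDemo {
    public static void main(String[] args) {
        // Java 7 style: an anonymous inner class overriding a single method
        Function<String, Iterable<String>> java7Style = new Function<String, Iterable<String>>() {
            @Override
            public Iterable<String> apply(String s) {
                return Arrays.asList(s.split(" "));
            }
        };

        // Java 8 style: the same function written as a lambda expression
        Function<String, Iterable<String>> java8Style = s -> Arrays.asList(s.split(" "));

        // Both produce the same result for the same input
        System.out.println(java7Style.apply("to be or not to be")); // [to, be, or, not, to, be]
        System.out.println(java8Style.apply("to be or not to be")); // [to, be, or, not, to, be]
    }
}
```

Either function can be passed wherever a single-method interface is expected, which is exactly how the lambda slots into the flatMap() call.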
At this point we have an RDD that contains all of our words. Our next step is to transform the words RDD into a JavaPairRDD of tuples that map each word to a count of 1; then we'll add up the counts. The mapToPair() method iterates over every element in the RDD and executes a PairFunction on each element. The PairFunction implements a call() method that accepts an input String (the word from the previous step) and returns a Tuple2 that pairs the word with a count of 1.
Tuples in Java
If you're new to functional programming, a tuple is a group or collection of data; it can contain two or more elements. Spark is written in Scala, which supports tuples natively. In the scenario above, Spark uses Scala's Tuple2 class to group two elements together (in Scala you just create the tuple, but in Java you must be explicit that it is a tuple containing two elements). The call() method then returns a tuple that contains the word and the value 1.
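As a rough illustration of this pairing step without a Spark cluster, here is a plain-Java sketch. AbstractMap.SimpleEntry stands in for Spark's Tuple2, and a stream map() stands in for mapToPair(); both substitutions are assumptions made for illustration only:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PairDemo {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("apple", "pear", "apple");

        // Analogous to mapToPair(): each word becomes a (word, 1) pair.
        // SimpleEntry plays the role of Spark's Tuple2 here.
        List<Map.Entry<String, Integer>> pairs = words.stream()
                .map(w -> new SimpleEntry<>(w, 1))
                .collect(Collectors.toList());

        System.out.println(pairs); // [apple=1, pear=1, apple=1]
    }
}
```

Note that every occurrence of a word gets its own (word, 1) pair at this stage; nothing has been summed yet.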
The reduceByKey() method iterates over the JavaPairRDD, finds all distinct keys in the tuples, and executes the provided call() method against the values for each key. Stated another way, it finds all instances of the same word (such as apple) and then passes each count (each of the 1 values) to the supplied function to tally occurrences of that word. Our call() function simply adds the two counts and returns the result.
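The same reduction can be sketched in plain Java using Collectors.toMap() with a merge function playing the role of our call() method. This is a stand-in for reduceByKey(), not Spark's API, and is offered only to show the shape of the computation:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ReduceDemo {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("apple", "pear", "apple", "apple");

        // Analogous to reduceByKey(): each word starts with a count of 1,
        // and the merge function (a, b) -> a + b combines counts for
        // duplicate keys, just like our call() method adds two counts.
        Map<String, Integer> counts = words.stream()
                .collect(Collectors.toMap(w -> w, w -> 1, (a, b) -> a + b));

        System.out.println(counts.get("apple")); // 3
        System.out.println(counts.get("pear"));  // 1
    }
}
```

The merge function is invoked once per duplicate key, which mirrors how reduceByKey() repeatedly feeds pairs of counts to the supplied function until each key has a single total.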
Note that this is similar to how we would implement a word count in a Hadoop MapReduce application: map each word in the text file to a count of 1 and then reduce the results by adding all of the counts for each unique key.