Iterating the RDD with Java 8
Once again, the Java 8 example performs all of these same functions, but does it in a single, succinct line of code:
JavaPairRDD<String, Integer> counts = words.mapToPair( t -> new Tuple2( t, 1 ) ).reduceByKey( (x, y) -> (int)x + (int)y );
The mapToPair() method is passed a function. Given a t, which in our case is a word, the function returns a Tuple2 that maps the word to the number 1. We then chain the reduceByKey() method and pass it a function that reads: given inputs x and y, return their sum. Note that we needed to cast the inputs to int so that we could perform the addition operation (the raw Tuple2 leaves its components typed as Object).
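To make the two steps concrete outside of Spark, the same map-then-reduce-by-key computation can be sketched with plain Java 8 streams. This is only an illustration of what mapToPair() and reduceByKey() compute on a local collection, not Spark's distributed implementation; the class and method names are invented for the sketch:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountSketch {
    public static Map<String, Integer> countWords(List<String> words) {
        return words.stream()
                // mapToPair(): map each word t to a (t, 1) pair
                .map(w -> new SimpleEntry<>(w, 1))
                // reduceByKey(): merge the values of pairs sharing a key with (x, y) -> x + y
                .collect(Collectors.toMap(SimpleEntry::getKey,
                                          SimpleEntry::getValue,
                                          Integer::sum));
    }

    public static void main(String[] args) {
        System.out.println(countWords(Arrays.asList("to", "be", "or", "not", "to", "be")));
    }
}
```

The merge function passed to Collectors.toMap plays the same role as the lambda passed to reduceByKey(): it is invoked only when two pairs share a key.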
Transformations for key/value pairs
When writing Spark applications you will find yourself frequently working with pairs of elements, so Spark provides a set of common transformations that can be applied specifically to key/value pairs:
reduceByKey(function) combines values with the same key using the provided function.
groupByKey() maps each unique key to the collection of values assigned to that key.
combineByKey() combines values with the same key, allowing the result to be of a different type than the input values.
mapValues(function) applies a function to each of the values in the RDD, leaving the keys unchanged.
flatMapValues(function) applies a function to each value, but in this case the function returns an iterator over newly generated values. Spark then creates a new pair for each generated value, mapping the original key to it.
keys() returns an RDD that contains only the keys from the pairs.
values() returns an RDD that contains only the values from the pairs.
sortByKey() sorts the RDD by key.
subtractByKey(another RDD) removes all elements from the RDD whose key is present in the other RDD.
join(another RDD) performs an inner join between the two RDDs; the result contains only keys present in both, with each key mapped to the pair of values from the two RDDs.
rightOuterJoin(another RDD) performs a right outer join between the two RDDs; the result contains all keys from the other RDD, even those missing from the original.
leftOuterJoin(another RDD) performs a left outer join between the two RDDs; the result contains all keys from the original RDD, even those missing from the other.
cogroup(another RDD) groups data from both RDDs that share the same key.
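The difference between an inner join and a left outer join is easiest to see on a small example. The sketch below models the two join semantics on plain Java maps rather than RDDs, so it is only an illustration of the behavior described above; the class, method names, and sample data are invented, and null stands in for the Optional that Spark would use for a missing right-hand value:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class JoinSketch {
    // Inner join (like join()): keep only keys present in both maps.
    static Map<String, String> innerJoin(Map<String, String> left, Map<String, String> right) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : left.entrySet()) {
            if (right.containsKey(e.getKey())) {
                out.put(e.getKey(), e.getValue() + "," + right.get(e.getKey()));
            }
        }
        return out;
    }

    // Left outer join (like leftOuterJoin()): keep every key from the left map;
    // a key missing from the right map yields null on the right-hand side.
    static Map<String, String> leftOuterJoin(Map<String, String> left, Map<String, String> right) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : left.entrySet()) {
            out.put(e.getKey(), e.getValue() + "," + right.get(e.getKey()));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> a = new LinkedHashMap<>();
        a.put("spark", "fast");
        a.put("java", "typed");
        Map<String, String> b = new LinkedHashMap<>();
        b.put("spark", "scala");

        System.out.println(innerJoin(a, b));     // only the shared key "spark" survives
        System.out.println(leftOuterJoin(a, b)); // "java" is kept with a null right side
    }
}
```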
After we have transformed our RDD pairs, we invoke the saveAsTextFile() action method on the JavaPairRDD to create a directory called output and store the results in files in that directory.
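To show the on-disk shape this produces, the sketch below mimics the layout of a saveAsTextFile() output directory using plain Java file I/O: a directory containing part files whose lines read "(key,value)". The helper method, class name, and sample counts are invented for illustration; Spark itself writes one part file per partition:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SaveSketch {
    // Hypothetical helper mimicking the output layout of saveAsTextFile():
    // each pair becomes a "(key,value)" line in a part file inside the directory.
    static void saveCounts(Map<String, Integer> counts, Path dir) throws IOException {
        Files.createDirectories(dir);
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            lines.add("(" + e.getKey() + "," + e.getValue() + ")");
        }
        Files.write(dir.resolve("part-00000"), lines);
    }

    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("spark", 3);
        counts.put("java", 2);
        saveCounts(counts, Path.of("output"));
    }
}
```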
With our data compiled into the format that we want, our final step is to build and execute the application, then use Spark actions to derive some result sets.
Actions in Spark
Build the WordCount application with the following command:
mvn clean install