Open source Java projects: Apache Spark

Steven Haines | Aug. 26, 2015
High-performance big data analysis with Spark!

Iterating the RDD with Java 8

Once again, the Java 8 example performs the same steps, but in a single, succinct line of code:


JavaPairRDD<String, Integer> counts = words.mapToPair( t -> new Tuple2( t, 1 ) ).reduceByKey( (x, y) -> (int)x + (int)y );

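The mapToPair() method is passed a function: given a String t, which in our case is a word, the function returns a Tuple2 that maps the word to the number 1. We then chain the reduceByKey() method and pass it a function that reads: given inputs x and y, return their sum. Note that we needed to cast the inputs to int before adding them. This is most likely because the Tuple2 is created without type parameters, which leaves the compiler treating x and y as plain Objects. Creating the pair as new Tuple2<>(t, 1) should let the compiler infer the types and make the casts unnecessary.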
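To make the map-then-reduce idea concrete without a Spark cluster, here is a minimal plain-Java sketch of the same word count. It uses only java.util.stream and is an analogy, not Spark code: the class name WordCountSketch and the method countWords are made up for illustration, and the TreeMap is there only so the printed order is predictable.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java stand-in (no Spark dependency) for mapToPair() followed by reduceByKey().
// WordCountSketch and countWords are hypothetical names used only for this illustration.
final class WordCountSketch {

    // Pair each word with 1, then merge the 1s of equal words by summing them,
    // which is the role the summing function plays in reduceByKey().
    static Map<String, Integer> countWords(List<String> words) {
        return words.stream()
                .collect(Collectors.toMap(
                        word -> word,     // key: the word itself
                        word -> 1,        // value: an initial count of 1
                        Integer::sum,     // reduce step: add the counts that share a key
                        TreeMap::new));   // sorted map, only to keep the output predictable
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("to", "be", "or", "not", "to", "be");
        System.out.println(countWords(words)); // prints {be=2, not=1, or=1, to=2}
    }
}
```

The important difference is that Spark applies the reduce function across partitions of a distributed dataset, while this sketch runs in a single JVM.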

Transformations for key/value pairs

When writing Spark applications, you will frequently find yourself working with pairs of elements, so Spark provides a set of common transformations that apply specifically to key/value pairs:

  • reduceByKey(function) combines values with the same key using the provided function.
  • groupByKey() maps each unique key to the collection of all values assigned to that key.
  • combineByKey() combines values with the same key, but the combined result can be of a different type than the input values.
  • mapValues(function) applies a function to each of the values in the RDD, leaving the keys unchanged.
  • flatMapValues(function) applies a function to all values, but in this case the function returns an iterator over zero or more newly generated values. Spark then creates a new pair for each generated value, mapping the original key to it.
  • keys() returns an RDD that contains only the keys from the pairs.
  • values() returns an RDD that contains only the values from the pairs.
  • sortByKey() sorts the RDD by key, in ascending order by default.
  • subtractByKey(another RDD) removes all elements from the RDD whose key is present in the other RDD.
  • join(another RDD) performs an inner join between the two RDDs; the result contains only keys present in both, and each such key is paired with every combination of its values from the two RDDs.
  • rightOuterJoin(another RDD) performs a right outer join between the two RDDs: every key in the other RDD appears in the result, and where the original RDD has no matching key its value is left empty.
  • leftOuterJoin(another RDD) performs a left outer join between the two RDDs: every key in the original RDD appears in the result, and where the other RDD has no matching key its value is left empty.
  • cogroup(another RDD) groups data from both RDDs by key, mapping each key to the values it has in the original RDD and the values it has in the other RDD.
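The difference between grouping, an inner join, and a left outer join is easier to see on a tiny dataset. The following plain-Java sketch models a pair RDD as a list of Map.Entry objects and mimics the semantics of three of the transformations above. Nothing here is Spark API: the class PairOpsSketch and its methods are hypothetical, and results are rendered as strings purely for readability.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java model of a few pair transformations; none of this is Spark API.
final class PairOpsSketch {

    // Convenience factory for a key/value pair (hypothetical helper).
    static <K, V> Map.Entry<K, V> pair(K key, V value) {
        return new SimpleEntry<>(key, value);
    }

    // groupByKey(): each unique key maps to all values that shared that key.
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey, TreeMap::new,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    // join(): inner join, one result per combination of values with a matching key.
    static List<String> join(List<Map.Entry<String, Integer>> left,
                             List<Map.Entry<String, String>> right) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> l : left) {
            for (Map.Entry<String, String> r : right) {
                if (l.getKey().equals(r.getKey())) {
                    out.add("(" + l.getKey() + ",(" + l.getValue() + "," + r.getValue() + "))");
                }
            }
        }
        return out;
    }

    // leftOuterJoin(): every left pair survives; a missing right value is an empty Optional.
    static List<String> leftOuterJoin(List<Map.Entry<String, Integer>> left,
                                      List<Map.Entry<String, String>> right) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> l : left) {
            List<String> matches = right.stream()
                    .filter(r -> r.getKey().equals(l.getKey()))
                    .map(Map.Entry::getValue)
                    .collect(Collectors.toList());
            if (matches.isEmpty()) {
                out.add("(" + l.getKey() + ",(" + l.getValue() + "," + Optional.empty() + "))");
            }
            for (String match : matches) {
                out.add("(" + l.getKey() + ",(" + l.getValue() + "," + Optional.of(match) + "))");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> left = Arrays.asList(pair("a", 1), pair("a", 2), pair("b", 3));
        List<Map.Entry<String, String>> right = Arrays.asList(pair("a", "x"), pair("c", "y"));
        System.out.println(groupByKey(left));           // {a=[1, 2], b=[3]}
        System.out.println(join(left, right));          // [(a,(1,x)), (a,(2,x))]
        System.out.println(leftOuterJoin(left, right)); // [(a,(1,Optional[x])), (a,(2,Optional[x])), (b,(3,Optional.empty))]
    }
}
```

Key "b" exists only on the left, so it disappears from the inner join but survives the left outer join with an empty value. Key "c" exists only on the right, so it appears in neither result; a right outer join would keep it instead.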

After we have transformed our RDD pairs, we invoke the saveAsTextFile() action method on the JavaPairRDD to create a directory called output and store the results in files in that directory.
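To give a feel for what ends up on disk, the sketch below imitates the result in plain Java: it creates the output directory and writes one line per pair into a part file. It is an illustration under assumptions, not how Spark implements saveAsTextFile(). The class SaveSketch and the method save are hypothetical, the "(word,count)" format mirrors how a Tuple2 prints, and a real Spark job typically writes one part file per partition rather than a single file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java imitation of the on-disk result; not how Spark implements it.
// SaveSketch and save are hypothetical names used only for this illustration.
final class SaveSketch {

    // Creates outputDir and writes one "(word,count)" line per pair into a
    // single file named part-00000, mirroring Spark's part-file naming.
    static Path save(Map<String, Integer> counts, Path outputDir) throws IOException {
        // createDirectory throws FileAlreadyExistsException if outputDir already exists.
        Files.createDirectory(outputDir);
        List<String> lines = counts.entrySet().stream()
                .map(e -> "(" + e.getKey() + "," + e.getValue() + ")")
                .collect(Collectors.toList());
        return Files.write(outputDir.resolve("part-00000"), lines, StandardCharsets.UTF_8);
    }
}
```

The sketch fails if the directory is already there. By default Spark likewise refuses to write into an existing output directory, so delete the output directory before running the application a second time.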

With our data compiled into the format that we want, our final step is to build and execute the application, then use Spark actions to derive some result sets.

Actions in Spark

Build the WordCount application with the following command:


mvn clean install

 
