Open source Java projects: Apache Spark

Steven Haines | Aug. 26, 2015
High-performance big data analysis with Spark!

Execute it from the target directory with the following command:


java -jar spark-example-1.0-SNAPSHOT.jar YOUR_TEXT_FILE

Let's create a short text file to test that it works. Enter the following text into a file called test.txt:


This is a test
This is a test
The test is not over
Okay, now the test is over

From your target folder, execute the following command:


java -jar spark-example-1.0-SNAPSHOT.jar test.txt

Spark will create an output folder with a new file called part-00000. If you look at this file, you should see the following output:


(not,1)
(The,1)
(is,4)
(a,2)
(This,2)
(over,2)
(Okay,,1)
(now,1)
(the,1)
(test,4)

The output contains every word and the number of times it occurs. If you want to clean up the output, you might convert all words to lower case (note that "The" and "the" are counted separately), but I'll leave that as an exercise for you.
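
If you want a head start on that exercise, here's a minimal sketch of what the word-count pipeline looks like with the lower-casing applied. This assumes the Spark 1.x Java API with Java 8 lambdas; the class name, master setting, and output folder name are illustrative:


import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class LowerCaseWordCount {
    public static void main(String[] args) {
        // "local" runs Spark inside this JVM, which is what a plain
        // java -jar invocation needs (an assumption about the app's setup)
        SparkConf conf = new SparkConf()
                .setAppName("LowerCaseWordCount")
                .setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load the text file passed on the command line
        JavaRDD<String> lines = sc.textFile(args[0]);

        // Split each line into words, lower-casing them so that
        // "The" and "the" are counted together
        JavaRDD<String> words = lines.flatMap(
                line -> Arrays.asList(line.toLowerCase().split(" ")));

        // Map each word to a (word, 1) pair and sum the counts per word
        JavaPairRDD<String, Integer> counts = words
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        // The action: write the results to the output folder
        counts.saveAsTextFile("output");
        sc.stop();
    }
}


With the toLowerCase() call in place, "The" and "the" collapse into a single count.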

Finally, if you really want to take Spark for a spin, check out Project Gutenberg and download the full text of a large book. (I parsed Homer's Odyssey to test this application.)

Common actions in Spark

The saveAsTextFile() method is a Spark action. Earlier we saw transformations, which transform one RDD into another; actions are what generate actual results. In addition to the various save actions, the following are the most common:

  • collect() returns all elements in the RDD.
  • count() returns the number of elements in the RDD.
  • countByValue() returns a map of each distinct value in the RDD to the number of times it occurs.
  • take(count) returns the first "count" elements from the RDD.
  • top(count) returns the largest "count" elements from the RDD.
  • takeOrdered(count)(ordering) returns the specified number of elements ordered by the specified ordering function.
  • takeSample() returns a random sample of the requested number of elements from the RDD.
  • reduce() executes the provided function to combine the elements of the RDD into a single result.
  • fold() is similar to reduce(), but additionally accepts a "zero value" with which to initialize the combination.
  • aggregate() is similar to fold() (it also accepts a zero value), but is used to return a result of a different type than the source RDD.
  • foreach() executes the provided function on each element in the RDD, which is useful for things like writing to a database or publishing to a web service.

The actions listed here are some of the most common that you'll use, but other actions exist, including some designed to operate on specific types of RDDs. For example, mean() and variance() operate on RDDs of numbers, and join() operates on RDDs of key/value pairs.
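
To make these actions concrete, here's a small sketch that runs several of them against an RDD of numbers. The values and the local master setting are just for illustration:


import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class ActionExamples {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("ActionExamples")
                .setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(5, 1, 4, 2, 2));

        // collect() pulls every element back to the driver
        List<Integer> all = numbers.collect();                // [5, 1, 4, 2, 2]

        // count() and countByValue()
        long total = numbers.count();                         // 5
        Map<Integer, Long> byValue = numbers.countByValue();  // {5=1, 1=1, 4=1, 2=2}

        // take() returns the first n elements; top() the n largest
        List<Integer> firstTwo = numbers.take(2);             // [5, 1]
        List<Integer> topTwo = numbers.top(2);                // [5, 4]

        // reduce() combines elements; fold() does the same with a zero value
        int sum = numbers.reduce((a, b) -> a + b);            // 14
        int foldedSum = numbers.fold(0, (a, b) -> a + b);     // 14

        System.out.println(all + " " + total + " " + byValue + " "
                + firstTwo + " " + topTwo + " " + sum + " " + foldedSum);
        sc.stop();
    }
}


Unlike transformations, each of these calls triggers an actual computation, in this case inside the local JVM.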

Summary: Data analysis with Spark

The steps for analyzing data with Spark can be broadly summarized as follows:

  1. Obtain a reference to an RDD.
  2. Perform transformations to convert the RDD to the form you want to analyze.
  3. Execute actions to derive your result set.
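
To tie those three steps together, here is a minimal, self-contained sketch that obtains an RDD, applies one transformation, and executes one action. The class name, file name, and local master are illustrative:


import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ThreeSteps {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("ThreeSteps").setMaster("local"));

        // 1. Obtain a reference to an RDD
        JavaRDD<String> lines = sc.textFile("test.txt");

        // 2. Transform it into the form you want to analyze (lazy; nothing runs yet)
        JavaRDD<String> testLines = lines.filter(line -> line.contains("test"));

        // 3. Execute an action to derive the result (this triggers the computation)
        System.out.println("Lines mentioning 'test': " + testLines.count());

        sc.stop();
    }
}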

 
