Execute it from the target directory with the following command:
java -jar spark-example-1.0-SNAPSHOT.jar YOUR_TEXT_FILE
Let's create a short text file to test that it works. Enter the following text into a file called test.txt:
This is a test
This is a test
The test is not over
Okay, now the test is over
Then, from the target folder, execute the following command:
java -jar spark-example-1.0-SNAPSHOT.jar test.txt
Spark will create an output folder with a new file called part-00000. If you look at this file, you should see the following output:
(not,1)
(The,1)
(is,4)
(a,2)
(This,2)
(over,2)
(Okay,,1)
(now,1)
(the,1)
(test,4)
The output contains each word and the number of times it occurs. If you want to clean up the output, you might convert all words to lowercase (note the two "the"s), but I'll leave that as an exercise for you.
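To sketch that exercise without needing a Spark cluster, here is a plain-Java analog of the word-count pipeline using java.util.stream (not the Spark API); in the Spark version, you might add a map(word -> word.toLowerCase()) step before counting:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class LowercaseCount {
    // Plain-Java analog of the word-count pipeline: split on whitespace,
    // lower-case each word, then tally occurrences.
    static Map<String, Long> countWords(String text) {
        return Arrays.stream(text.split("\\s+"))
                .map(String::toLowerCase)
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = countWords(
                "This is a test This is a test The test is not over Okay, now the test is over");
        System.out.println(counts.get("the"));  // "The" and "the" merge: 2
        System.out.println(counts.get("test")); // 4
    }
}
```

With the lowercase step in place, the separate (The,1) and (the,1) entries collapse into a single (the,2).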
Finally, if you really want to take Spark for a spin, check out Project Gutenberg and download the full text of a large book. (I parsed Homer's Odyssey to test this application.)
Common actions in Spark
The saveAsTextFile() method is a Spark action. Earlier we saw transformations, which transform one RDD into another RDD; actions generate our actual results. In addition to the various save actions, the following are the most common:
- collect() returns all of the elements in the RDD.
- count() returns the number of elements in the RDD.
- countByValue() returns a map of each unique value in the RDD to the number of times it occurs.
- take(count) returns the requested number of elements from the RDD.
- top(count) returns the top "count" elements from the RDD.
- takeOrdered(count)(ordering) returns the specified number of elements, ordered by the specified ordering function.
- takeSample() returns a random sample of the requested number of elements from the RDD.
- reduce() executes the provided function to combine the elements into a result.
- fold() is similar to reduce(), but additionally accepts a "zero value."
- aggregate() is similar to fold() (it also accepts a zero value), but is used to return a different type than that of the source RDD.
- foreach() executes the provided function on each element in the RDD, which is good for things like writing to a database or publishing to a web service.
The actions listed here are some of the most common you'll use, but other actions exist, including some designed to operate on specific types of RDDs. For example, variance() operates on RDDs of numbers, and join() operates on RDDs of key/value pairs.
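To make the semantics of reduce(), fold(), and aggregate() concrete without a Spark cluster, the sketch below mimics each one with plain java.util.stream calls over a small list (these are analogs, not the Spark API itself):

```java
import java.util.Arrays;
import java.util.List;

public class ActionSemantics {
    static List<Integer> data = Arrays.asList(1, 2, 3, 4);

    // reduce(): combine all elements with a binary function.
    static int reduceSum() {
        return data.stream().reduce((a, b) -> a + b).orElse(0);
    }

    // fold(): like reduce(), but seeded with a "zero value".
    static int foldSum() {
        return data.stream().reduce(0, (a, b) -> a + b);
    }

    // aggregate(): returns a different type than the source elements --
    // here a (sum, count) pair, combined into an average.
    static double aggregateAvg() {
        int[] acc = data.stream().reduce(new int[]{0, 0},
                (p, x) -> new int[]{p[0] + x, p[1] + 1},
                (p, q) -> new int[]{p[0] + q[0], p[1] + q[1]});
        return (double) acc[0] / acc[1];
    }

    public static void main(String[] args) {
        System.out.println(reduceSum());     // 10
        System.out.println(foldSum());       // 10
        System.out.println(aggregateAvg());  // 2.5
    }
}
```

The aggregate() analog shows why a zero value and a combiner matter: the accumulator builds a (sum, count) pair, a type different from the Integer elements, which is exactly the case aggregate() exists for.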
Summary: Data analysis with Spark
The steps for analyzing data with Spark can be broadly summarized as follows:
- Obtain a reference to an RDD.
- Perform transformations to convert the RDD to the form you want to analyze.
- Execute actions to derive your result set.
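The three steps above can be sketched in plain Java (no cluster required); in Spark, the List would be an RDD, the intermediate stream calls would be transformations, and the terminal call would be an action:

```java
import java.util.Arrays;
import java.util.List;

public class ThreeSteps {
    static long distinctWordCount(List<String> lines) {
        return lines.stream()
                // Step 2: transformation -- split each line into words.
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // Step 3: action -- derive the result (count of distinct words).
                .distinct()
                .count();
    }

    public static void main(String[] args) {
        // Step 1: obtain a reference to the data set.
        List<String> lines = Arrays.asList("to be", "or not", "to be");
        System.out.println(distinctWordCount(lines)); // to, be, or, not -> 4
    }
}
```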