Subscribe / Unsubscribe Enewsletters | Login | Register

Pencil Banner

Open source Java projects: Apache Spark

Steven Haines | Aug. 26, 2015
High-performance big data analysis with Spark!

Big data adoption has been growing by leaps and bounds over the past few years, which has necessitated new technologies to analyze that data holistically. Individual big data solutions provide their own mechanisms for data analysis, but how do you analyze data that is contained in Hadoop, Splunk, files on a file system, a local database, and so forth?

The answer is that you need an abstraction that can pull data from all of these sources and analyze potentially petabytes of information very rapidly.

Spark is a computational engine that manages tasks across a collection of worker machines in what is called a computing cluster. It provides the necessary abstraction, integrates with a host of different data sources, and analyzes data very quickly. This installation in the Open source Java projects series reviews Spark, describes how to set up a local environment, and demonstrates how to use Spark to derive business value from your data.

Counting words with Spark

Let's begin by writing a simple word-counting application using Spark in Java. After this hands-on demonstration we'll explore Spark's architecture and how it works.

Similar to the standard "Hello, Hadoop" application, the "Hello, Spark" application will take a source text file and count the number of unique words that are in it. To start, create a new project using Maven with the following command:


mvn archetype:generate -DgroupId=com.geekcap.javaworld -DartifactId=spark-example

Next, modify your pom.xml file to include the following Spark dependency:


<dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>1.4.0</version>
</dependency>		

Listing 1 shows the complete contents of my pom.xml file.

Listing 1. pom.xml


<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.geekcap.javaworld</groupId>
    <artifactId>spark-example</artifactId>
    <packaging>jar</packaging>
    <version>1.0-SNAPSHOT</version>
    <name>spark-example</name>
    <url>http://maven.apache.org</url>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.0.2</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
                <configuration>
                    <archive>
                        <manifest>
                            <addClasspath>true</addClasspath>
                            <classpathPrefix>lib/</classpathPrefix>
                            <mainClass>com.geekcap.javaworld.sparkexample.WordCount</mainClass>
                        </manifest>
                    </archive>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-dependency-plugin</artifactId>
                <executions>
                    <execution>
                        <id>copy</id>
                        <phase>install</phase>
                        <goals>
                            <goal>copy-dependencies</goal>
                        </goals>
                        <configuration>
                            <outputDirectory>${project.build.directory}/lib</outputDirectory>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

    <dependencies>

        <!-- Import Spark -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>1.4.0</version>
        </dependency>

        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.11</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
</project>	

 

1  2  3  4  5  6  7  8  9  Next Page 

Sign up for CIO Asia eNewsletters.