Apache Spark example with Java and Maven
December 21, 2014
Apache Spark has all the credentials to become the next cool Big Data technology. It is designed to work seamlessly with Hadoop or as a standalone application. The big advantage of Spark standalone mode is its ease of use, which makes it an ideal environment for testing purposes.
Project structure
This example consists of a pom.xml file and a WordCount.java file. An additional test file with some random data, loremipsum.txt, is also included. All the files can be found on GitHub at https://github.com/melphi/spark-examples/tree/master/first-example.
spark-examples
|-- pom.xml
`-- src
|-- main/java/org/sparkexample/WordCount.java
`-- test/resources/loremipsum.txt
Maven configuration
This is the Maven pom.xml configuration file. As you can see, we need to import the spark-core library. The optional maven-compiler-plugin is used to compile the project directly from Maven.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>org.sparkexamples</groupId>
  <artifactId>first-example</artifactId>
  <version>1.0-SNAPSHOT</version>

  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.2.0</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.1</version>
      </plugin>
    </plugins>
  </build>
</project>
Java application
WordCount.java is a simple Java Spark application which counts the occurrences of each word in the file passed as the first argument and saves the result to the directory passed as the second argument.
package org.sparkexample;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import java.util.Arrays;

public class WordCount {
  // Splits each input line into words.
  private static final FlatMapFunction<String, String> WORDS_EXTRACTOR =
      new FlatMapFunction<String, String>() {
        @Override
        public Iterable<String> call(String s) throws Exception {
          return Arrays.asList(s.split(" "));
        }
      };

  // Maps each word to a (word, 1) pair.
  private static final PairFunction<String, String, Integer> WORDS_MAPPER =
      new PairFunction<String, String, Integer>() {
        @Override
        public Tuple2<String, Integer> call(String s) throws Exception {
          return new Tuple2<String, Integer>(s, 1);
        }
      };

  // Sums the counts of identical words.
  private static final Function2<Integer, Integer, Integer> WORDS_REDUCER =
      new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer a, Integer b) throws Exception {
          return a + b;
        }
      };

  public static void main(String[] args) {
    if (args.length < 2) {
      System.err.println("Please provide the input file and the output directory as arguments");
      System.exit(1);
    }

    SparkConf conf = new SparkConf().setAppName("org.sparkexample.WordCount").setMaster("local");
    JavaSparkContext context = new JavaSparkContext(conf);

    JavaRDD<String> file = context.textFile(args[0]);
    JavaRDD<String> words = file.flatMap(WORDS_EXTRACTOR);
    JavaPairRDD<String, Integer> pairs = words.mapToPair(WORDS_MAPPER);
    JavaPairRDD<String, Integer> counter = pairs.reduceByKey(WORDS_REDUCER);

    counter.saveAsTextFile(args[1]);
  }
}
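As a side note, if the project is compiled with Java 8 the same pipeline can be written much more compactly with lambda expressions. The sketch below is equivalent to the WordCount class above; the class name WordCountLambda is made up for illustration, and it assumes the maven-compiler-plugin is configured for source and target 1.8 (which the pom above does not do):

package org.sparkexample;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

// Illustrative Java 8 variant of the WordCount application above.
public class WordCountLambda {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("org.sparkexample.WordCountLambda").setMaster("local");
    JavaSparkContext context = new JavaSparkContext(conf);

    context.textFile(args[0])
        // split each line into words
        .flatMap(line -> Arrays.asList(line.split(" ")))
        // emit a (word, 1) pair for every word
        .mapToPair(word -> new Tuple2<>(word, 1))
        // sum the counts of identical words
        .reduceByKey((a, b) -> a + b)
        .saveAsTextFile(args[1]);
  }
}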
Set up the Spark environment
Apache Spark can run as a standalone application or be installed in a Hadoop environment. Spark can be downloaded at https://spark.apache.org/downloads.html. For this example we will use the latest available release for the package type "Pre-built for Hadoop 2.4", however any other version should work without changes.
Once downloaded and decompressed, the package is ready to be used; a Hadoop installation is not required for standalone execution, the only requirement is an installed Java virtual machine.
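A quick way to verify that the standalone package works is to run one of the examples bundled with the distribution. The directory name below assumes the "Pre-built for Hadoop 2.4" package of Spark 1.2.0; adjust it to the version actually downloaded:

cd spark-1.2.0-bin-hadoop2.4
./bin/run-example SparkPi 10

If an approximate value of Pi appears among the log lines, the installation works.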
Run the Java application on Apache Spark
The Java application needs to be compiled before it can be executed on Apache Spark. To compile the Java application with Maven (the commands are summarized after this list):
1. open the command line and move to the root of the Maven project with "cd <project-path>"
2. execute the command "mvn package". Maven must be installed on the system path, otherwise the mvn command will not be recognized. Refer to the Maven documentation on how to set up Maven properly.
3. Maven will build the Java application and save it in the target directory as <project-path>/target/first-example-1.0-SNAPSHOT.jar
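For reference, the whole build boils down to the following commands, where <project-path> stands for the directory containing pom.xml:

cd <project-path>
mvn package
ls target/first-example-1.0-SNAPSHOT.jar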
Once we have built the Java application first-example-1.0-SNAPSHOT.jar we can execute it locally on Apache Spark, which makes the entire testing process very easy.
On a command shell, move to the Spark installation directory and use the following command (<project-path>, <input-file> and <output-directory> are placeholders for the actual paths):
./bin/spark-submit --class org.sparkexample.WordCount --master local[2] <project-path>/target/first-example-1.0-SNAPSHOT.jar <input-file> <output-directory>
where
• "--class org.sparkexample.WordCount" is the main Java class with the public static void main method
• "--master local[2]" starts the cluster locally using 2 CPU cores
• is the path to our maven project
• is a demo local file which contains some words, an example file can be downloaded at https://github.com/melphi/spark-examples/blob/master/first-example/src/test/resources/loremipsum.txt
• is the directory where the resuls should be saved
If everything is fine, the Spark job log is printed on the console and we should find the file part-00000 in the output directory.
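Each line of part-00000 is the string form of a (word, count) tuple. The values below are purely illustrative; the actual words and counts depend on the input file:

(lorem,4)
(ipsum,3)
(dolor,2)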
All the files used in this tutorial can be found at https://github.com/melphi/spark-examples/tree/master/first-example.