Apache Spark-Java Project Setup

Overview

 

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. An RDD is a read-only, partitioned collection of records; in other words, an immutable distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.

There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.

Spark uses the RDD abstraction to achieve faster and more efficient MapReduce-style operations.
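
As an illustration, here is a minimal sketch of both RDD creation approaches. The class name RddCreationSample and the file path data/input.txt are placeholders, and the local context configuration used here is explained in the sections below.

package com.yournxt.spark.test;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddCreationSample {

    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName("RDD Creation Sample");
        sparkConf.setMaster("local");
        JavaSparkContext context = new JavaSparkContext(sparkConf);

        // 1. Parallelize an existing collection in the driver program
        JavaRDD<Integer> fromCollection = context.parallelize(Arrays.asList(1, 2, 3));

        // 2. Reference a dataset in external storage (local file, HDFS, etc.)
        //    "data/input.txt" is a placeholder path
        JavaRDD<String> fromFile = context.textFile("data/input.txt");

        System.out.println(fromCollection.count() + " numbers, " + fromFile.count() + " lines");

        context.close();
    }
}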

 

Prerequisites

  • Java : Ensure that Java is installed; it is required before installing Spark. This setup targets Java 8 (see java.version in the pom.xml below).
  • Maven 3 : Used to manage and download the project dependencies.
  • Eclipse : IDE for Java/Java EE development.

 

Project Setup

 Create a Java Maven project from the Eclipse IDE.

 Update pom.xml with the following details:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

    <modelVersion>4.0.0</modelVersion>

    <groupId>com.yournxt</groupId>
    <artifactId>spark-java</artifactId>
    <version>0.0.1-SNAPSHOT</version>

    <properties>
        <java.version>1.8</java.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>2.0.0</version>
        </dependency>
    </dependencies>

    <build>
        <pluginManagement>
            <plugins>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.1</version>
                    <configuration>
                        <source>${java.version}</source>
                        <target>${java.version}</target>
                    </configuration>
                </plugin>
            </plugins>
        </pluginManagement>
    </build>

</project>
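
Note: the _2.10 suffix in the spark-core_2.10 artifactId refers to the Scala version the Spark artifact was built against, so the dependency must match a Scala version published for the chosen Spark release.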

Right-click on the Java project and run it as a Maven build with the goal “install”. This will download and install all required Spark dependencies into the .m2 repository on your system.
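
Alternatively, assuming Maven is available on your PATH, the same build can be run from the command line in the project root:

mvn clean install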

 

Local Spark Context Creation

The benefit of creating a local Spark context is that everything can run locally, without deploying a separate Spark server as the master. This is very convenient during the development phase. Here is the basic configuration:

package com.yournxt.spark.test.transformations;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkSample {

    public static void main(String[] args) {
        // Basic Spark configuration
        SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName("Hello Spark");
        // "local" runs Spark in-process with a single worker thread;
        // "local[*]" would use one thread per available core
        sparkConf.setMaster("local");

        // Create the context and close it when done
        JavaSparkContext context = new JavaSparkContext(sparkConf);
        context.close();
    }
}

 

Running the above program as a Java application will create a Spark context and complete the basic Spark Java setup.

Basic Spark Actions Program

package com.yournxt.spark.test.actions;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ActionSample {

    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName("Hello Spark");
        sparkConf.setMaster("local");
        JavaSparkContext context = new JavaSparkContext(sparkConf);

        // Create an RDD by parallelizing a collection from the driver program
        JavaRDD<Integer> numbersRDD = context.parallelize(Arrays.asList(0, 5, 3));

        // collect() is an action: it triggers the computation and
        // brings all elements back to the driver as a List
        List<Integer> numbersList = numbersRDD.collect();
        System.out.println(numbersList.toString());

        context.close();
    }
}
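
Running ActionSample prints [0, 5, 3] to the console. Actions such as collect() trigger the actual computation, whereas transformations such as map() are lazy and only describe a new RDD. The following sketch extends the example with a simple transformation followed by an action; the class name TransformationSample is a placeholder.

package com.yournxt.spark.test.transformations;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TransformationSample {

    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName("Hello Spark Transformations");
        sparkConf.setMaster("local");
        JavaSparkContext context = new JavaSparkContext(sparkConf);

        JavaRDD<Integer> numbersRDD = context.parallelize(Arrays.asList(0, 5, 3));

        // map() is a transformation: it is evaluated lazily and returns a new RDD
        JavaRDD<Integer> squaredRDD = numbersRDD.map(n -> n * n);

        // reduce() is an action: it triggers the computation and returns a value to the driver
        int sumOfSquares = squaredRDD.reduce((a, b) -> a + b);
        System.out.println("Sum of squares: " + sumOfSquares);

        context.close();
    }
}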