Posted In: Apache, Spark

How to set up Spark on Windows

There are two ways to run Spark on Windows: interactively through spark-shell, or through Java code in Eclipse.

1. spark-shell – Software required

1. Install JDK 1.8
2. Download spark-2.2.0-bin-hadoop2.7.tgz
3. Download winutils.exe
4. Create the /tmp/hive/ folder
5. Set HADOOP_HOME and run winutils chmod
6. Run spark-shell

Step 1 – Install JDK and set JAVA_HOME
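
For example, from a command prompt (the JDK path below is an assumption; use your actual install directory):

rem Example only – adjust the path to your JDK 1.8 install
set JAVA_HOME=C:\Program Files\Java\jdk1.8.0_144
set PATH=%JAVA_HOME%\bin;%PATH%
rem Verify
java -version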

Step 2 – Download Spark and unzip in a folder
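
Spark ships as a .tgz archive; extract it with 7-Zip or a similar tool into a folder without spaces in the path. Assuming E:\programs as the target (an example location), you should end up with:

E:\programs\spark-2.2.0-bin-hadoop2.7\
	bin\spark-shell.cmd
	conf\
	jars\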

Step 3 – Download winutils.exe (the build matching Hadoop 2.7) and put it in a bin subfolder, e.g. E:\programs\winutils\bin\winutils.exe
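
Hadoop looks for the binary at %HADOOP_HOME%\bin\winutils.exe, so the bin level matters. With the example path above, the layout is:

E:\programs\winutils\
	bin\
		winutils.exe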

Step 4 – Create the \tmp\hive folder on the drive you will run Spark from. You also need to run winutils chmod on this folder, as shown below.
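
For example, assuming the E: drive and the winutils location from Step 3:

rem Create the Hive scratch folder and make it writable (drive letter is an example)
mkdir E:\tmp\hive
E:\programs\winutils\bin\winutils.exe chmod 777 E:\tmp\hive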

Step 5 – Open <spark folder>\bin\spark-shell.cmd in a text editor and add the following lines at the top:


rem Point Spark at the folder that contains bin\winutils.exe
set HADOOP_HOME=E:/programs/winutils
rem Make the Hive scratch folder writable
E:\programs\winutils\bin\winutils.exe chmod 777 E:\tmp\hive

Step 6 – Run spark-shell
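
From a command prompt, using the folder from Step 2:

cd E:\programs\spark-2.2.0-bin-hadoop2.7
bin\spark-shell.cmd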

Verify with the following:


spark.range(1).withColumn("status", lit("Hello world!")).show(false)
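
If everything is set up correctly, this prints:

+---+------------+
|id |status      |
+---+------------+
|0  |Hello world!|
+---+------------+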


2. Eclipse – Java code

1. Install JDK 1.8
2. Install Eclipse
3. Download winutils.exe
4. Add Maven entry for Spark
5. Run Java code

Step 1 – Install JDK and set JAVA_HOME

Step 2 – Install Eclipse

Step 3 – Download winutils.exe and put it at E:\programs\winutils\bin\winutils.exe, as in the spark-shell setup above

Step 4 – Create a Maven project and add the following dependencies to pom.xml

<dependency>
	<groupId>org.apache.spark</groupId>
	<artifactId>spark-core_2.11</artifactId>
	<version>2.2.0</version>
</dependency>
<dependency>
	<groupId>org.apache.spark</groupId>
	<artifactId>spark-sql_2.11</artifactId>
	<version>2.2.0</version>
</dependency>
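
The example code in Step 5 uses Java 8 lambdas, so also raise the Maven compiler level (Maven defaults to an older source level); one way is via properties in pom.xml:

<properties>
	<maven.compiler.source>1.8</maven.compiler.source>
	<maven.compiler.target>1.8</maven.compiler.target>
</properties>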

Step 5 – Java code to count matching lines in a file

The example uses India stock market end-of-day (EOD) price data, available here.

package com.example;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class App {
	public static void main(String[] args) {
		// Point Hadoop at the folder containing bin\winutils.exe;
		// Hadoop's Shell class reads the hadoop.home.dir system property,
		// which avoids having to set the HADOOP_HOME environment variable
		System.setProperty("hadoop.home.dir", "E:/programs/winutils");
		String logFile = "C:/Users/trupti/Downloads/cm24AUG2017bhav.csv";

		// Run locally in a single JVM; no cluster needed
		SparkConf sparkConf = new SparkConf();
		sparkConf.setAppName("Hello Spark");
		sparkConf.setMaster("local");

		JavaSparkContext context = new JavaSparkContext(sparkConf);

		// getOrCreate() reuses the context created above
		SparkSession spark = SparkSession.builder().appName("Simple Application").getOrCreate();
		Dataset<String> logData = spark.read().textFile(logFile).cache();

		// Count lines containing the RELIANCE symbol and lines ending with its ISIN
		long num1 = logData.filter(s -> s.contains("RELIANCE")).count();
		long num2 = logData.filter(s -> s.endsWith("INE027A01015,")).count();

		System.out.println("Lines with RELIANCE: " + num1 + ", lines ending with INE027A01015,: " + num2);
		spark.stop();
		context.close();
	}
}
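
Run App as a Java application in Eclipse (Run As > Java Application). With a bhav copy file at the path above, a line like the following appears among the Spark log output (the counts depend on the file used):

Lines with RELIANCE: <count>, lines ending with INE027A01015,: <count>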

Common errors

Solution – Download winutils and set HADOOP_HOME


17/08/27 18:27:03 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
	at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:378)
	at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:393)
	at org.apache.hadoop.util.Shell.<clinit>(Shell.java:386)
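
In code, the equivalent of setting HADOOP_HOME is to set hadoop.home.dir before the first Spark class initializes (the path is an example):

// Must run before the SparkContext/SparkSession is created
System.setProperty("hadoop.home.dir", "E:/programs/winutils");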

Solution – Set a master URL on the configuration, e.g. sparkConf.setMaster("local"), and create the context from it: JavaSparkContext context = new JavaSparkContext(sparkConf);


17/08/27 19:37:05 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:376)
	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)
	at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:909)
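
Equivalently, when building the session directly, set the master on the builder (a sketch; local[*] uses all available cores):

SparkSession spark = SparkSession.builder()
		.appName("Simple Application")
		.master("local[*]")
		.getOrCreate();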

Solution – Create the \tmp\hive folder and make it writable: E:\programs\winutils\bin\winutils.exe chmod 777 E:\tmp\hive (see Step 4 of the spark-shell setup)


Caused by: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: 
java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: ---------;

Posted on August 28th, 2017
