How to Install Spark 3 on Windows 10

Sandipan Ghosh
4 min read · Mar 10, 2021

I have been using Spark for a long time. It is an excellent distributed computation framework. I use it regularly at work, and I also have it installed on my local desktop and laptop.

This document shows the steps for installing Spark 3+ on Windows 10 in pseudo-distributed mode.

Steps:-

1. Install WSL2

a. https://docs.microsoft.com/en-us/windows/wsl/install-win10

2. Install Ubuntu 20.04 LTS from the Microsoft Store.

3. Install Windows Terminal from the Microsoft Store. This step is optional; you can use PowerShell or MobaXterm instead.

4. Fire up Ubuntu from WSL: open an Ubuntu tab from Windows Terminal.

5. Once logged in, go to the home directory with "cd ~".

6. For Spark, we need:

a. Python3

b. Java

c. Latest Scala

d. Spark pre-built with Hadoop (binary tarball)

7. Let's download and install all the prerequisites.

8. Install Python:

sudo apt-get install software-properties-common

sudo apt-get install python-software-properties
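On newer Ubuntu releases the python-software-properties package may no longer be available. As a hedge, you can install Python 3 and pip directly; these are standard Ubuntu packages:

sudo apt-get install -y python3 python3-pip   # Python 3 interpreter and pip

python3 --version   # confirm the install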

9. Install Java (OpenJDK):

sudo apt-get install openjdk-8-jdk

10. Check the java and javac versions:

java -version

javac -version

(Screenshot: java and javac version output)
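Spark mainly needs java on the PATH, but some tools also expect JAVA_HOME to be set. A minimal sketch, assuming the default install location for openjdk-8-jdk on 64-bit Ubuntu (adjust the path if your system differs), added to ~/.bashrc:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # default openjdk-8-jdk path on amd64 Ubuntu; verify on your machine

export PATH=$PATH:$JAVA_HOME/bin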
11. Install Scala.

12. Get the Scala binary for Unix:

wget https://downloads.lightbend.com/scala/2.13.3/scala-2.13.3.tgz

tar xvf scala-2.13.3.tgz

13. Edit the .bashrc file to add Scala:

vi ~/.bashrc

14. Add these lines at the end, pointing SCALA_HOME at the directory where Scala was extracted:

export SCALA_HOME=/path/where/scala/is/located   # e.g. /root/scala-2.13.3

export PATH=$PATH:$SCALA_HOME/bin

15. Once done, save and close the file.

16. Let's check the Scala version:

source ~/.bashrc

scala -version

(Screenshot: checking the Scala version)
17. Get the Spark package.

18. I downloaded the pre-built Spark binary (with Hadoop) from the Apache archive:

wget https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz

tar xvf spark-3.1.1-bin-hadoop3.2.tgz

vi ~/.bashrc

export SPARK_HOME="/home/sandipan/spark-3.1.1-bin-hadoop3.2"

export PATH=$PATH:$SPARK_HOME/bin
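Optionally, if you have more than one Python installation, you can pin the interpreter PySpark uses with the standard PYSPARK_PYTHON environment variable (add it to ~/.bashrc alongside the lines above):

export PYSPARK_PYTHON=python3   # make PySpark use the system python3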

19. Once done, save and close the file, then run the command below to reload the profile:

source ~/.bashrc

20. Start the Spark services:

cd $SPARK_HOME

21. Start the master server:

./sbin/start-master.sh
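You can confirm the master JVM is up with jps, which ships with the JDK (the PID shown will differ on your machine):

jps -l   # should list a process for org.apache.spark.deploy.master.Master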

22. Once you start the master server, you will get a message saying it has started.
23. You can see the Spark status in the master's web console at http://localhost:8080.
24. There you will see the master URL.
(Screenshot: the master URL in the web console)
25. Mine looks like: "spark://LAPTOP-7DUT93OF.localdomain:7077"
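If you prefer the terminal to the web console, the same URL is normally printed in the master log under $SPARK_HOME/logs; a quick way to pull it out (the exact log file name depends on your user and hostname):

grep -o "spark://.*" $SPARK_HOME/logs/*Master*.out   # prints the master URL from the startup log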
26. We can start a worker using the below command:

SPARK_WORKER_INSTANCES=3 SPARK_WORKER_CORES=2 SPARK_WORKER_MEMORY=7G ./sbin/start-worker.sh spark://LAPTOP-7DUT93OF.localdomain:7077

a. SPARK_WORKER_INSTANCES = how many worker instances you want to start.
b. SPARK_WORKER_CORES = how many cores per instance you want to give. Generally, I give 1 core.
c. SPARK_WORKER_MEMORY = memory per worker. Be very careful with this parameter. My laptop has 32 GB of memory, so I keep 3 GB to 4 GB for Windows, 2 GB for the driver program, and the rest for the worker nodes (see the quick check below).
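As a quick sanity check with the numbers above: 3 workers × 7 GB = 21 GB for the workers, plus roughly 2 GB for the driver, which leaves about 9 GB of the 32 GB machine for Windows and WSL itself.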
(Screenshot: worker nodes started)
27. Open the pyspark shell (or the Scala spark-shell):

$SPARK_HOME/bin/pyspark --master spark://LAPTOP-7DUT93OF.localdomain:7077 --executor-memory 6500mb

$SPARK_HOME/bin/spark-shell --master spark://LAPTOP-7DUT93OF.localdomain:7077 --executor-memory 6500mb

(Screenshot: starting a pyspark shell)
28. To stop all the workers:

SPARK_WORKER_INSTANCES=3 SPARK_WORKER_CORES=2 ./sbin/stop-worker.sh spark://LAPTOP-7DUT93OF.localdomain:7077

29. Or try:

kill -9 $(jps -l | grep spark | awk -F ' ' '{print $1}')
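To also bring down the master, Spark ships matching stop scripts alongside the start scripts in sbin:

./sbin/stop-master.sh   # stop the master daemon

./sbin/stop-all.sh      # or stop the master and all workers together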

Running some code:-

Let's load a CSV file with sales data and run a count check:

sales_data = spark.read.option("header", "true").csv("/mnt/e/Training_Data/5m-Sales-Records/5mSalesRecords.csv")

sales_data.count()
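A couple of handy follow-up calls once the DataFrame is loaded; these are standard DataFrame methods and assume nothing about the file beyond the header row used above:

sales_data.printSchema()   # columns inferred from the header row

sales_data.show(5)         # peek at the first five rows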

(Screenshots: running the job and checking the count; worker nodes, running jobs, and the executor view in the Spark web UI)

Conclusion:-

This is an extremely easy way to use Spark on a laptop or desktop running Windows 10.

We can follow the same steps on Ubuntu or any other Linux distribution.

You can also use Docker for the same approach and just spin up a pre-built image.

Please drop me an email for any suggestions or help.
