How to Install Apache Spark on Ubuntu

This brief tutorial shows students and new users how to install Apache Spark on Ubuntu 20.04 | 18.04.

Apache Spark is an open source framework packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning and graph processing.

It can also analyze large amounts of data by distributing the work across a cluster and processing the data in parallel.

If you are a developer who needs to build complex data workflows, Apache Spark is a great place to start.

To get started with installing Apache Spark on Ubuntu, follow the steps below.

Install Java JDK

Apache Spark requires a Java JDK. On Ubuntu, the commands below install the distribution's default JDK.

sudo apt update
sudo apt install default-jdk

After installing, run the command below to verify the version of Java installed.

java --version

That should display lines similar to those shown below:

openjdk 11.0.10 2021-01-19
OpenJDK Runtime Environment (build 11.0.10+9-Ubuntu-0ubuntu1.20.04)
OpenJDK 64-Bit Server VM (build 11.0.10+9-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)
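
Different Spark releases support different Java versions, so it can help to confirm which major Java version is active before continuing. The snippet below is a minimal sketch and not part of the original steps: java_major is a hypothetical helper name, and it assumes version strings look like 11.0.10 or the older 1.8.0_282 style.

```shell
# java_major is a hypothetical helper, not a standard tool.
# It prints the major version for strings such as "11.0.10" or "1.8.0_282".
java_major() {
    _v=$1
    case "$_v" in
        1.*) _v=${_v#1.} ;;    # legacy scheme: 1.8.0_282 becomes 8.0_282
    esac
    echo "${_v%%[._-]*}"       # keep only the part before the first ".", "_" or "-"
}

if command -v java >/dev/null 2>&1; then
    # "java -version" writes to stderr and quotes the version string.
    raw=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
    echo "Java major version: $(java_major "$raw")"
else
    echo "java not found on PATH"
fi
```

If the reported major version is not one your Spark release supports, another JDK package can be installed with apt before moving on.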

Install Scala

Another package that you’ll need to run Apache Spark is Scala. To install it on Ubuntu, simply run the command below:

sudo apt install scala

To verify the version of Scala installed, run the command below:

scala -version

Doing that will display a line similar to the one below:

Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL

Install Apache Spark

Now that you have installed the required packages to run Apache Spark, continue below to install it.

Run the commands below to download Spark. The steps that follow use version 2.4.6 built for Hadoop 2.7.

cd /tmp
wget 

Next, extract the downloaded file and move it to the /opt directory.

tar -xvzf spark-2.4.6-bin-hadoop2.7.tgz
sudo mv spark-2.4.6-bin-hadoop2.7 /opt/spark
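
Before going further, it may be worth confirming that the archive was extracted completely. The following is a minimal sketch under the assumption that Spark now lives in /opt/spark; check_spark_dir is a hypothetical helper name, and it only looks for the launcher scripts that the later steps of this tutorial call.

```shell
# check_spark_dir is a hypothetical helper, not a Spark command.
# It prints "ok" when the launcher scripts used later in this tutorial
# exist under the given directory, or names the first one that is missing.
check_spark_dir() {
    for f in bin/spark-shell sbin/start-master.sh sbin/start-slave.sh; do
        if [ ! -f "$1/$f" ]; then
            echo "missing: $1/$f"
            return 1
        fi
    done
    echo "ok"
}

check_spark_dir /opt/spark || echo "Spark does not appear to be fully extracted under /opt/spark"
```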

Next, set environment variables so that Spark's commands can be run from any directory.

nano ~/.bashrc

Then add the lines below at the bottom of the file and save.

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
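
If you would rather script this step than edit the file by hand, the lines can be appended only when they are not already present, so that running the script twice does not duplicate them. The following is a minimal sketch: add_line_once is a hypothetical helper name, and the example writes to a scratch file as a dry run.

```shell
# add_line_once is a hypothetical helper: it appends a line to a file
# only if the file does not already contain that exact line.
add_line_once() {
    _file=$1
    _line=$2
    grep -qxF -- "$_line" "$_file" 2>/dev/null || printf '%s\n' "$_line" >> "$_file"
}

# Scratch file for a dry run; point rc_file at ~/.bashrc to apply it for real.
rc_file=$(mktemp)
add_line_once "$rc_file" 'export SPARK_HOME=/opt/spark'
add_line_once "$rc_file" 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin'
add_line_once "$rc_file" 'export SPARK_HOME=/opt/spark'   # repeated call adds nothing
cat "$rc_file"
```

The single quotes keep $PATH and $SPARK_HOME unexpanded, so they are written literally and evaluated each time the shell starts.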

After that, run the command below to apply your environment changes.

source ~/.bashrc

Start Apache Spark

At this point, Apache Spark is installed and ready to use. Run the command below to start the master process.

start-master.sh

Next, start the Spark worker process by running the command below.

start-slave.sh spark://localhost:7077

You can replace localhost with the server hostname or IP address. When the processes start, open your browser and browse to the server hostname or IP address on port 8080, the default port for the master web UI.
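
The steps above start the master and worker but do not show how to shut them down. Spark's sbin directory also ships stop-slave.sh and stop-master.sh alongside the start scripts. The following is a minimal sketch that assumes $SPARK_HOME/sbin is on your PATH as configured earlier; the stop_spark wrapper is a hypothetical name used only for illustration.

```shell
# stop_spark is a hypothetical wrapper around Spark's own stop scripts.
# It stops the worker first, then the master, and reports any script
# that cannot be found on PATH.
stop_spark() {
    for script in stop-slave.sh stop-master.sh; do
        if command -v "$script" >/dev/null 2>&1; then
            "$script" || echo "$script exited with an error"
        else
            echo "$script not found on PATH (is \$SPARK_HOME/sbin exported?)"
        fi
    done
}

stop_spark
```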


If you wish to connect to Spark via its command shell, run the command below:

spark-shell

The command above will launch the Spark shell.

Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.6
      /_/
         
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 11.0.10)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 
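
To leave the shell, type :quit at the prompt. For a quick non-interactive check that the installation can actually run a job, Spark's bin directory includes a run-example script that launches the bundled example programs. The following is a minimal sketch that assumes $SPARK_HOME/bin is on your PATH; spark_smoke_test is a hypothetical wrapper name.

```shell
# spark_smoke_test is a hypothetical wrapper around Spark's bundled run-example script.
spark_smoke_test() {
    if command -v run-example >/dev/null 2>&1; then
        # SparkPi estimates pi; the argument 10 is the number of partitions to use.
        # Log output goes to stderr and is discarded so only the result line remains.
        run-example SparkPi 10 2>/dev/null | grep "Pi is roughly" \
            || echo "SparkPi ran but printed no result line"
    else
        echo "run-example not found on PATH (is \$SPARK_HOME/bin exported?)"
    fi
}

spark_smoke_test
```

On a working installation this should print a single line beginning with "Pi is roughly".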

That should do it!

Conclusion:

This post showed you how to install Apache Spark on Ubuntu 20.04 | 18.04. If you find any errors above, please use the form below to report them.