Install PySpark 3 on Google Colab the Easy Way

This tutorial shows how to set up a Spark environment on Google Colab, covering both the manual method (the not-so-easy way) and the automated method (the easy way).

Resources for this post:

Install PySpark on Google Colab – GrabNGoInfo.com

Let’s get started!

Method 1: Manual Installation – the Not-so-easy Way

Firstly, let’s talk about how to install Spark on Google Colab manually.

Step 1.1: Install Java, because Spark requires a Java Virtual Machine (JVM).

# Download Java Virtual Machine (JVM)
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
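
If you want to confirm the JVM is available before moving on, you can print the Java version from the install path used in the next step (a quick optional check; Colab's default Java may be a different version, which is why we point at the OpenJDK 8 path directly):

# Optional: confirm that OpenJDK 8 was installed at the expected path
!/usr/lib/jvm/java-8-openjdk-amd64/bin/java -version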

Step 1.2: Download the latest version of Apache Spark by following the steps below:

  1. Go to https://spark.apache.org/downloads.html
  2. Choose a Spark release version and a package type. The default is the latest version. When this tutorial was published, the latest Spark release was 3.2.1, and the package type was Pre-built for Apache Hadoop 3.2 and later.
  3. Click the link for downloading Spark (the blue link on spark-3.2.1-bin-hadoop3.2.tgz), and you will be directed to a new web page.
  4. Copy the first link on the web page (https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz), which is below the sentence “We suggest the following site for your download:”.
  5. Download Spark from the copied link.
  6. Unzip the file.

# Download Spark
!wget -q https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz

# Unzip the file
!tar xf spark-3.2.1-bin-hadoop3.2.tgz
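
To verify the archive unpacked where the next step expects it, you can list the extracted folder (an optional check):

# Optional: confirm the Spark folder was extracted to /content
!ls /content/spark-3.2.1-bin-hadoop3.2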

Step 1.3: Set up the environment for Spark.

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = '/content/spark-3.2.1-bin-hadoop3.2'

Step 1.4: Install and import the library for locating Spark.

# Install library for finding Spark
!pip install -q findspark

# Import the library
import findspark

# Initiate findspark
findspark.init()

# Check the location for Spark
findspark.find()

Output:

/content/spark-3.2.1-bin-hadoop3.2

Step 1.5: Start a Spark session, and check the session information.

# Import SparkSession
from pyspark.sql import SparkSession

# Create a Spark Session
spark = SparkSession.builder.master("local[*]").getOrCreate()

# Check Spark Session Information
spark

(Screenshot: Spark session information – GrabNGoInfo.com)
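
The builder above uses the defaults. If you want a recognizable application name, the builder also accepts appName(); the name below is just an example, and note that getOrCreate() returns the existing session if one is already running, so set the name when you first create the session:

# Optional: give the session an application name (example name, pick your own)
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("GrabNGoInfo-Colab") \
    .getOrCreate()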

Step 1.6: Test if Spark is installed successfully by importing a Spark library.

# Import a Spark function from library
from pyspark.sql.functions import col

If the code runs without an error message, Spark is installed successfully.
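
For a slightly stronger check, you can create a small DataFrame and select a column with col; the sample data and column names below are made up purely for illustration:

# Optional: quick end-to-end check with a tiny DataFrame
df = spark.createDataFrame([("a", 1), ("b", 2)], ["letter", "number"])
df.select(col("letter")).show()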

Method 2: Automatic Installation – the Easy Way

The second method of installing PySpark on Google Colab is to use pip install.

# Install pyspark
!pip install pyspark

After installation, we can create a Spark session and check its information.

# Import SparkSession
from pyspark.sql import SparkSession

# Create a Spark Session
spark = SparkSession.builder.master("local[*]").getOrCreate()

# Check Spark Session Information
spark
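
Because pip install pulls the latest release by default, it can be useful to confirm which Spark version the session is actually running:

# Check the Spark version installed by pip
print(spark.version)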

We can also test the installation by importing a Spark library.

# Import a Spark function from library
from pyspark.sql.functions import col

Which method to use?

You might wonder which method to use for your project. I suggest using pip install (the easy way) in most cases, and only considering the manual method if you need to customize certain settings during installation.
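
If your project needs a specific Spark release rather than the latest one, pip can also pin the version; for example, to match the release used in the manual method above:

# Install a specific PySpark version instead of the latest release
!pip install pyspark==3.2.1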

For more information about data science and machine learning, please check out my YouTube channel and Medium Page or follow me on LinkedIn.

Put All Code Together

#-----------------------------------------------------#
# Method 1: Manual Installation - the Not-so-easy Way
#-----------------------------------------------------#

# Download Java Virtual Machine (JVM)
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

# Download Spark
!wget -q https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz

# Unzip the file
!tar xf spark-3.2.1-bin-hadoop3.2.tgz

# Set up the environment
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = '/content/spark-3.2.1-bin-hadoop3.2'

# Install library for finding Spark
!pip install -q findspark

# Import the library
import findspark

# Initiate findspark
findspark.init()

# Check the location for Spark
findspark.find()

# Import SparkSession
from pyspark.sql import SparkSession

# Create a Spark Session
spark = SparkSession.builder.master("local[*]").getOrCreate()

# Check Spark Session Information
spark

# Import a Spark function from library
from pyspark.sql.functions import col

#-----------------------------------------------------#
# Method 2: Automatic Installation - the Easy Way
#-----------------------------------------------------#

# Install pyspark
!pip install pyspark

# Import SparkSession
from pyspark.sql import SparkSession

# Create a Spark Session
spark = SparkSession.builder.master("local[*]").getOrCreate()

# Check Spark Session Information
spark

# Import a Spark function from library
from pyspark.sql.functions import col

