Five Ways To Create Tables In Databricks

Databricks supports managed and unmanaged tables. Unmanaged tables are also called external tables. This tutorial demonstrates five different ways to create tables in Databricks. It covers:

  • What’s the difference between managed and external tables?
  • How to mount an S3 bucket to Databricks and read a CSV file into a spark dataframe?
  • How to create a database in Databricks?
  • How to create a managed table from a spark dataframe using pySpark?
  • How to create a managed table from an existing table using SQL?
  • How to create a managed table from a spark dataframe using SQL?
  • How to create an unmanaged table from a spark dataframe using pySpark?
  • How to create an unmanaged table from a spark dataframe using SQL?
  • How to delete managed and external tables?

This tutorial uses Python as the default Databricks notebook language. The magic command %sql is used when a SQL command is needed.

Resources for this post:

Databricks Five Ways To Create Tables – GrabNGoInfo.com

Step 1: Managed vs. Unmanaged Tables

In step 1, let’s understand the difference between managed and external tables.

  • Managed Tables
    • Data management: Spark manages both the metadata and the data
    • Data location: Data is saved in the Spark SQL warehouse directory /user/hive/warehouse. Metadata is saved in a meta-store of relational entities.
    • Data deletion: The metadata and the data will be deleted after deleting the table.
  • Unmanaged/External Tables
    • Data management: Spark manages only the metadata; the data itself is not controlled by Spark.
    • Data location: A source data location is required to create the table.
    • Data deletion: Only the metadata will be deleted. The data saved in the external location remains intact (see the short sketch after this list).
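
As a quick preview of the two patterns, which are covered in detail in steps 4 and 7, here is a minimal pySpark sketch. It assumes a spark dataframe df and a database demo that we will create in later steps; the table names managed_example and external_example are placeholders for illustration.

# Managed table: Spark stores the data under the warehouse directory
df.write.format("delta").mode("overwrite").saveAsTable("demo.managed_example")

# Unmanaged/external table: we supply the storage path, and Spark only tracks the metadata
df.write.format("delta").mode("overwrite").option("path", "/mnt/demo4tutorial/output/external_example").saveAsTable("demo.external_example")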

Step 2: Mount S3 Bucket And Read CSV To Spark Dataframe

In step 2, we read in a CSV file from S3. For a complete guide on how to mount an S3 bucket to Databricks, please refer to my tutorial Databricks Mount To AWS S3 And Import Data.

We first need to import libraries. pyspark.sql.functions has the functions for pySpark. urllib is the package for handling URLs.

# pyspark functions
from pyspark.sql.functions import *

# URL processing
import urllib

Next, let’s read the CSV file with the AWS keys into Databricks. We specify the file type as CSV, indicate that the first row is the header and that the comma is the delimiter, and then pass in the path of the CSV file to load it.

# Define file type
file_type = "csv"

# Whether the file has a header
first_row_is_header = "true"

# Delimiter used in the file
delimiter = ","

# Read the CSV file to spark dataframe
aws_keys_df = spark.read.format(file_type)\
.option("header", first_row_is_header)\
.option("sep", delimiter)\
.load("/FileStore/tables/tutorial_demo.csv")

After that, we get the access key and secret key from the spark dataframe. The secret key is encoded using urllib.parse.quote for security purposes. safe="" means every character in the secret key is encoded.

# Get the AWS access key and secret key from the spark dataframe
ACCESS_KEY = aws_keys_df.where(col('User name')=='tutorial_demo').select('Access key ID').collect()[0]['Access key ID']
SECRET_KEY = aws_keys_df.where(col('User name')=='tutorial_demo').select('Secret access key').collect()[0]['Secret access key']

# Encode the secret key
ENCODED_SECRET_KEY = urllib.parse.quote(string=SECRET_KEY, safe="")

After getting the access key and secret key, it’s time to mount the S3 bucket! We can mount the bucket by passing in the S3 URL and the desired mount name to dbutils.fs.mount(). It returns True if the bucket is mounted successfully.

# AWS S3 bucket name
AWS_S3_BUCKET = "demo4tutorial"

# Mount name for the bucket
MOUNT_NAME = "/mnt/demo4tutorial"

# Source url
SOURCE_URL = "s3n://{0}:{1}@{2}".format(ACCESS_KEY, ENCODED_SECRET_KEY, AWS_S3_BUCKET)

# Mount the drive
dbutils.fs.mount(SOURCE_URL, MOUNT_NAME)
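
If the bucket has already been mounted, dbutils.fs.mount will raise an error. As a small optional safeguard (not part of the original workflow), we can first list the existing mount points with dbutils.fs.mounts() and only mount the bucket if it is not there yet.

# Mount the bucket only if it is not already mounted
if not any(mount.mountPoint == MOUNT_NAME for mount in dbutils.fs.mounts()):
  dbutils.fs.mount(SOURCE_URL, MOUNT_NAME)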

After checking the contents of the bucket using %fs ls, we can see that there are two folders in the bucket: data and output.

%fs ls "/mnt/demo4tutorial"

The folder data has the dataset we need for this tutorial.

%fs ls "/mnt/demo4tutorial/data"

Next, let’s read the dataset from the S3 bucket to the spark dataframe. We set the delimiter to be a comma, indicate that the first row is the header, and ask spark to infer the schema.

# File location and type
file_location = "/mnt/demo4tutorial/data/crypto_100k_records.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

# Take a look at the data
display(df)

Alternatively, we can use .csv to read CSV files.

# Use .csv to import CSV file
df = spark.read \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .csv(file_location)

# Take a look at the data
display(df)

To manually define the schema when reading in the data, we can use StructType to assign data types to the columns.

# Import libraries
from pyspark.sql.types import LongType, StringType, FloatType, DoubleType, DecimalType, StructType, StructField

# User-defined schema
userDefinedSchema = StructType([
  StructField("timestamp", LongType(), True), # LongType: Represents 8-byte signed integer numbers. The range of numbers is from # -9223372036854775808 to 9223372036854775807.
  StructField("Asset_id", StringType(), True),
  StructField("Count", FloatType(), True),   # FloatType: Represents 4-byte single-precision floating point numbers.
  StructField("Open", FloatType(), True),   
  StructField("High", DoubleType(), True),   # DoubleType: Represents 8-byte double-precision floating point numbers.
  StructField("Low", DoubleType(), True),       
  StructField("Close", DoubleType(), True),   
  StructField("Volume", DoubleType(), True),     
  StructField("VWAP", DoubleType(), True),   
  StructField("Target", DoubleType(), True)   
])

# Create spark dataframe
df = spark.read.format(file_type) \
  .schema(userDefinedSchema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

# Take a look at the data
display(df)

An alternative to StructType is a DDL-formatted string.

# Define DDL schema
DDLSchema = 'timestamp long, Asset_id string, Count float, Open float, High double, Low double, Close double, Volume double, VWAP double, Target double'

# Create spark dataframe
df = spark.read.format(file_type) \
  .schema(DDLSchema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

# Take a look at the data
display(df)

Step 3: Create Database In Databricks

In step 3, we will create a new database in Databricks. The tables will be created and saved in the new database. Using the SQL command CREATE DATABASE IF NOT EXISTS, a database called demo is created. SHOW DATABASES lists all the databases in Databricks. There are two databases available: the database named demo, which we just created, and the database named default, which Databricks created automatically.

%sql
-- Create database
CREATE DATABASE IF NOT EXISTS demo;

-- Show all available databases
SHOW DATABASES

The SQL command DESCRIBE DATABASE shows that the database demo is saved under dbfs:/user/hive/warehouse, and the owner is root.

%sql
-- Describe database information
DESCRIBE DATABASE demo;

Using SELECT CURRENT_DATABASE(), we can see that the current database is default.

%sql
-- Check the current database
SELECT CURRENT_DATABASE();

After changing the default database to demo, we can see that the current database shows demo now.

%sql
-- Change the current database
USE demo;

-- Check the current database
SELECT CURRENT_DATABASE();

To check the tables in the database, we can use the SQL command SHOW TABLES IN. It does not return any results because we have not created any tables yet.

%sql
-- Check tables in a database
SHOW TABLES IN demo

Step 4: Create Managed Table From Spark Dataframe Using pySpark (Method 1)

In step 4, we will create a managed table using pySpark. The spark dataframe is saved as a table named demo.crypto_1 in Delta format. Using the table name without the database name demo gives the same results because demo has been set as the current database.

The best practice is to write results to a Delta table. Data in Delta tables is stored in Parquet format but has an additional layer on top of it with advanced features.

If you get the error message AnalysisException: Can not create the managed table('demo.crypto_1'). The associated location('dbfs:/user/hive/warehouse/demo.db/crypto_1') already exists., it is because a table was created at the same location before. We can run spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation","true") before creating the table to get around the error.

# Allow creating a table in a non-empty location if the table has been created before
spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation","true")
 
# Create table
df.write.format("delta").mode("overwrite").saveAsTable("demo.crypto_1")

After creating the table, we can see that the table crypto_1 is in the database demo.

%sql
-- Specify current database
USE demo;

-- Show tables in current database
SHOW TABLES;

The SQL command DESCRIBE EXTENDED provides information about the table. It shows the column names, column data types, and comments, as well as detailed table information.

By default, the table is saved under dbfs:/user/hive/warehouse/, but we can change it to a different location.

%sql
-- Get table information
DESCRIBE EXTENDED crypto_1;

We can see that the type of table is MANAGED. The type of the table can also be checked using the command spark.catalog.listTables(). It shows that the table type is MANAGED as well.

# List table information
spark.catalog.listTables()
Out[7]: [Table(name='crypto_1', database='demo', description=None, tableType='MANAGED', isTemporary=False)]
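
To confirm the default warehouse directory mentioned above programmatically, we can also read the Spark SQL configuration. This is a small sketch; the exact value may differ depending on the workspace setup.

# Check the default location for managed tables
spark.conf.get("spark.sql.warehouse.dir")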

Step 5: Create Managed Table From Existing Table Using SQL (Method 2)

In step 5, we will create a managed table from an existing managed table using SQL. SQL queries can be run directly on the existing tables in the database.

%sql
-- SQL query to create table
CREATE OR REPLACE TABLE demo.crypto_2 AS
SELECT * 
FROM demo.crypto_1
WHERE Asset_ID = 1

Now the database demo has two tables, crypto_1 and crypto_2.

%sql
-- Specify current database 
USE demo;

-- Show tables in the current database
SHOW TABLES;

The DESCRIBE results show that crypto_2 is a managed table as well.

%sql
-- Describe table information
DESCRIBE EXTENDED crypto_2;

Step 6: Create Managed Table From Spark Dataframe Using SQL (Method 3)

In step 6, we will create a managed table from a spark dataframe using SQL. SQL code does not work on a spark dataframe directly, so we need to create a view for the dataframe and run the SQL code on the view.

Using the code createOrReplaceTempView, a temp view is created for the spark dataframe.

# Create a temp view
df.createOrReplaceTempView('df')

Next, a table named crypto_3 is created by querying the temp view.

%sql
-- Create table using SQL query
CREATE OR REPLACE TABLE crypto_3 AS
SELECT * FROM df

Using the magic command %sql is equivalent to running the same query through spark.sql.

# Use spark.sql to run SQL query
spark.sql(
'''
CREATE OR REPLACE TABLE crypto_3 AS
SELECT * FROM df
'''
)

An alternative way is to define a table with column names and column types, then insert the data into the table.

%sql
-- Create an empty table with column names and their types
CREATE OR REPLACE TABLE crypto_3 (
timestamp INT,
Asset_ID INT,
Count INT,
Open DOUBLE,
High DOUBLE,
Low DOUBLE,
Close DOUBLE,
Volume DOUBLE,
VWAP DOUBLE,
Target DOUBLE
);

-- Insert data into the table
INSERT INTO crypto_3
SELECT * FROM df

From the output of spark.catalog.listTables(), we can see that the three tables we created so far are all managed tables, and df is listed as a temporary table.

# List all tables and their information
spark.catalog.listTables()
Out[10]: [Table(name='crypto_1', database='demo', description=None, tableType='MANAGED', isTemporary=False),
 Table(name='crypto_2', database='demo', description=None, tableType='MANAGED', isTemporary=False),
 Table(name='crypto_3', database='demo', description=None, tableType='MANAGED', isTemporary=False),
 Table(name='df', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]

Step 7: Create Unmanaged Table From Spark Dataframe Using pySpark (Method 4)

In step 7, we will create an unmanaged table from a spark dataframe using pySpark. The syntax is very similar to the syntax for creating a managed table. The only difference is that we specify the table location as an external folder in the S3 bucket.

# Create external table 
df.write.format("delta").mode("overwrite").option("path", "/mnt/demo4tutorial/output/crypto_4").saveAsTable("crypto_4")

After the table is created, we can see that the S3 bucket has a new folder called crypto_4.

%fs ls "/mnt/demo4tutorial/output"

Step 8: Create Unmanaged Table From Spark Dataframe Using SQL (Method 5)

In step 8, we will create an external table from a spark dataframe using SQL. SQL code does not work on a spark dataframe directly, so we need to create a view for the dataframe and run the SQL code on the view.

Using the code createOrReplaceTempView, a temp view is created for the spark dataframe. There is no need to recreate the temp view if you have created it in step 6.

# Create a temp view
df.createOrReplaceTempView("df")

The only difference between creating a managed and an external table using SQL is the LOCATION clause. When an external location is specified in the SQL code, an unmanaged table will be created.

%sql
-- Create an external table
DROP TABLE IF EXISTS demo.crypto_5; 
CREATE TABLE demo.crypto_5 
USING delta
LOCATION "/mnt/demo4tutorial/output/crypto_5/"
SELECT * FROM df

We can also use OPTIONS to specify the path.

%sql
-- Create an external table
DROP TABLE IF EXISTS demo.crypto_5; 
CREATE TABLE demo.crypto_5 
USING delta
OPTIONS (path "/mnt/demo4tutorial/output/crypto_5/")
SELECT * FROM df

If we need to define the column types when creating the table, we can create an empty table with the column types first and then insert the data into it.

%sql
-- Create an external table using defined column types
DROP TABLE IF EXISTS demo.crypto_5; 
CREATE TABLE demo.crypto_5 (
timestamp INT,
Asset_id STRING,
Count INT,
Open DOUBLE,
High DOUBLE,
Low DOUBLE,
Close DOUBLE,
Volume DOUBLE,
VWAP DOUBLE,
Target DOUBLE
)
USING delta
LOCATION "/mnt/demo4tutorial/output/crypto_5/";

INSERT INTO demo.crypto_5
SELECT * FROM df

Step 9: Delete Managed And Unmanaged Tables

In step 9, we will talk about how to delete managed and unmanaged tables in Databricks.

Firstly, let’s check the tables we created in the database called demo. We can see that all five tables are in the database. The temp view df also appears in the output, although it is not actually stored in the demo database.

%sql
-- Show tables
SHOW TABLES IN demo

spark.catalog.listTables() shows that the tables crypto_1, crypto_2, and crypto_3 are managed tables, and the tables crypto_4 and crypto_5 are external tables.

# List table information
spark.catalog.listTables()
Out[14]: [Table(name='crypto_1', database='demo', description=None, tableType='MANAGED', isTemporary=False),
 Table(name='crypto_2', database='demo', description=None, tableType='MANAGED', isTemporary=False),
 Table(name='crypto_3', database='demo', description=None, tableType='MANAGED', isTemporary=False),
 Table(name='crypto_4', database='demo', description=None, tableType='EXTERNAL', isTemporary=False),
 Table(name='crypto_5', database='demo', description=None, tableType='EXTERNAL', isTemporary=False),
 Table(name='df', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]

Now let’s run DROP TABLE to drop the tables we created.

%sql
-- Drop tables
DROP TABLE IF EXISTS crypto_1;
DROP TABLE IF EXISTS crypto_2;
DROP TABLE IF EXISTS crypto_3;
DROP TABLE IF EXISTS crypto_4;
DROP TABLE IF EXISTS crypto_5;

We can see that all the tables are deleted from the database.

# List table information
spark.catalog.listTables()
Out[15]: [Table(name='df', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]

But the data of the unmanaged tables saved in external locations such as the S3 bucket still exists.
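
We can verify this by listing the output folder of the mounted bucket; the crypto_4 and crypto_5 folders created earlier should still be there.

%fs ls "/mnt/demo4tutorial/output"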

To delete the external table data from S3, we can use dbutils.fs.rm with recurse=True, which deletes a folder and all the files in it.

dbutils.fs.rm('/mnt/demo4tutorial/output/crypto_4/', recurse=True)
dbutils.fs.rm('/mnt/demo4tutorial/output/crypto_5/', recurse=True)
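
Optionally, if the mount point created in step 2 is no longer needed, we can also unmount the bucket. This step is not required for deleting the tables.

# Unmount the S3 bucket (optional cleanup)
dbutils.fs.unmount("/mnt/demo4tutorial")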

Summary

This tutorial demonstrated five different ways to create tables in Databricks. It covered:

  • What’s the difference between managed and external tables?
  • How to mount an S3 bucket to Databricks and read a CSV file into a spark dataframe?
  • How to create a database in Databricks?
  • How to create a managed table from a spark dataframe using pySpark?
  • How to create a managed table from an existing table using SQL?
  • How to create a managed table from a spark dataframe using SQL?
  • How to create an unmanaged table from a spark dataframe using pySpark?
  • How to create an unmanaged table from a spark dataframe using SQL?
  • How to delete managed and external tables?

For more information about data science and machine learning, please check out my YouTube channel and Medium Page or follow me on LinkedIn.
