Databricks Multi-Task Job Scheduling

Databricks job orchestration is a way to run a series of tasks automatically through a scheduling system. In this tutorial, you will learn:

  • How to create a Databricks job
  • How to run Databricks jobs manually and on a schedule
  • How to set up Databricks job alerts
  • How to manage Databricks job permissions
  • How to check Databricks job run status

Note that the job scheduler is only available in the paid version of Databricks. To learn how to upgrade from the Databricks free version, check out my tutorial on Databricks Community Edition Upgrade To Paid Plan AWS Setup. If you decide to skip the Databricks free community edition and start the Premium plan directly, check out my tutorial on Databricks AWS Account Setup.

Resources for this post:

Databricks Multi-Task Job Scheduler – GrabNGoInfo.com

Let’s get started!

Step 1: Create Tasks for the Job

In the first step, we will create a few notebooks containing the tasks for the job.

Let’s create an example job with four tasks implemented in four notebooks.

  • The first notebook contains the code for pulling data from one data source.
  • The second notebook contains the code for pulling data from another data source.
  • The third notebook merges the data from the two data sources.
  • The fourth notebook transforms the merged data and saves the transformed data.

To learn how to read data and save tables in Databricks, please refer to my tutorial on Five Ways To Create Tables In Databricks.
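
As a concrete sketch, the first notebook could look like the snippet below. This is a minimal PySpark example with a hypothetical source path and table name, not the exact code from this tutorial.

```python
# Notebook for task 1: pull data from the first source and save it as a table.
# The file path and table name are hypothetical placeholders.
df_source1 = (
    spark.read.format("csv")       # spark is predefined in Databricks notebooks
    .option("header", "true")
    .option("inferSchema", "true")
    .load("/mnt/raw/source1/")     # assumed mount point for the first source
)
df_source1.write.mode("overwrite").saveAsTable("source1_raw")
```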

Step 2: Create a Job

Step 2.1: Click the Jobs icon on the left pane.

Step 2.2: On the Jobs page, click the blue Create job button.

Step 2.3: Give your job a name. I gave it the name of “Job Example”.

Step 2.4: Create the first task.

Let’s give the first task a name.

Under Type, choose Notebook and select the notebook for the first task.

Under Cluster, keep the default New Job Cluster. A job cluster is different from an all-purpose cluster.

  • A job cluster automatically starts for a scheduled job and automatically terminates when the job is completed. We cannot restart a job cluster.
  • An all-purpose cluster can be manually terminated and restarted. Multiple users can share the same cluster and run interactive notebooks on it.

We can also add parameters for the task, add dependent libraries, and configure the retry and timeout policies if needed.
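
If you add task parameters, they reach the notebook through Databricks widgets. Here is a minimal sketch, assuming a parameter named run_date:

```python
# "run_date" is an assumed parameter name for illustration.
dbutils.widgets.text("run_date", "2022-01-01")  # default used for interactive runs
run_date = dbutils.widgets.get("run_date")      # a job parameter overrides the default
print(f"Running task for {run_date}")
```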

Click the blue Create button to create the first task.

Step 2.5: After the first task is created, click the blue plus (+) button to add the next task.

Step 2.6: Fill in the name and notebook path for the second task. Under Depends on, click the cross to remove the dependency on the first task. This is because the first two tasks pull data from different sources and can run in parallel.

After clicking the blue Create task button, we can see both task 1 and task 2, and there are no dependencies between the two tasks.

Step 2.7: Click the blue plus button and add the third task. Under Depends on, select both task 1 and task 2. This is because the third task merges the data, so it can only run after both task 1 and task 2 have finished pulling data.

Step 2.8: Click the blue Create task button and task 3 is added to the job with dependency on task 1 and task 2.

Step 2.9: Click the blue plus button to add the fourth task. The fourth task does the data processing and it depends on the output from task 3, so under Depends on, task 3 is selected.

Click the blue Create task button, and we have finished creating the job!
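
The same four-task job can also be created programmatically through the Jobs 2.1 REST API, which makes the task dependencies explicit. Below is a minimal sketch, not the exact job from the screenshots; the workspace URL, token, notebook paths, and cluster settings are assumptions you would replace with your own values.

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"         # assumed workspace URL
HEADERS = {"Authorization": "Bearer <personal-access-token>"}  # assumed token

# A small job cluster shared by all four task definitions (assumed settings).
new_cluster = {
    "spark_version": "10.4.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 1,
}

def notebook_task(key, path, depends_on=()):
    """Build one task entry; depends_on mirrors the Depends on field in the UI."""
    return {
        "task_key": key,
        "notebook_task": {"notebook_path": path},
        "new_cluster": new_cluster,
        "depends_on": [{"task_key": k} for k in depends_on],
    }

job_spec = {
    "name": "Job Example",
    "tasks": [
        notebook_task("task_1", "/Users/me/pull_source1"),
        notebook_task("task_2", "/Users/me/pull_source2"),
        notebook_task("task_3", "/Users/me/merge_sources", depends_on=("task_1", "task_2")),
        notebook_task("task_4", "/Users/me/transform_data", depends_on=("task_3",)),
    ],
}

resp = requests.post(f"{HOST}/api/2.1/jobs/create", headers=HEADERS, json=job_spec)
print(resp.json())  # returns {"job_id": ...} on success
```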

Step 3: Run Databricks Jobs

We can trigger a Databricks job run manually or use the job scheduler to run the job automatically on a fixed schedule.

Step 3.1: To create a job schedule, click the Edit schedule button under the Schedule section. For Schedule Type, select Scheduled. Then select the run frequency and time. I set up the job to run every day at 9:00 AM Eastern Time.

Click the blue Save button to save the schedule.

Step 3.2: To start a job run manually, click the blue Run now button.
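
Both actions also have REST equivalents. In a sketch that assumes the job_id returned by jobs/create, the daily 9:00 AM Eastern schedule becomes a Quartz cron expression, and Run now maps to the jobs/run-now endpoint:

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"         # assumed workspace URL
HEADERS = {"Authorization": "Bearer <personal-access-token>"}  # assumed token
JOB_ID = 123  # assumed; use the job_id returned when the job was created

# Schedule the job for 9:00 AM Eastern every day (Quartz cron syntax).
requests.post(
    f"{HOST}/api/2.1/jobs/update",
    headers=HEADERS,
    json={
        "job_id": JOB_ID,
        "new_settings": {
            "schedule": {
                "quartz_cron_expression": "0 0 9 * * ?",
                "timezone_id": "America/New_York",
                "pause_status": "UNPAUSED",
            }
        },
    },
)

# Trigger a manual run, equivalent to clicking the Run now button.
run = requests.post(f"{HOST}/api/2.1/jobs/run-now", headers=HEADERS, json={"job_id": JOB_ID})
print(run.json())  # returns {"run_id": ...}
```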

Step 4: Set up Databricks Job Alerts

Step 4.1: To set up a Databricks job alert, click the Edit alerts button under the Alerts section.

Step 4.2: We can add a list of emails to receive alerts on job start, success, or failure by clicking the blue Add button.

Step 4.3: I entered my email address and chose to receive failure alerts only.

Step 4.4: Click the blue Confirm button, and we will see the alerts settings under the Alerts section.
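
In the job's settings, these alerts live in the email_notifications block of the Jobs API. A minimal sketch, assuming the same hypothetical job_id and subscribing one address to failure alerts only:

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"         # assumed workspace URL
HEADERS = {"Authorization": "Bearer <personal-access-token>"}  # assumed token

requests.post(
    f"{HOST}/api/2.1/jobs/update",
    headers=HEADERS,
    json={
        "job_id": 123,  # assumed job id
        "new_settings": {
            "email_notifications": {
                # on_start and on_success lists can be added the same way
                "on_failure": ["me@example.com"],  # assumed address
            }
        },
    },
)
```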

Step 5: Manage Databricks Job Permissions

To manage Databricks job permissions, click the Edit permissions button under the Permissions section.

In the popup Permissions window, we can add users, groups, or service principals and grant them Can View, Can Manage Run, Can Manage, or Is Owner permissions (see the API sketch after the list below).

  • Can View means that the user can view the job run results.
  • Can Manage Run means that the user can trigger and cancel a job run.
  • Can Manage means that the user can view, manage, and edit a job run.
  • Is Owner means that the user owns the job. There is only one owner for each job.
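
These levels map to the Permissions API values CAN_VIEW, CAN_MANAGE_RUN, CAN_MANAGE, and IS_OWNER. A minimal sketch, with hypothetical user names and job id, that grants permissions via a PATCH request:

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"         # assumed workspace URL
HEADERS = {"Authorization": "Bearer <personal-access-token>"}  # assumed token

requests.patch(
    f"{HOST}/api/2.0/permissions/jobs/123",  # assumed job id
    headers=HEADERS,
    json={
        "access_control_list": [
            {"user_name": "viewer@example.com", "permission_level": "CAN_VIEW"},
            {"user_name": "operator@example.com", "permission_level": "CAN_MANAGE_RUN"},
        ]
    },
)
```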

Step 6: Check Databricks Job Run Status

Step 6.1: To check the Databricks job run status, click the Jobs icon on the left pane.

Step 6.2: On the Jobs page, we can see a list of jobs under the Jobs menu. For the example job we just created, it shows that the last run succeeded. We can trigger a new run by clicking the triangle or delete the job by clicking the trash can icon under Actions.

Step 6.3: Clicking the name of the job brings us to the page for that job. It shows the active runs, if there are any, and the runs completed in the past 60 days.

Step 6.4: Go back to the Jobs page by clicking the blue Jobs link in the top left corner. Click the blue Succeeded link to check the job run details.

We can see that every task is green for this job run. For a failed run, the failed tasks are shown in red with information about the failure reason.
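
Run status can also be pulled programmatically from the Runs API, which is handy for monitoring jobs outside the UI. A minimal sketch, assuming the same hypothetical job_id, that lists the most recent runs and prints each run's state:

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"         # assumed workspace URL
HEADERS = {"Authorization": "Bearer <personal-access-token>"}  # assumed token

resp = requests.get(
    f"{HOST}/api/2.1/jobs/runs/list",
    headers=HEADERS,
    params={"job_id": 123, "limit": 5},  # assumed job id; 5 most recent runs
)
for run in resp.json().get("runs", []):
    state = run["state"]
    # result_state appears only after a run finishes (e.g. SUCCESS or FAILED)
    print(run["run_id"], state["life_cycle_state"], state.get("result_state"))
```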

Summary

In this tutorial, you learned:

  • How to create a Databricks job
  • How to run Databricks jobs manually and on a schedule
  • How to set up Databricks job alerts
  • How to manage Databricks job permissions
  • How to check Databricks job run status

To learn more about Databricks, please check out my YouTube playlist for Databricks.

For more information about data science and machine learning, please check out my YouTube channel and Medium Page or follow me on LinkedIn.
