Databricks GitHub Repo Operations

Databricks supports Git integration. In this tutorial, we will walk through common GitHub repo operations on the Databricks platform. You will learn how to:

  • Enable files other than notebooks in a Databricks repo
  • Create a branch on Databricks
  • Create a notebook in a Databricks repo
  • Push updates to a remote repository
  • Create and merge a pull request
  • Pull remote repository changes into Databricks

Resources for this post:

If you prefer the video version of the tutorial, please check out the video on YouTube.

Step 0: Enable Files In Repos

By default, Databricks Repos syncs only notebooks with a remote Git repository. To allow other file types such as .py, .csv, and .md in the repo, we need to change the Repos settings in the Admin Console.

Step 0.1: Open the Admin Console by clicking Settings, then Admin Console.

Step 0.2: Click Workspace Settings on the menu.

Step 0.3: Go to the Repos section and enable Files in Repos. After it is enabled, we can sync any file type with the remote Git repository. We can also view and edit text files in the UI.

Step 1: Create A Branch On Databricks

In the first step, we will create a branch on Databricks.

Step 1.1: Click Repos, then the user name, and main.

Step 1.2: Type a branch name and press the enter key to create the branch.

Step 1.3: To switch between branches, click the downward arrow and select the branch name. Here we selected test_branch. Click the Close button to close the window.

Step 2: Create A Notebook In Databricks Repo

In the second step, we will create a new notebook in the test branch.

There are three ways to create a new notebook in the repo: creating a notebook from scratch, importing an existing notebook, or cloning a notebook from the Workspace.

Step 2.1: To create a notebook from scratch, go to Repos, user name, click the downward arrow near the branch name, select Create, then Notebook.

Step 2.2: To import an existing notebook, go to Repos, user name, click the downward arrow near the branch name, then select Import.

We can either import a file or import from a URL.

Step 2.3: To clone a notebook from the Workspace, go to Workspace, then user name. Click the downward arrow next to the notebook name and select Clone.

A new window will pop up. Give the cloned notebook a new name, then select Repos, user name, and the repo name. Click the blue Clone button to clone the notebook.

Step 2.4: To check whether the notebook has been created, go to Repos, user name, then click the repo name. We can see that the notebook has been successfully cloned to the branch.
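
The import in Step 2.2 can also be scripted. The following is a minimal sketch, assuming the Databricks Workspace REST API import endpoint (POST /api/2.0/workspace/import) accepts a path under /Repos. The workspace URL, token, repo path, and notebook contents are placeholders, not values from this tutorial. The request is only built, not sent.

```python
import base64
import json
import urllib.request


def build_import_request(host, token, repo_path, source_code):
    """Build (but do not send) a Workspace API request that imports a Python notebook."""
    payload = {
        "path": repo_path,       # hypothetical target path inside the repo
        "format": "SOURCE",      # the content is plain source code
        "language": "PYTHON",
        # The API expects the file content to be base64-encoded.
        "content": base64.b64encode(source_code.encode("utf-8")).decode("ascii"),
        "overwrite": False,
    }
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/workspace/import",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; replace them with a real workspace URL, token, and path.
import_request = build_import_request(
    host="https://example.cloud.databricks.com",
    token="<personal-access-token>",
    repo_path="/Repos/someone@example.com/my-repo/example_notebook",
    source_code="# Databricks notebook source\nprint('hello')\n",
)
print(import_request.get_method(), import_request.full_url)
```

Passing import_request to urllib.request.urlopen would send it. That call is left out here so the sketch has no side effects.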

Step 3: Push Updates To Remote Repository

In the third step, we will push updates to the remote repository.

Step 3.1: Click the branch name to open the repo window. We can see that the newly added example notebook is shown as the changed file. Provide a summary and click the blue Commit & Push button.
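
For readers more familiar with the Git command line, the Commit & Push button corresponds roughly to the commands below, run in a local clone of the same repository. This is a sketch of the equivalent workflow, not what Databricks runs internally. The commit message is a made-up example, and run_commands is a hypothetical helper.

```python
import shlex
import subprocess


def commit_and_push_commands(branch, message):
    """Return git commands that roughly mirror the Commit & Push button."""
    return [
        ["git", "add", "--all"],             # stage every changed file
        ["git", "commit", "-m", message],    # use the summary as the commit message
        ["git", "push", "origin", branch],   # send the branch to the remote
    ]


def run_commands(commands, cwd):
    """Run each command in order inside a local clone, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=cwd, check=True)


commands = commit_and_push_commands("test_branch", "Add example notebook")
for command in commands:
    print(shlex.join(command))  # print only; call run_commands(commands, cwd=...) to execute
```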

Step 4: Create And Merge Pull Request

Step 4.1: Go to the GitHub repo that is connected to Databricks, and you will see a message about the push. Click the green Compare & pull request button to open a pull request.

Step 4.2: On the Open a pull request page, click the green Create pull request button to create a pull request.

Step 4.3: After the pull request is approved, click the green Merge pull request button.

Then click the green Confirm merge button.

Step 4.4: After the merge is confirmed, we will see a purple Merged button showing that the pull request is successfully merged and closed. We can click the Delete branch button to delete the branch.

Step 4.5: Click the repository name, and we can see the notebook show up in the repository.

Step 4.6: Go to Databricks and switch to the main branch. The newly added notebook now shows up in the main branch.
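
Steps 4.1 to 4.3 can also be done through the GitHub REST API instead of the web page. The sketch below builds the two requests (create a pull request, then merge it) without sending them. The owner, repository name, token, title, and pull request number are placeholders. In practice the number comes from the response to the first request.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def _github_request(method, path, token, payload):
    """Build (but do not send) an authenticated JSON request to the GitHub REST API."""
    return urllib.request.Request(
        url=f"{API_ROOT}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        method=method,
    )


def build_create_pr_request(owner, repo, token, head, base, title):
    """Request that opens a pull request from the head branch into the base branch."""
    payload = {"title": title, "head": head, "base": base}
    return _github_request("POST", f"/repos/{owner}/{repo}/pulls", token, payload)


def build_merge_pr_request(owner, repo, token, pull_number):
    """Request that merges an open pull request with a regular merge commit."""
    payload = {"merge_method": "merge"}
    path = f"/repos/{owner}/{repo}/pulls/{pull_number}/merge"
    return _github_request("PUT", path, token, payload)


# Placeholder values; "octocat/my-repo" and the pull request number 1 are hypothetical.
create_request = build_create_pr_request(
    "octocat", "my-repo", "<token>", "test_branch", "main", "Add example notebook"
)
merge_request = build_merge_pr_request("octocat", "my-repo", "<token>", 1)
print(create_request.get_method(), create_request.full_url)
print(merge_request.get_method(), merge_request.full_url)
```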

Step 5: Pull Remote Repository Changes Into Databricks

In step 5, let’s make some changes to the README file on GitHub, and pull the changes into Databricks.

Step 5.1: Go to the GitHub repository that is connected to Databricks and click the pencil icon for the README file.

Step 5.2: Add the sentence “This is a new sentence added in GitHub.” and click the green Commit changes button.

Step 5.3: Go back to Databricks and click Repos, user name, then main. In the popup window, click Pull.

Step 5.4: A warning window pops up and asks whether we want to proceed with pulling. Click the blue Confirm button to start pulling the changes.

Step 5.5: After the pull finishes, close the repo window, then click the README file.

We can see that the new sentence we added on GitHub has been pulled into Databricks.
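
The pull in Steps 5.3 and 5.4 can also be triggered without the UI. The following is a minimal sketch, assuming the Databricks Repos REST API update endpoint (PATCH /api/2.0/repos/{repo_id}) checks out the named branch and brings it to the latest remote commit. The workspace URL, token, and repo ID are placeholders. The numeric ID of a repo can be looked up with GET /api/2.0/repos. The request is only built, not sent.

```python
import json
import urllib.request


def build_repo_update_request(host, token, repo_id, branch):
    """Build (but do not send) a Repos API request that updates a repo to a branch head."""
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/repos/{repo_id}",
        data=json.dumps({"branch": branch}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="PATCH",
    )


# Placeholder values; the repo ID 123456 is hypothetical.
update_request = build_repo_update_request(
    host="https://example.cloud.databricks.com",
    token="<personal-access-token>",
    repo_id=123456,
    branch="main",
)
print(update_request.get_method(), update_request.full_url)
```

Under the same assumption, passing a different branch name such as test_branch would switch the repo to that branch, which mirrors Step 1.3.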

Summary

In this tutorial, we walked through common GitHub repo operations on the Databricks platform. You learned how to:

  • Enable files other than notebooks in a Databricks repo
  • Create a branch on Databricks
  • Create a notebook in a Databricks repo
  • Push updates to a remote repository
  • Create and merge a pull request
  • Pull remote repository changes into Databricks

For more information about data science and machine learning, please check out my YouTube channel and Medium Page or follow me on LinkedIn.
