Using Azure Databricks CLI

To get started with Microsoft Azure Databricks, log into your Azure portal. If you do not have an Azure subscription, create a free account before you begin.

Azure Databricks exposes two REST APIs, versions 2.0 and 1.2, which allow general administration and management of the different areas of your Databricks environment. To simplify ad-hoc tasks and content manipulation, Databricks has created a command line interface (CLI). In this post we will install the CLI and perform common tasks.

The Databricks command line interface allows for quick and easy interaction with the Databricks REST API. It is organized into the following sections: Workspace, Clusters, Groups, Jobs, Libraries, and Secrets. In this post we will review each command section with examples. To get started with the Databricks CLI you will need Python installed on your machine and a Databricks cluster.

Install Databricks CLI

The Databricks CLI requires Python 2.7.9 or above, or Python 3.6 or above. With Python in place, install the package and any dependencies with pip:

pip install databricks-cli

Before working with the Databricks CLI you will need to set up authentication. Please follow the instructions to set up a personal access token. Once you have a token, run the command

databricks configure --token

You will be prompted for the Databricks Host; in my case it was https://eastus2.azuredatabricks.net/. The next step is to paste in the personal access token. This completes the installation process. To verify everything is set up properly, we can run a quick test command.
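For example, listing the root of the Workspace is a minimal check (the root path here is just an example); if authentication is configured correctly it should return without an error:

databricks workspace list /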

Workspaces

A Workspace is the root folder of the Databricks environment, where your organization can store all notebooks and assets. By default the Workspace is available to all users, but each user has a private home folder. With the Databricks CLI, you can manage assets within the Workspace.

To add an existing notebook to the Workspace, the following command performs the import and returns the result. After the import subcommand, the SOURCE_PATH comes first and then the TARGET_PATH; the -l option specifies the language of the file.

databricks workspace import test.py /qa/test -l PYTHON

Now that we have imported a Python file, we can verify it exists by running the following command. This will show all files and folders in the qa folder.

databricks workspace list /qa

In addition, we can delete, export, and mkdirs using commands similar to the import command above.
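For example, here is a rough sketch using placeholder paths: create a folder, export the notebook back out to a local file, and then delete it from the Workspace.

databricks workspace mkdirs /qa/archive

databricks workspace export /qa/test test_export.py

databricks workspace delete /qa/test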

Clusters

Cluster commands allow for management of Databricks clusters. When creating a cluster, it is required to submit a JSON file or a JSON string. The following example shows what a command to create a cluster would look like; the contents of the cluster JSON file are listed afterwards.

databricks clusters create --json-file cluster.json
{
  "cluster_name": "demo-cluster",
  "spark_version": "5.1.x-scala2.11",
  "node_type_id": "Standard_DS3_v2",
  "autoscale": {
    "min_workers": 2,
    "max_workers": 8
  }
}
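After submitting the create command, a quick way to confirm the cluster exists and see the cluster id it was assigned is to list the clusters:

databricks clusters list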

Groups

The Groups API allows for management of user groups. Members can be added to and removed from a group, and groups can be created, deleted, and listed. Group permissions are only available with the Databricks Operational Security Package; currently this is an add-on package in AWS, but it is built into the cost with Azure. The following command will list all members in the admins group.

databricks groups list-members --group-name admins
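Related commands follow the same pattern; for instance, all groups in the workspace can be listed with:

databricks groups list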

There are five permission levels in Databricks: No Permissions, Read, Run, Edit, and Manage. Read allows viewing cells and making comments on notebooks. Run adds attaching and detaching notebooks from clusters and running notebooks. Edit adds to Run by allowing editing of cells. Manage can perform all actions and change the permissions of others. More details can be found at Workspace Access Control.

Jobs

The Jobs API allows for creating, editing, and running jobs. Jobs are more complicated than the other Databricks APIs, and many of the commands require passing JSON files or strings for configuration. Below is an example of getting information about the job with job id 2, along with the results.

databricks jobs get --job-id 2
{
  "job_id": 2,
  "settings": {
    "name": "DEV-JOB",
    "existing_cluster_id": "1120-025016-peep603",
    "email_notifications": {},
    "timeout_seconds": 0,
    "notebook_task": {
      "notebook_path": "/Users/sqlstack@outlook.com/Azure",
      "revision_timestamp": 0
    },
    "max_concurrent_runs": 1
  },
  "created_time": 1542681533998,
  "creator_user_name": "sqlstack@outlook.com"
}
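The job can also be triggered directly from the CLI. For example, using the job id from above, the following kicks off a run and returns a run id that can be used to check on its status:

databricks jobs run-now --job-id 2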

Libraries

Libraries are used to extend the functionality of Spark for a specific language. If you are coming from an R or Python background, libraries can be thought of as packages. This example installs the Azure Event Hubs library on our demo-cluster using Maven coordinates.

databricks libraries install --cluster-id 1120-025016-peep603 --maven-coordinates com.microsoft.azure:azure-eventhubs:2.2.0
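To confirm the library finished installing, the library status for the cluster can be checked, reusing the same cluster id:

databricks libraries cluster-status --cluster-id 1120-025016-peep603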

Secrets

Instead of directly entering credentials into a notebook, Databricks provides secrets to store credentials and reference them in notebooks and jobs. These can be thought of as SQL Server credentials or app settings in a web.config. Currently secrets are not supported in the web interface, but they are available through the CLI and REST API.

The first step is to create a secret scope, then add secrets to the scope; finally, they can be referenced in a notebook. The following commands will create a secret scope and then put a key into the secret store.
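Here is a minimal sketch; the scope and key names, demo-scope and sql-password, are only placeholders:

databricks secrets create-scope --scope demo-scope

databricks secrets put --scope demo-scope --key sql-password

The put command opens an editor for entering the secret value (it can also be supplied inline with --string-value). Inside a notebook the secret can then be read with dbutils.secrets.get(scope = "demo-scope", key = "sql-password").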
