To get started with Microsoft Azure Databricks, log into your Azure portal. If you do not have an Azure subscription, create a free account before you begin. To view previous posts please visit the following:
- What is Azure Databricks
- Getting started with Azure Databricks
- Creating Azure Databricks Clusters
- Azure Databricks Workloads
Azure Databricks Jobs
Jobs allow us to run a notebook or JAR file on a schedule. Like other task schedulers, Azure Databricks jobs can automate many kinds of work: an ETL task that copies data into a blob storage container, capturing streaming data from a sensor, or retraining a predictive model after new data is collected. Jobs automate these tasks and free you up to work on other things. Jobs can also send notifications when they succeed or run into issues. By default, all users can create and modify jobs; enabling job access control lets you restrict those permissions.
Creating a new job
On the left-hand side of Azure Databricks, click the Jobs icon, then on the Jobs page click Create Job. In the following image you can set the name (JOB4 in this example), set the task, set up a cluster, and schedule the timing.
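The same job can also be created programmatically through the Databricks Jobs REST API (`POST /api/2.0/jobs/create`). The sketch below only builds the request payload; the notebook path, runtime version, and node type are placeholder assumptions, not values from this walkthrough:

```python
import json

# Minimal sketch of a Jobs API 2.0 create-job payload.
# Notebook path, runtime version, and node type are placeholder assumptions.
job_payload = {
    "name": "JOB4",
    "new_cluster": {
        "spark_version": "5.3.x-scala2.11",   # assumed Databricks runtime
        "node_type_id": "Standard_DS3_v2",    # assumed Azure VM size
        "num_workers": 2,
    },
    "notebook_task": {
        # Hypothetical workspace path to the notebook the job should run
        "notebook_path": "/Users/me@example.com/my-etl-notebook",
    },
}

# The payload would be POSTed to
# https://<region>.azuredatabricks.net/api/2.0/jobs/create
# with an "Authorization: Bearer <personal-access-token>" header.
print(json.dumps(job_payload, indent=2))
```

Using the API in this way is handy when the same job definition needs to be recreated across several workspaces.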
Selecting Notebook in the task section opens a window for choosing a notebook in your workspace. Clicking Set JAR lets you drag and drop a JAR file and specify the Main Class. Configure spark-submit lets you set parameters to pass to the job as a JSON array of strings.
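To illustrate the spark-submit format: each command-line argument becomes one element of a JSON array of strings. The class name, JAR path, and input URL below are made-up placeholders:

```python
import json

# spark-submit parameters are supplied as a JSON array of strings,
# one element per command-line argument.
# The class name, JAR path, and input URL are hypothetical.
params = [
    "--class", "com.example.MyMainClass",
    "dbfs:/jars/my-etl-job.jar",
    "--input", "wasbs://data@myaccount.blob.core.windows.net/raw",
]

# This is the string you would paste into the Configure spark-submit box.
print(json.dumps(params))
```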
Configure the Cluster
Clicking Edit in the Cluster section brings up the Configure Cluster page. These are the same settings as when creating a new cluster. One thing to notice is that you cannot name this cluster: a job cluster is tied to its job and is spun up fresh on each run. Job clusters are also billed at the Data Engineering workload pricing.
Set the Schedule
Clicking Edit in the Schedule section shows the following window, which sets when the job runs. The lowest granularity is one minute and the highest is every 12 months. Job schedules can also be set using cron syntax, and more information can be found at this link.
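Databricks cron schedules use Quartz cron syntax, which adds a seconds field in front of the usual five fields. A few illustrative expressions (the times are chosen arbitrarily):

```python
# Quartz cron fields: second minute hour day-of-month month day-of-week
# Example schedules; the specific times are arbitrary illustrations.
schedules = {
    "every 15 minutes":  "0 0/15 * * * ?",
    "daily at 06:00":    "0 0 6 * * ?",
    "weekdays at 02:30": "0 30 2 ? * MON-FRI",
}

# Each Quartz expression here has six space-separated fields.
for name, expr in schedules.items():
    assert len(expr.split()) == 6, name
    print(f"{name}: {expr}")
```

Note the `?` placeholder, which Quartz uses when either day-of-month or day-of-week is left unspecified.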
Under Advanced you will find Alerts, Concurrent Runs, Timeout, Retries, and Permissions. These are fairly straightforward, but we will cover them here. Alerts sends emails on Start, Success, or Failure. If you want the job to run multiple times at once, Maximum Concurrent Runs can be increased up to 1000. Timeout is set in minutes and will terminate the job if it runs longer than the timeout period. Retries will restart the job if it fails, with a configurable wait between retries of 5 seconds to 3 hours. The Permissions section controls which users and groups can view, own, or manage jobs.
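These advanced options map onto fields of the Jobs API as well. A hedged sketch of what such a settings fragment might look like (the email address and all values are placeholder assumptions):

```python
# Advanced job options as they appear in a Jobs API payload.
# The email address and all numeric values are placeholder assumptions.
advanced_settings = {
    "email_notifications": {
        "on_start":   ["ops@example.com"],
        "on_success": ["ops@example.com"],
        "on_failure": ["ops@example.com"],
    },
    "max_concurrent_runs": 3,               # the UI allows up to 1000
    "timeout_seconds": 30 * 60,             # the UI takes minutes; the API takes seconds
    "max_retries": 2,
    "min_retry_interval_millis": 5 * 1000,  # wait between retries: 5 s up to 3 h
}

print(advanced_settings["timeout_seconds"])
```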
Clicking the job name on the Jobs page opens its details. This allows you to edit the Task, Parameters, Cluster, or Schedule. You can also view active runs and the history of completed runs. Below is an example with no active runs but three completed runs. These runs were launched manually; one succeeded and two failed.
Clicking the Run link shows the details of a failure, or the results of any output produced by the job. It also shows the task, parameters, cluster, duration, message, and status of the run.
The following shows the first run from the completed jobs. This failure was intentional, caused by running over the quota limit of cores in my Azure subscription. It also shows the detailed messages provided by the Databricks job.
Azure Databricks Jobs are an easy way to control and manage data workloads in your Azure environment. Being able to run ETLs, ingest streaming data, query large data sets efficiently, or train predictive models on a schedule is certainly a welcome addition.