Creating Azure Databricks Clusters

Azure Databricks clusters are groups of virtual machines that process Spark jobs. The basic architecture of a cluster consists of a Driver Node (labeled Driver Type on the Create Cluster page), which controls the jobs sent to the Worker Nodes (Worker Type). The size of each node is based on the available Azure Virtual Machine sizes. The main deciding factor is the kind of workload you plan to run. If the workload is I/O heavy, Storage Optimized worker nodes are a good choice. If the focus is on processing, select Compute Optimized. Most of the examples here use General Purpose.
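The same choices can be written down as a cluster specification. The sketch below is a minimal illustration, not the exact configuration used in this post: the field names follow the Databricks Clusters REST API, while the VM sizes, the runtime version string, the cluster name, and the helper build_cluster_spec are assumptions chosen for the example. Check which node types are actually offered in your region before relying on them.

```python
# Illustrative mapping from workload profile to an Azure VM size.
# These sizes are examples only; availability varies by region.
NODE_TYPE_BY_WORKLOAD = {
    "general": "Standard_DS3_v2",  # General Purpose
    "compute": "Standard_F4s",     # Compute Optimized
    "storage": "Standard_L4s",     # Storage Optimized
}


def build_cluster_spec(name, workload, min_workers, max_workers):
    """Build a cluster spec dict (hypothetical helper for this example)."""
    node_type = NODE_TYPE_BY_WORKLOAD[workload]
    return {
        "cluster_name": name,
        "spark_version": "13.3.x-scala2.12",  # placeholder runtime version
        "node_type_id": node_type,            # Worker Type
        "driver_node_type_id": node_type,     # Driver Type
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        "autotermination_minutes": 60,        # assumed idle timeout
    }


spec = build_cluster_spec("demo-cluster", "general", 2, 8)
print(spec["node_type_id"])
```

Keeping the workload-to-size mapping in one place makes it easy to try the same cluster on a different VM family later.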


Virtual machines are spun up behind the scenes as the Driver and Worker Nodes. In this example, three virtual machines are created in the subscription: one for the Driver and two for the Worker Nodes. Max Workers is set to 8, and Databricks will automatically spin up additional virtual machines, up to that limit, to meet demand. Verify that your subscription's vCPU quota can handle the maximum number of workers.
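The quota check is simple arithmetic: the driver plus the maximum number of workers, multiplied by the vCPUs of each node. The sketch below assumes 4 vCPUs per node (the size of a Standard_DS3_v2) and uses made-up quota figures; substitute the values shown for your own subscription.

```python
def required_vcpus(vcpus_per_node, max_workers, driver_vcpus=None):
    """vCPUs needed when the cluster is scaled out to Max Workers."""
    if driver_vcpus is None:
        driver_vcpus = vcpus_per_node  # assume the driver uses the same size
    return driver_vcpus + max_workers * vcpus_per_node


def fits_quota(needed, quota_limit, quota_in_use):
    """True if the remaining quota covers the fully scaled cluster."""
    return needed <= quota_limit - quota_in_use


needed = required_vcpus(vcpus_per_node=4, max_workers=8)
print(needed)  # 36
# Hypothetical quota: limit of 50 vCPUs with 20 already in use.
print(fits_quota(needed, quota_limit=50, quota_in_use=20))  # False
```

In this hypothetical case the cluster could not reach 8 workers, so either the quota would need to be raised or Max Workers lowered.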


Databricks Units (DBUs)

DBUs are units of processing capability per hour that are billed to your Azure subscription based on the virtual machines provisioned in the cluster. There are two workloads built into the Databricks engine, and it is important to understand the difference. The Data Engineering workload is when a job starts and stops the cluster on which it runs. For example, a scheduled job will spin up a new Apache Spark cluster, run the job, and tear down the cluster once completed. This type of workload is cheaper than the other and is used when interactivity is not required. The other workload type is called Data Analytics and covers commands run on an Apache Spark cluster that was not automatically created by a scheduled job. It is intended for running ad-hoc analysis, exploring data, training new models, or collaborating with colleagues.
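To see how the workload type affects the bill, here is a rough cost sketch. It assumes the hourly charge per virtual machine is the VM price plus the DBUs that VM consumes multiplied by the DBU price for the workload. Every rate below is a placeholder invented for the arithmetic, not a real Azure price; take actual figures from the Azure Databricks pricing page.

```python
def hourly_cost(num_vms, vm_price, dbu_per_vm, dbu_price):
    """Estimated cluster cost per hour under the assumed pricing model."""
    return num_vms * (vm_price + dbu_per_vm * dbu_price)


# Placeholder rates, for illustration only.
VM_PRICE = 0.30         # currency units per VM per hour
DBU_PER_VM = 0.75       # DBUs consumed per VM per hour
DBU_PRICE = {
    "data_engineering": 0.20,  # automated job clusters
    "data_analytics": 0.40,    # interactive clusters
}

engineering = hourly_cost(3, VM_PRICE, DBU_PER_VM, DBU_PRICE["data_engineering"])
analytics = hourly_cost(3, VM_PRICE, DBU_PER_VM, DBU_PRICE["data_analytics"])
print(f"{engineering:.2f}")  # 1.35
print(f"{analytics:.2f}")    # 1.80
```

With identical hardware, only the DBU price changes between the two workloads, which is why scheduled jobs come out cheaper.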

Starting the Cluster

After clicking Create Cluster, the virtual machines and additional resources will begin to be created. The status of the cluster will change to Pending, and creation should complete within a few minutes. Once complete, the status will be set to Running and the cluster will begin incurring charges to your subscription. You can then view the new resources and create notebooks that attach to the cluster.
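If you start clusters from a script rather than the UI, the Pending to Running transition can be watched with a small polling loop. The sketch below is deliberately decoupled from any client library: get_state is a hypothetical callable you would supply (for example, one that reads the state field returned by the Clusters API), and the state names are the ones that API reports.

```python
import time


def wait_until_running(get_state, poll_seconds=10.0, timeout_seconds=600.0,
                       sleep=time.sleep):
    """Poll get_state() until the cluster is RUNNING; return seconds waited."""
    waited = 0.0
    while True:
        state = get_state()
        if state == "RUNNING":
            return waited
        if state in ("TERMINATING", "TERMINATED", "ERROR"):
            raise RuntimeError(f"cluster failed to start (state: {state})")
        if waited >= timeout_seconds:
            raise TimeoutError(f"cluster still {state} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Simulated state sequence so the example runs without a workspace.
states = iter(["PENDING", "PENDING", "RUNNING"])
elapsed = wait_until_running(lambda: next(states), sleep=lambda seconds: None)
print(elapsed)  # 20.0
```

Passing the sleep function as a parameter keeps the loop easy to test, as the simulated run above shows.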



Cluster Management


Now that we have a cluster started, we can manage it with the Edit, Clone, Terminate, and Delete commands. Edit allows changing the cluster name, Runtime Version, Python Version, Driver and Worker Types, and Auto Termination. Any edits will require restarting the cluster.

Clone will create a copy of the cluster configuration and give the new cluster the same name with a (clone) suffix. Cluster permissions and attached notebooks are not carried over to the clone, so review them, along with the installed libraries, on the new cluster. Cloning can be helpful when you want to test a feature on a different Runtime Version or increase the number of workers for performance testing.

Terminate will stop the cluster, and any attached notebooks will be detached. Permission settings, configurations, and libraries will remain associated with the cluster. Charges to your subscription will stop while the cluster is terminated.

Delete will permanently remove the cluster from your subscription. This action cannot be undone.
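For scripted management, Edit, Terminate, and Delete each correspond to an endpoint in the Databricks Clusters REST API (version 2.0). The sketch below only builds the request URL and JSON body and sends nothing. The workspace URL, cluster ID, and the helper build_request are hypothetical. Note that the terminate operation is exposed at the endpoint named delete, while permanent removal uses permanent-delete, and that a real edit call expects the full cluster specification rather than a single changed field.

```python
import json

# Management actions mapped to Clusters API 2.0 paths.
ENDPOINTS = {
    "edit": "/api/2.0/clusters/edit",
    "terminate": "/api/2.0/clusters/delete",
    "delete": "/api/2.0/clusters/permanent-delete",
}


def build_request(workspace_url, action, cluster_id, **settings):
    """Return (url, json_body) for a POST request; nothing is sent."""
    if action not in ENDPOINTS:
        raise ValueError(f"unsupported action: {action}")
    payload = {"cluster_id": cluster_id, **settings}
    return workspace_url.rstrip("/") + ENDPOINTS[action], json.dumps(payload)


# Hypothetical workspace URL and cluster ID.
url, body = build_request(
    "https://adb-1234567890123456.7.azuredatabricks.net/",
    "terminate",
    "0000-000000-example",
)
print(url)
print(body)
```

The returned pair can be handed to whichever HTTP client you prefer, together with an authorization header for your workspace.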

In the next post we will review the different Workloads available in Azure Databricks.



