To get started with Microsoft Azure Databricks, log into your Azure portal. If you do not have an Azure subscription, create a free account before you begin.
Workloads
There are two types of workloads available in Azure Databricks: Data Engineering and Data Analytics. They are similar in nature and perform the same types of operations, but Data Engineering is for scheduled operations while Data Analytics is for ad-hoc operations. In addition to the two workload types, there are two feature tiers, Standard and Premium. In this section we will review the features available, when to use the Standard or Premium tier, and how workloads are priced.
Standard Tier Features
Standard Data Engineering includes Apache Spark Clusters, a scheduler for running libraries and notebooks, alerting and monitoring, notebook workflows, and production streaming with monitoring. The Data Engineering workload is used for running scheduled jobs and will spin up and tear down a cluster for the duration of the job. This is great for streaming data, ETL, and repeatable actions.
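To make the "spin up, run, tear down" pattern concrete, here is a minimal sketch of submitting a one-off job run through the Databricks Jobs API, where a new cluster is created just for the run and terminated when it finishes. The workspace URL, token, cluster sizing, and notebook path are placeholders, not values from this post.

```python
# Sketch: submit a job run that creates its own cluster and tears it down
# when the run completes. Workspace URL, token, and notebook path are
# hypothetical placeholders.
import requests

workspace = "https://<your-workspace>.azuredatabricks.net"
token = "<personal-access-token>"

payload = {
    "run_name": "nightly-etl",
    "new_cluster": {                          # job cluster built for this run only
        "spark_version": "5.2.x-scala2.11",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2,
    },
    "notebook_task": {"notebook_path": "/ETL/nightly_load"},
}

resp = requests.post(
    f"{workspace}/api/2.0/jobs/runs/submit",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())  # contains a run_id you can poll for job status
```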
Standard Data Analytics includes everything above plus a few additional features. Apache Spark clusters can be persistent for analytics, with auto-scaling and multi-user sharing and collaboration. It also includes SQL, Python, R, and Scala notebooks, one-click visualizations, interactive dashboards, revision history, and GitHub integration. The Data Analytics workload is used for data investigation and collaboration and is great for presenting information. This workload is also designed for the data science group to analyze data and develop machine learning models. In this workload, the cluster is managed by the administrator.
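As a small illustration of the ad-hoc, notebook-driven style of this workload, the cell below reads a dataset and produces an aggregate that the notebook can render as a one-click visualization. The `spark` session and `display` function are provided by the Databricks runtime; the file path and column names are hypothetical.

```python
# Example of an ad-hoc analysis cell in a Databricks Python notebook.
# The mount path and column names below are hypothetical.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/mnt/raw/sales/2018/*.csv"))

# Quick aggregate suitable for a dashboard tile
summary = df.groupBy("region").sum("amount")
display(summary)  # renders as a one-click visualization in the notebook UI
```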
Premium Tier Features
Premium Data Engineering and Data Analytics both include the Standard features plus role-based access control and JDBC/ODBC endpoint authentication. Security and authentication are the deciding factors between Standard and Premium. If you will have multiple users in the environment, it is likely that some should only have access to perform specific operations. If that is the case, use Premium; otherwise Standard should be enough to get started.
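As a rough sketch of what JDBC/ODBC endpoint authentication looks like from a client, the snippet below connects to a cluster's ODBC endpoint using a personal access token. It assumes the Simba Spark ODBC driver is installed on the client machine; the host, HTTP path, and token are placeholders you would copy from the cluster's JDBC/ODBC settings.

```python
# Sketch: connect to a cluster's ODBC endpoint with token authentication.
# Assumes the Simba Spark ODBC driver is installed; host, HTTP path, and
# token are placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver=Simba Spark ODBC Driver;"
    "Host=<your-workspace>.azuredatabricks.net;"
    "Port=443;"
    "SSL=1;"
    "ThriftTransport=2;"          # HTTP transport
    "HTTPPath=<cluster-http-path>;"
    "AuthMech=3;"                 # user/password auth, used here for tokens
    "UID=token;"
    "PWD=<personal-access-token>;",
    autocommit=True,
)

for row in conn.execute("SELECT 1"):
    print(row)
```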
Workload and Tier Pricing
Pricing for each workload is based on the number of Databricks Units (DBUs) consumed by the selected VM instance. Each workload and tier has its own price, and current pricing can be found on the Azure Databricks pricing page. As of November 2018, the pricing per DBU is as follows. Keep in mind that these base rates are multiplied by the DBU consumption of the virtual machine instance selected (see the worked example after the table).
| Workload | Standard | Premium |
| --- | --- | --- |
| Data Engineering | $0.20 / DBU-hour | $0.35 / DBU-hour |
| Data Analytics | $0.40 / DBU-hour | $0.55 / DBU-hour |
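Here is a back-of-the-envelope estimate using the November 2018 rates above. The DBU consumption per node-hour varies by VM size; the 0.75 DBU/hour figure for a Standard_DS3_v2 is an illustrative assumption, and the VM compute charges themselves are billed separately by Azure.

```python
# Rough DBU cost estimate for a Standard-tier Data Engineering job.
# dbus_per_node_hour is an assumed value for illustration only.
dbu_rate = 0.20           # $/DBU-hour, Standard Data Engineering
dbus_per_node_hour = 0.75 # assumed DBU consumption of the chosen VM size
nodes = 4                 # 1 driver + 3 workers
hours = 2                 # job duration

dbu_cost = dbu_rate * dbus_per_node_hour * nodes * hours
print(f"Estimated DBU charge: ${dbu_cost:.2f}")  # $1.20
```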
Data Engineering workloads will show up under Job Clusters and are created while the job is running. Each time a job runs, a new cluster is created. Once the job is complete, the cluster shows up as terminated. In the image below, the error message is caused by my subscription being over its quota. This is an example of why knowing your subscription quota is important.
Data Analytics workloads show up under the Interactive Clusters section. These clusters are created and managed manually by the Azure Databricks administrator.
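For completeness, here is a sketch of creating the kind of persistent, auto-scaling cluster that appears under Interactive Clusters, using the Clusters API. The workspace URL, token, and sizing values are placeholders, not recommendations.

```python
# Sketch: create a persistent interactive cluster through the Clusters API.
# Workspace URL, token, and sizing values are placeholders.
import requests

workspace = "https://<your-workspace>.azuredatabricks.net"
token = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "analytics-shared",
    "spark_version": "5.2.x-scala2.11",
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {"min_workers": 2, "max_workers": 8},  # auto-scaling range
    "autotermination_minutes": 60,  # shut down when idle to avoid DBU charges
}

resp = requests.post(
    f"{workspace}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])  # the new interactive cluster's ID
```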