To get started with Microsoft Azure Databricks, log in to your Azure portal. If you do not have an Azure subscription, create a free account before you begin. To view previous posts, please visit the following:
- What is Azure Databricks?
- Getting started with Azure Databricks
- Creating Azure Databricks Clusters
- Azure Databricks – Workloads
Libraries
Azure Databricks libraries let you integrate third-party or custom code into your workloads. These libraries can be shared between all users, workspaces, and clusters. Four types of libraries can be added, and each is covered in detail below. Sometimes a specific version or a piece of custom code cannot be loaded as a library. In that case, you can use init scripts during cluster initialization to load the packages.
Libraries are important to Azure Databricks because they provide additional functionality that can help solve problems. The idea is the same as in many other package management systems. However, this one allows multiple languages to be integrated into a Unified Analytics Platform.
The lifecycle of a library has the following four states: created, attached to a cluster, detached from a cluster, and deleted. When a library is created, it can be attached automatically to all clusters or to a single cluster. When a library is attached to a cluster, it shows a status of Attached. The same is true for detached libraries. If a library is no longer required, it can be moved to Trash. The library is not deleted until it is emptied from the Trash.
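The four lifecycle states can be pictured as a small state machine. The sketch below is only an illustrative model and not a Databricks API: the names LibraryState, ALLOWED_TRANSITIONS, and transition are hypothetical, the set of allowed moves is an assumption drawn from the description above, and the Trash step is folded into the deleted state for brevity.

```python
from enum import Enum


class LibraryState(Enum):
    CREATED = "created"
    ATTACHED = "attached"
    DETACHED = "detached"
    DELETED = "deleted"


# Assumed moves between states (illustrative only, not taken from Databricks).
ALLOWED_TRANSITIONS = {
    LibraryState.CREATED: {LibraryState.ATTACHED, LibraryState.DELETED},
    LibraryState.ATTACHED: {LibraryState.DETACHED},
    LibraryState.DETACHED: {LibraryState.ATTACHED, LibraryState.DELETED},
    LibraryState.DELETED: set(),
}


def transition(current, target):
    """Return the target state if the move is allowed, else raise ValueError."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


# Walk one library through its whole lifecycle.
state = LibraryState.CREATED
state = transition(state, LibraryState.ATTACHED)
state = transition(state, LibraryState.DETACHED)
state = transition(state, LibraryState.DELETED)
print(state.value)  # deleted
```

In this model a deleted library cannot be attached again, which mirrors the idea that emptying the Trash is final.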
Java / Scala Libraries
Java is an object-oriented language designed to be written once and run anywhere. Java has been a large player in computer programming for over 20 years (since 1995). More information can be found at Java’s developer website. Java libraries are compiled and uploaded as JAR files. The Scala language is also object-oriented and was designed to address limitations of Java and expand upon it. Scala has been around for 14 years (since 2004) and also compiles to JAR files. Here are a few common libraries.
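As a side note on the JAR format itself: a JAR file is a ZIP archive, so its contents can be checked before upload with standard tooling. The sketch below uses Python's standard zipfile module. The helper names list_jar_classes and build_demo_jar are hypothetical, and the demo archive holds placeholder entries only, not real compiled classes.

```python
import io
import zipfile


def list_jar_classes(jar_bytes):
    """Return the sorted .class entries found in a JAR given as bytes."""
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        return sorted(name for name in jar.namelist() if name.endswith(".class"))


def build_demo_jar():
    """Build a tiny in-memory archive with a JAR-like layout (placeholder contents)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as jar:
        jar.writestr("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\n")
        jar.writestr("com/example/Hello.class", b"")
    return buffer.getvalue()


print(list_jar_classes(build_demo_jar()))  # ['com/example/Hello.class']
```

For a JAR on disk, passing the file path directly to zipfile.ZipFile works the same way.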
Python Libraries
Python is an object-oriented language used for general-purpose programming. There is a large community around Python and the development of packages. Python has been available for 27 years (since 1991). These packages can be imported from the Python Package Index (PyPI) repository. Additionally, you can upload your own Egg file, which is a package of Python code that does not require compilation. More information can be found here. Below are common libraries for Python development.
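Whichever packages you install, it can help to confirm which version is actually present on the cluster before relying on it. The following is a minimal sketch, assuming Python 3.8 or later (for importlib.metadata). The helper names installed_version and check_pins are hypothetical, and the pinned versions passed in are whatever your project expects.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins):
    """Compare expected versions against what is installed.

    Returns a dict mapping package name to (expected, installed) for every
    package that is missing or has a different version.
    """
    problems = {}
    for name, expected in pins.items():
        actual = installed_version(name)
        if actual != expected:
            problems[name] = (expected, actual)
    return problems
```

Calling check_pins with a dict of package names and expected version strings in a notebook cell returns an empty dict when everything matches.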
Maven Libraries
Apache Maven is a software project management tool used for many Java projects and for other languages such as Scala, Ruby, or C#. Maven can manage a project’s build, reporting, and documentation from a central XML file called the Project Object Model (POM). Maven has been part of the Apache Software Foundation for 14 years (since 2004). Maven libraries are installed from Maven Central or from Spark Packages hosted on GitHub. Here are a few common Maven libraries on GitHub.
- https://github.com/databricks/spark-csv
- https://github.com/databricks/spark-sklearn
- https://github.com/nidi3/graphviz-java
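Maven artifacts are usually identified by a coordinate of the form groupId:artifactId:version. The sketch below splits such a string into its parts. It handles only the three-part form, the names MavenCoordinate and parse_coordinate are hypothetical, and the coordinate used is a made-up example, not a real artifact.

```python
from typing import NamedTuple


class MavenCoordinate(NamedTuple):
    group_id: str
    artifact_id: str
    version: str


def parse_coordinate(text):
    """Parse 'groupId:artifactId:version' into a MavenCoordinate."""
    parts = text.strip().split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected groupId:artifactId:version, got {text!r}")
    return MavenCoordinate(*parts)


# Hypothetical coordinate, used for illustration only.
coord = parse_coordinate("com.example:demo-lib_2.11:1.0.0")
print(coord.artifact_id)  # demo-lib_2.11
```

Validating the coordinate up front catches typos such as a missing version before a cluster install is attempted.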
R Libraries
The R language has been available for 25 years (since 1993) and is very popular for statistical computing. Along with Python, it is a tool of choice for many data scientists. Libraries can be uploaded from a custom-built R script or downloaded from the Comprehensive R Archive Network (CRAN). See a few common R libraries below.
- https://cran.r-project.org/web/packages/ggplot2/
- https://cran.r-project.org/web/packages/plyr/
- https://cran.r-project.org/web/packages/reshape2/
Conclusion
Libraries are required for many advanced processing tasks, and versions should be carefully examined and tested before they are added to a cluster. The libraries listed above are a small sample of the possibilities; there are thousands more. Before writing custom code, it is also worth searching to see whether it has already been developed. Because Databricks supports many languages, you can get started quickly with code you already use and know. Learn more about Libraries at https://docs.azuredatabricks.net/user-guide/libraries.html