Azure Databricks is an Apache Spark-based big data analytics and machine learning framework optimized for the Microsoft Azure Cloud. Databricks is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.
This repository consists of seven labs that are designed to help you to understand how to use Azure Databricks for different use cases, including:
- Analysis of structured and unstructured data.
- Analysis of streaming data.
- Machine Learning (ML) and Machine Learning Operations (MLOps).
The seven labs are implemented as seven independent Azure Databricks notebooks as described below.
This tutorial helps you understand how to use Azure Databricks Spark to prepare raw data for analytics.
This tutorial helps you understand the capabilities and features of Spark SQL and the various performance options provided by Azure Databricks.
This tutorial helps you understand the capabilities and features of Azure Spark MLlib for machine learning. It shows how to construct the end-to-end process for building and refining a machine learning model.
This tutorial helps you understand Azure Databricks Spark Structured Streaming. It shows the end-to-end process, starting with data ingestion into an Azure Databricks cluster in near real-time, through analysis of the streaming data and integration with machine learning.
This tutorial helps you understand the features and capabilities of Azure Databricks Delta. Azure Databricks Delta is a next-generation unified analytics engine built on Apache Spark™. It provides ACID transactions, optimized layouts, and indexes to enable big data use cases, from batch and streaming ingestion and fast interactive queries to machine learning.
This tutorial helps you understand how Azure Data Factory (ADF) can be used with Azure Databricks to create and automate pipelines.
This tutorial helps you understand how to use the MLOps approach to automate machine learning with Azure Databricks. MLOps is a practice for collaboration and communication between data scientists and operations professionals to help manage the production ML (or deep learning) lifecycle. MLOps in Azure is provided through the integration of two Azure services:
- Azure Machine Learning Service (Azure ML Service)
- Azure DevOps
Follow these instructions to download the Azure Databricks lab notebooks from this repository and import them into Azure Databricks.
1. In the Azure portal, select Create a resource > Data + Analytics > Azure Databricks.
2. Under Azure Databricks Service, provide the values to create an Azure Databricks workspace.
3. Provide the following values:
a. Workspace name : A name for your Azure Databricks workspace
b. Subscription : From the drop-down, select your Azure subscription.
c. Resource group : Specify whether you want to create a new resource group or use an existing one. A resource group is a container that holds related resources for an Azure solution. For more information, see Azure Resource Group overview.
d. Location : Specify the location of your Azure Databricks cluster.
e. Pricing Tier : Choose between Standard or Premium.
You can select the Pin to Dashboard option to pin the resource to the Azure Dashboard.
Then click Create.
4. The workspace creation takes a few minutes. During workspace creation, the portal displays the "Submitting deployment for Azure Databricks" tile on the right side. You may need to scroll right on your dashboard to see the tile. There is also a progress bar displayed near the top of the screen. You can watch either area for progress.
5. In the Azure portal, go to the Azure Databricks workspace that you created, and then click Launch Workspace.
6. You are now inside your Databricks workspace.
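The portal values above (workspace name, location, pricing tier) can also be written down as an Azure Resource Manager resource definition, which is handy for repeating the setup. The following is a minimal sketch, not the method used by these labs. The resource type `Microsoft.Databricks/workspaces`, the `sku` block, and the `managedResourceGroupId` property follow the public ARM schema as far as we know; the API version, workspace name, region, and resource IDs shown are illustrative assumptions. The subscription and resource group are chosen when the template is deployed, so they do not appear in the resource itself.

```python
import json


def build_workspace_resource(workspace_name, location, pricing_tier, managed_rg_id):
    """Return an ARM resource definition (as a dict) for a Databricks workspace.

    pricing_tier corresponds to the Pricing Tier choice in the portal
    ("standard" or "premium").
    """
    return {
        "type": "Microsoft.Databricks/workspaces",
        "apiVersion": "2018-04-01",  # assumed API version; check the current schema
        "name": workspace_name,
        "location": location,
        "sku": {"name": pricing_tier},
        "properties": {
            # Resource group that Azure Databricks manages for cluster resources.
            "managedResourceGroupId": managed_rg_id,
        },
    }


# All values below are hypothetical placeholders.
resource = build_workspace_resource(
    workspace_name="my-databricks-labs",
    location="westus2",
    pricing_tier="standard",
    managed_rg_id="/subscriptions/<subscription-id>/resourceGroups/my-databricks-labs-managed",
)
print(json.dumps(resource, indent=2))
```

The resulting dictionary would go in the `resources` array of a deployment template.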
1. Go to the Download tab and click Download Repository.
2. The downloaded file is in ZIP format.
3. If you extract the ZIP file, you will see the different lab notebooks. All the notebooks are of type .dbc, which stands for Databricks Archive.
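If you want to confirm which notebooks a download contains before importing anything, you can list the .dbc entries straight from the ZIP file. This is a small sketch using only the Python standard library. The helper name `list_dbc_notebooks` and the archive entry names in the demonstration are made up for illustration; they are not the actual file names in this repository.

```python
import io
import zipfile


def list_dbc_notebooks(zip_file):
    """Return the sorted names of all .dbc entries in a ZIP archive.

    zip_file may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_file) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.lower().endswith(".dbc")
        )


# Self-contained demonstration using an in-memory archive.
# The entry names are hypothetical placeholders.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("labs/Lab01.dbc", b"placeholder")
    demo.writestr("labs/README.md", b"placeholder")
buffer.seek(0)

print(list_dbc_notebooks(buffer))
```

For a real download, pass the path of the ZIP file instead of the in-memory buffer.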
1. Click the Workspace button or the Home button in the sidebar. Select Import as shown in the screenshot. Alternatively, next to any folder, click the right side of the text and select Import.
2. Browse to the location of your lab DBC file, and click Import.
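The import step above can also be scripted through the Databricks Workspace REST API, which accepts a base64-encoded archive at `POST /api/2.0/workspace/import`. The sketch below only builds the request body and does not send it; sending it would additionally need your workspace URL and a personal access token. The target path and the placeholder bytes are assumptions for illustration, and the helper name `build_import_request` is hypothetical.

```python
import base64
import json


def build_import_request(dbc_bytes, workspace_path):
    """Build the JSON body for POST /api/2.0/workspace/import.

    dbc_bytes is the raw content of a .dbc file; workspace_path is the
    folder to create in the workspace (it should not already exist).
    """
    return {
        "path": workspace_path,
        "format": "DBC",
        # The API expects the file content as a base64-encoded string.
        "content": base64.b64encode(dbc_bytes).decode("ascii"),
    }


# Placeholder bytes stand in for a real .dbc file read with open(path, "rb").
body = build_import_request(
    b"placeholder archive bytes",
    "/Users/someone@example.com/Labs",  # hypothetical workspace path
)
print(json.dumps(body, indent=2))
```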
In Databricks you can create two different types of clusters:
1. Standard : Standard clusters are the default and can be used with Python, R, Scala, and SQL.
2. High Concurrency : High-concurrency clusters are tuned to provide efficient resource utilization, isolation, security, and the best performance for sharing by multiple concurrently active users.
We'll be using a Standard cluster for our labs.
1. Click the clusters icon in the sidebar.
2. Click the Create Cluster button at the top of the page.
3. On the Create Cluster page, provide the following values:
a. Specify the cluster name QS.
b. Select 5.2 (Scala 2.11, Spark 2.4.0) in the Databricks Runtime Version drop-down.
c. Click Create Cluster. The cluster list page shows the status of the new cluster.
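The same cluster can be described as a JSON specification for the Databricks Clusters REST API (`POST /api/2.0/clusters/create`). This is a sketch under stated assumptions rather than part of the labs: the cluster name QS comes from the steps above, while the runtime key `5.2.x-scala2.11` (our best guess at the API identifier for the 5.2 runtime), the node type, and the worker count are illustrative and should be checked against the values your workspace offers. The snippet only builds the request body and does not call the API.

```python
import json


def build_cluster_spec(name, spark_version, node_type_id, num_workers):
    """Build the JSON body for POST /api/2.0/clusters/create."""
    return {
        "cluster_name": name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
    }


spec = build_cluster_spec(
    name="QS",
    spark_version="5.2.x-scala2.11",  # assumed key for runtime 5.2 (Scala 2.11, Spark 2.4.0)
    node_type_id="Standard_DS3_v2",   # assumed Azure VM size
    num_workers=2,                    # assumed worker count
)
print(json.dumps(spec, indent=2))
```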
- Go to the Databricks workspace.
- Go to your specific notebook in the workspace.