This repository takes an example Airbnb dataset and shows how to build ML regression models with PySpark/Python in the Databricks cloud environment.

ctrivino1/Databricks-and-Pyspark-ML-Regression-

Databricks-and-Pyspark-ML-Regression-

This repository is meant for users of the Databricks environment. One of the files above is a ".dbc" file, which means you need Databricks to access the notebooks it contains. Thankfully, there is a free community version of Databricks that anyone can sign up for, and from it you can open the files in this repo's dbc file. The dbc file contains the ipynb notebooks shown here, and I have also included those "ipynb" files separately.

The data is the 2021 Airbnb data for New Brunswick, Canada. The dataset can be found on the Inside Airbnb website, http://insideairbnb.com/get-the-data.html , where you can download a number of datasets. The file I used was New Brunswick, Canada's .gz file. It contains the raw data that the notebooks clean and then use for machine learning. The data will probably be more recent when you download it from the website. Most of the models are just baseline models because my free Azure account with Databricks expired, and training started taking a while on the free community version.
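To get a quick look at the downloaded .gz file before uploading it to Databricks, a minimal plain-Python sketch like the one below can help. It is not taken from the notebooks. The helper names `read_listings` and `parse_price` are hypothetical, and the idea that prices arrive as strings such as "$1,250.00" is an assumption about the raw file that you should check against your own download.

```python
import csv
import gzip
from itertools import islice


def parse_price(raw):
    """Turn a price string such as "$1,250.00" into a float.

    Returns None when the value cannot be parsed (for example an empty cell).
    The "$" and "," formatting is an assumption about the raw data.
    """
    cleaned = raw.replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def read_listings(path, limit=5):
    """Read up to `limit` rows from a gzip-compressed CSV as dictionaries."""
    # gzip.open in text mode decompresses on the fly, so the file never
    # has to be unpacked on disk.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        return list(islice(csv.DictReader(handle), limit))
```

Inside Databricks, Spark's CSV reader can read gzip-compressed files directly, so this step is only a local sanity check.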

Note: This dbc file assumes that you understand Python, SQL, PySpark, and machine learning. You will probably have to look up what some functions are doing. It should hopefully be a great teaching tool for showing how to do ML (machine learning) in PySpark/Databricks.

The code is primarily PySpark, but there is some Python/SQL involved. Skills you'll learn with the dbc file:

  • Data cleaning
  • Creating Random Forest models
  • Creating Gradient Boosting models
  • Creating an Artificial Neural Network
  • Databricks MLflow tracking system (you will definitely have to look up some of the code, but in short it stores models, parameters, and metrics)
  • Databricks Hyperopt library optimization
  • How to save a data table to Delta Lake
  • One-Hot encoding (linear regression file)

In the future I would like to find a way to show how to use SHAP (Shapley value) charts with PySpark models. SHAP charts are very powerful for explaining an ML model. I will work on updating my notebooks with some SHAP charts in the future.
