This repository contains the project information for the Data Pipelines with Apache Airflow project from the Udacity Data Engineer Nanodegree. Please refer to the course website for more details.
Project scenario: a startup called Sparkify wants to automate and monitor their data warehousing ETL on AWS and has decided to use Apache Airflow to do the job. Their original data resides in AWS S3 as user activity logs and song metadata in JSON format.
The database schema/tables and most of the ETL are already provided; the main task of this project is to automate the ETL with Apache Airflow and monitor the pipeline through the Airflow UI.
The data engineering team already has an ETL process and wants to automate and monitor it with Airflow. Details about the ETL:
- The analytics team needs to understand what songs their users are listening to by analyzing a set of fact/dimension tables.
- The analytics team wants to build a data warehouse on AWS Redshift, with tables designed to run analytical queries and gain insights on song plays.
- Create and launch an AWS Redshift cluster.
- Move the existing ETL to AWS and automate/monitor it with Airflow.
Tools used in this project:
- Python 3
- AWS
- Redshift SQL
- Apache Airflow
- LucidChart
Note that the actual data (in JSON) used in this project is a subset of the original dataset, preprocessed by the course. The provided data resides in AWS S3 (publicly available).
- Song data from the Million Song Dataset
- User activity data from an event simulator based on the Million Song Dataset
The schema is already provided for this project. A bit more explanation of the business process: since the core business process/metric is a user playing a song, the fact table should store the song play records with user/song identifiers, together with related information about how and where the song is played. Based on the data and tables given in the project, the star schema looks like this (generated using LucidChart):
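As a complement to the schema diagram, here is a minimal sketch of what the `songplays` fact table DDL could look like on Redshift; the column names and types are assumptions based on the description above, not the project's actual definitions.

```python
# Hypothetical DDL sketch for the songplays fact table (column names/types are assumptions).
# In an Airflow project this would typically live in a SQL helper module imported by the DAG.
SONGPLAYS_TABLE_DDL = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id  VARCHAR(32)  NOT NULL PRIMARY KEY,
    start_time   TIMESTAMP    NOT NULL,   -- when the song was played
    user_id      INTEGER      NOT NULL,   -- who played it (references users)
    level        VARCHAR(10),             -- free / paid
    song_id      VARCHAR(20),             -- references songs
    artist_id    VARCHAR(20),             -- references artists
    session_id   INTEGER,                 -- how: the app session
    location     VARCHAR(256),            -- where the song was played
    user_agent   VARCHAR(512)             -- browser / device used
);
"""
```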
Copy the files into the respective folders of your own Airflow installation and execute the DAG.
- The DAG handles the creation of tables (tables are only created if they don't exist)
- A sync node ensures all tables are created before any further operations in the DAG
- Need to set `catchup = True` and `schedule_interval = '@hourly'` since we only have data from 2018/11 (see the DAG sketch after this list)
- Staging tables will be truncated before loading by default, however the user can opt to append to the table instead
- Fact tables will not be truncated before insert by default, however the user can opt to truncate the table first
- Dimension tables will be truncated before insert by default, however the user can opt to append to the table instead (see the load operator sketch below)
- The data quality operator will check the number of records and for `NULL` values (see the quality-check sketch below)
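A minimal sketch of how the DAG scheduling described above could be declared; the DAG id, dates, and task names here are assumptions for illustration, not the project's exact code.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# Hypothetical DAG definition illustrating catchup and hourly scheduling for 2018/11.
default_args = {
    "owner": "sparkify",
    "depends_on_past": False,
    "retries": 3,
}

dag = DAG(
    "sparkify_etl_dag",                  # assumed DAG id
    default_args=default_args,
    start_date=datetime(2018, 11, 1),    # the data only covers November 2018
    end_date=datetime(2018, 11, 30),
    schedule_interval="@hourly",         # run once per hour
    catchup=True,                        # backfill every hourly run in the data period
    max_active_runs=1,
)

start_operator = DummyOperator(task_id="Begin_execution", dag=dag)
tables_created = DummyOperator(task_id="Tables_created", dag=dag)  # the sync node

start_operator >> tables_created
```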
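The truncate/append behaviour could be exposed as a flag on a custom load operator. The class name and the `truncate_before_insert` parameter below are assumptions, not the operators shipped with the project; this is just a sketch of the idea.

```python
from airflow.models import BaseOperator
from airflow.hooks.postgres_hook import PostgresHook
from airflow.utils.decorators import apply_defaults


class LoadTableOperator(BaseOperator):
    """Hypothetical operator that inserts into a fact or dimension table,
    optionally truncating it first (the default for dimension tables)."""

    @apply_defaults
    def __init__(self, redshift_conn_id="", table="", insert_sql="",
                 truncate_before_insert=True, *args, **kwargs):
        super(LoadTableOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.table = table
        self.insert_sql = insert_sql
        self.truncate_before_insert = truncate_before_insert

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        if self.truncate_before_insert:
            self.log.info("Truncating table %s before insert", self.table)
            redshift.run("TRUNCATE TABLE {}".format(self.table))
        redshift.run("INSERT INTO {} {}".format(self.table, self.insert_sql))
```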
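And a sketch of the kind of checks the data quality operator could run; the check structure and SQL are assumptions, shown only to illustrate the record-count and `NULL` checks mentioned above.

```python
from airflow.models import BaseOperator
from airflow.hooks.postgres_hook import PostgresHook
from airflow.utils.decorators import apply_defaults


class DataQualityOperator(BaseOperator):
    """Hypothetical data quality operator: fails the task if a table is empty
    or if a key column contains NULL values."""

    @apply_defaults
    def __init__(self, redshift_conn_id="", checks=None, *args, **kwargs):
        super(DataQualityOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        # each check: (table, column that must not be NULL) -- assumed structure
        self.checks = checks or []

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for table, column in self.checks:
            records = redshift.get_records("SELECT COUNT(*) FROM {}".format(table))
            if not records or records[0][0] == 0:
                raise ValueError("Data quality check failed: {} has no records".format(table))
            nulls = redshift.get_records(
                "SELECT COUNT(*) FROM {} WHERE {} IS NULL".format(table, column))
            if nulls[0][0] > 0:
                raise ValueError(
                    "Data quality check failed: {}.{} contains NULL values".format(table, column))
            self.log.info("Data quality checks passed for %s", table)
```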