TurboInsights is a data engineering project that focuses on analyzing car listing data from CarGurus.com. This project leverages PySpark for data ingestion, transformation, and loading, MySQL for data warehousing and data mart creation, and Power BI for visualization and reporting. TurboInsights aims to provide insightful analytics on the car listing data, facilitating informed decision-making. Check out the final dashboard using this link: https://drive.google.com/drive/folders/1L5o1xxy59sc7M3tpxK4-GHPQzI0QDB7b?usp=drive_link
The project is structured into several key components:
- Data Ingestion: Extracting data from Kaggle and loading it into a staging area.
- Data Transformation: Cleaning, transforming, and enriching the data using PySpark.
- Data Storage: Storing the transformed data in a MySQL data warehouse.
- Data Mart Creation: Building a data mart from the data warehouse for efficient querying and analysis.
- Visualization: Creating interactive dashboards in Power BI to visualize and report the data insights.
- PySpark: For data ingestion, transformation, and enrichment.
- MySQL: For data warehousing and data mart creation.
- Power BI: For data visualization and reporting.
The data pipeline consists of the following steps:
- Data Extraction: The Data was extracted from CarGurus.com using a webscraping code and saved as a CSV file on Kaggle. Kaggle API was used to download the data and unzip it.
- Data Loading: The CSV file is loaded into a staging area using PySpark.
- Data Transformation:
- Dropping unrelevant columns from the dataset.
- Cleaning null values.
- Filtering wrong data.
- Changing columns data types.
- Parsing columns that contains numerical values.
- Transforming the data into fact & dimension tables.
- Joining & merging dataframes to fit into a snowflake schema.
- Data Storage: The transformed data is stored in a MySQL data warehouse.
- Data Mart Creation: A data mart focused on dealers perforamnce is created from the data warehouse for optimized analytical queries.
The data warehouse is designed using a snowflake schema to handle the large and complex car listings data efficiently. The snowflake schema ensures that data is normalized, reducing redundancy and improving query performance. This structure allows for detailed and accurate analysis of the car listings data.
The data mart is focused on dealer performance metrics, providing a streamlined subset of the data warehouse tailored for analytical queries related to dealers, making it easier to perform specific analytical queries related to dealer performance. This structure supports efficient reporting and visualization in Power BI.
The Power BI dashboard is built using the data from the data mart. It provides a comprehensive view of dealer performance, offering insights such as:
TurboInsights provides a robust framework for analyzing car listings data, from data ingestion and transformation to storage and visualization. By leveraging PySpark, MySQL, and Power BI, this project demonstrates how to build a scalable and efficient data engineering solution for real-world data analysis.
Feel free to reach out for any questions or further improvements!
Happy Analyzing!
Author: Zaid Hani Allwansah
Contact: [email protected]