If you're more comfortable with Korean, see here.
What are the most common events on GitHub? Through this project, I learned that push events are the most common, and that the repository with the most events in March was Lombiq/Orchard!
Set up your environment to develop an Airflow DAG.
$ virtualenv venv --python=python3.10
$ source venv/bin/activate
(venv)$ pip install -r requirements.txt
(venv)$ pip install -r requirements-dev.txt
(venv)$ deactivate
Terraform provisions a total of five resources.
- a GCS bucket to store the extracted data
- a BigQuery dataset and an external table pointing at the above GCS bucket
- a dataset and table to store the DBT output
$ cd terraform
$ terraform init
$ terraform plan -var="project=<your-gcp-project-id>"
$ terraform apply -var="project=<your-gcp-project-id>"
The VM running Airflow was created from the console, not provisioned with Terraform.
I referenced Max's article on how to install Airflow on a VM. Link: https://github.com/keeyong/data-engineering-batch12/blob/main/docs/Airflow%202%20Installation.md
Fetch data from the GitHub Archive every hour, at 30 minutes past the hour.
From the fetched data, drop the `payload` field (I don't use it), convert the `created_at` field from a string to a timestamp, and save the result as a parquet file.
I did this with DuckDB.
DuckDB is an open-source database that makes data processing easier for people who are more comfortable with SQL than with a DataFrame API.
I've seen several blog posts saying it is more memory-efficient than Pandas but less performant than Polars; however, using Polars in Airflow threw an error similar to `Exit With Return code 2`.
As far as I could tell there was no workaround, so I switched from Polars to DuckDB and was very happy with the result :) (Ideally the compute would run outside the Airflow instance, but since I only have one DAG, I decided that moving the work elsewhere would be unnecessary overhead.)
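Below is a minimal sketch of the fetch-and-transform step, not the exact task code in this repository: the local paths and the `fetch_and_convert` helper are illustrative, while the hourly file URL pattern comes from gharchive.org.

```python
# Illustrative sketch of the hourly fetch + DuckDB transform (paths and names are hypothetical).
import duckdb
import requests


def fetch_and_convert(date: str, hour: int) -> str:
    """Download one hourly GH Archive dump and convert it to parquet."""
    # GH Archive publishes one gzipped NDJSON file per hour,
    # e.g. https://data.gharchive.org/2024-03-01-15.json.gz
    url = f"https://data.gharchive.org/{date}-{hour}.json.gz"
    json_path = f"/tmp/{date}-{hour}.json.gz"
    parquet_path = f"/tmp/{date}-{hour}.parquet"

    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(json_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                f.write(chunk)

    # DuckDB reads the gzipped NDJSON directly: drop `payload`,
    # cast `created_at` from string to timestamp, write parquet.
    duckdb.sql(f"""
        COPY (
            SELECT * EXCLUDE (payload)
                   REPLACE (CAST(created_at AS TIMESTAMP) AS created_at)
            FROM read_json_auto('{json_path}')
        ) TO '{parquet_path}' (FORMAT PARQUET)
    """)
    return parquet_path
```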
The parquet file is then loaded into GCS with Hive-style partitioning, so that the BigQuery external table can be partitioned. Here I specified Hive partitions for `year`, `month`, and `day` (`year={year}/month={month}/day={day}`).
This saves cost when fetching data from the external table with a `WHERE` clause on the partition columns.
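For the load step, here is a minimal sketch of writing the parquet under that partition layout, assuming the `google-cloud-storage` client; the bucket name and object prefix are placeholders.

```python
# Sketch of uploading the parquet file under a Hive-partitioned prefix.
# Bucket name, prefix, and file naming are illustrative choices.
from google.cloud import storage


def upload_to_gcs(parquet_path: str, year: int, month: int, day: int, hour: int) -> None:
    client = storage.Client()
    bucket = client.bucket("gharchive-data-bucket")  # placeholder bucket name

    # Hive-style partition prefix: year={year}/month={month}/day={day}
    blob_name = f"gh_events/year={year}/month={month}/day={day}/{hour}.parquet"
    bucket.blob(blob_name).upload_from_filename(parquet_path)
```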
The screenshots below show that partitioning is applied to the external table as expected (bytes scanned drop from 279 MB to 52 MB; `year`, `month`, and `day` are partition prefixes, not columns in the original data).
- Without a `WHERE` clause
- With a `WHERE` clause
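For illustration, this is the kind of partition-pruned query the comparison above is based on; the sketch uses the `google-cloud-bigquery` client, and the dataset and table names are placeholders.

```python
# Sketch of querying the external table with partition pruning.
# Dataset/table names are placeholders; adjust the literals to match
# however the partition columns were typed during hive-partition inference.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT type, COUNT(*) AS event_count
    FROM `gharchive_dataset.gh_events_external`
    WHERE year = 2024 AND month = 3 AND day = 1   -- prunes Hive partitions
    GROUP BY type
    ORDER BY event_count DESC
"""
for row in client.query(query).result():
    print(row.type, row.event_count)
```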
I didn't build a separate CI/CD pipeline for the Airflow DAG files; I just updated the code by pushing to the repository and pulling it from within the VM.
$ sudo su airflow
$ cd GHArchive-Data-Pipeline-Project/ && \
git pull && \
cd .. && \
cp -r GHArchive-Data-Pipeline-Project/dags/* dags/
I initially tried Prefect, but the documentation was too unfriendly.
There is currently no documentation on deploying it to a VM, and the community is too small.
I realized I might not be able to finish the project this way, so I took the plunge, dropped Prefect, and moved to the more familiar Airflow.
In my opinion, Prefect won't be replacing Airflow any time soon (honestly, it just isn't good, and nothing out there beats Airflow yet).
See here for DBT-related documentation.
- Dataset Link: https://www.gharchive.org/