
Add DAG to remove old entries from metadata database #245

Merged
lucaszanotelli merged 8 commits into master from cleanup-metadata-dag on Nov 8, 2023

Conversation

@lucaszanotelli (Contributor) commented Nov 8, 2023

This PR adds a maintenance DAG to prune the content of Airflow’s metadata database.

By default, this DAG removes old entries from the job, dag_run, task_instance, log, xcom, sla_miss, dags, task_reschedule, task_fail, and import_error tables. Review the list of tables in the DAG and decide which ones should have old entries removed; in general, most of the space savings come from cleaning the log, task_instance, dag_run, and xcom tables. To exclude a table from cleanup, comment out the corresponding entry in the DATABASE_OBJECTS list (see the sketch below).
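For reference, the shape of that list is roughly as follows. This is only a simplified sketch assuming an Airflow 2.x environment; the model/column pairs are illustrative, and the entries in the actual DAG (which follows Google's cleanup guide) carry more detail.

```python
# Simplified sketch: pair each metadata model with the column used to judge row age.
from airflow.models import DagRun, Log, TaskInstance, XCom

DATABASE_OBJECTS = [
    {"airflow_db_model": DagRun, "age_check_column": DagRun.execution_date},
    {"airflow_db_model": TaskInstance, "age_check_column": TaskInstance.start_date},
    {"airflow_db_model": Log, "age_check_column": Log.dttm},
    {"airflow_db_model": XCom, "age_check_column": XCom.timestamp},
    # To exclude a table from cleanup, comment out its entry, e.g.:
    # {"airflow_db_model": SlaMiss, "age_check_column": SlaMiss.execution_date},
]
```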

The default retention period is 30 days. To change it, create the max_db_entry_age_in_days Airflow variable and set it to an integer number of days to retain entries.
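As a rough sketch (not the literal DAG code), the variable can be read with a 30-day fallback like this:

```python
# Read the retention period from an Airflow Variable, defaulting to 30 days.
from airflow.models import Variable

max_db_entry_age_in_days = int(
    Variable.get("max_db_entry_age_in_days", default_var=30)
)
```

The variable itself can be set from the Airflow UI (Admin > Variables) or the CLI, e.g. `airflow variables set max_db_entry_age_in_days 60`.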

The DAG has already run successfully in the test-hubble environment: https://he02f27a661269b05p-tp.appspot.com/dags/cleanup_metadata_dag/grid

This workflow was created mainly following the guide: https://cloud.google.com/composer/docs/cleanup-airflow-database

@lucaszanotelli lucaszanotelli requested a review from a team as a code owner November 8, 2023 13:12
@sydneynotthecity (Contributor) left a comment


This looks great! Thanks for putting this together to keep our Airflow environment healthy.

When I looked at the DAG definition in Airflow, I noticed the cleanup_sessions task instance does not have an upstream dependency on print_configuration. I assume this is OK? The dependency tree just doesn't match the rest of the tasks.

Do you think it's also worthwhile to enable Sentry -> Slack alerting by adding the alert_after_max_retries param to the PythonOperators?

Neither of these suggestions block the merge.

@lucaszanotelli (Contributor, Author) commented


Regarding the cleanup_sessions dependency, it's not an issue since this task doesn't need any configuration loaded from the print_configuration task.
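To make the shape concrete, here is a rough sketch of the dependency tree being described, using placeholder EmptyOperators; apart from print_configuration and cleanup_sessions, the task names and DAG id are made up for illustration.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG("cleanup_metadata_dependency_sketch", start_date=datetime(2023, 11, 1), schedule=None):
    print_configuration = EmptyOperator(task_id="print_configuration")
    cleanup_log = EmptyOperator(task_id="cleanup_log")
    cleanup_xcom = EmptyOperator(task_id="cleanup_xcom")
    cleanup_sessions = EmptyOperator(task_id="cleanup_sessions")

    # The table-cleanup tasks consume the configuration computed upstream.
    print_configuration >> [cleanup_log, cleanup_xcom]
    # cleanup_sessions reads nothing from print_configuration, so it has no upstream task.
```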

About alert_after_max_retries, nice catch! I'll add it right away.
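Something along these lines, assuming the callback is attached through on_failure_callback; the helper bodies and names below are placeholders rather than our actual code:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def cleanup_db_entries(**context):
    """Placeholder for the DAG's existing cleanup callable."""


def alert_after_max_retries(context):
    """Placeholder: the real helper fires a Sentry/Slack alert once retries are exhausted."""


with DAG("cleanup_metadata_alert_sketch", start_date=datetime(2023, 11, 1), schedule=None):
    cleanup_log = PythonOperator(
        task_id="cleanup_log",
        python_callable=cleanup_db_entries,
        on_failure_callback=alert_after_max_retries,
        retries=3,
    )
```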

@lucaszanotelli lucaszanotelli merged commit 2bd16a6 into master Nov 8, 2023
4 checks passed
@lucaszanotelli lucaszanotelli deleted the cleanup-metadata-dag branch November 8, 2023 19:09
lucaszanotelli added a commit that referenced this pull request Nov 8, 2023
* dag to remove old entries from metadata database

* reformat dag file

* update `pre-commit` steps and configuration

* downgrade `pre-commit` `sqlfluff` step

* fix `cleanup_RenderedTaskInstanceFields` error

* add variable to change retention period

* toggle boolean to start deleting db entries

* add `alert_after_max_retries` callback function

Co-authored-by: lucas zanotelli <[email protected]>