Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1.7.0 Catalog fetch improvement #486

Merged
merged 6 commits into from
Oct 26, 2023
Merged

1.7.0 Catalog fetch improvement #486

merged 6 commits into from
Oct 26, 2023

Conversation

benc-db
Copy link
Collaborator

@benc-db benc-db commented Oct 24, 2023

Description

Catalog fetch improvement

Core idea is that we only get data for relations managed by dbt. For UC, this uses information schema. For HMS, this continues to use the existing approach, but filters to selected nodes. Will be updating this PR to refactor and add a couple of unit tests, but the key tests are basically the existing functional tests around documentation.

Checklist

  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt-databricks next" section.

@benc-db
Copy link
Collaborator Author

benc-db commented Oct 24, 2023

@mikealfare please take a look when you have time.

@rcypher-databricks rcypher-databricks added the blocked We cannot currently make progress on this issue, but it is on our radar. label Oct 24, 2023
@benc-db benc-db temporarily deployed to azure-prod October 24, 2023 17:23 — with GitHub Actions Inactive
@@ -13,23 +13,23 @@ def schema(self):
def models(self):
return "models"

def run_and_test(self, contains_catalog: bool):
def run_and_test(self):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests changed because I now explicitly set the catalog to hive_metastore if it is set to None (after checking the cluster setting). We were defaulting to spark_catalog, but I couldn't find where this was set, and it was causing me pain.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So question: is there any situation where changing this default behavior is breaking?

@benc-db benc-db temporarily deployed to azure-prod October 24, 2023 20:44 — with GitHub Actions Inactive
@benc-db benc-db temporarily deployed to azure-prod October 24, 2023 20:44 — with GitHub Actions Inactive
@benc-db benc-db temporarily deployed to azure-prod October 24, 2023 20:44 — with GitHub Actions Inactive
@benc-db benc-db temporarily deployed to azure-prod October 24, 2023 21:20 — with GitHub Actions Inactive
@benc-db benc-db temporarily deployed to azure-prod October 24, 2023 21:20 — with GitHub Actions Inactive
@benc-db benc-db temporarily deployed to azure-prod October 24, 2023 21:20 — with GitHub Actions Inactive
@benc-db
Copy link
Collaborator Author

benc-db commented Oct 24, 2023

@susodapop @rcypher-databricks ready for review...I had planned on doing more refactoring, in part to get functions that are only for hive into their own module, but I think I'll wait for the day when I can take advantage of what Jade is working on.

Copy link
Collaborator

@rcypher-databricks rcypher-databricks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me

@benc-db benc-db removed the blocked We cannot currently make progress on this issue, but it is on our radar. label Oct 24, 2023
Copy link

@susodapop susodapop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -425,60 +431,54 @@ def parse_columns_from_information( # type: ignore[override]
def get_catalog(
self, manifest: Manifest, selected_nodes: Optional[Set] = None

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting for my own sake that in the previous implementation of this method, selected_nodes was optional but was never enforced.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe selected_nodes was added recently for the get_catalog performance improvements. We had a miss where we called it with the additional argument before checking if we could and @benc-db had to update this signature as a result. So the kwarg did nothing besides allowing get_catalog to be called with an additional argument.

@benc-db benc-db merged commit 5d53855 into main Oct 26, 2023
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants