-
Notifications
You must be signed in to change notification settings - Fork 122
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implementing metadata freshness checks #481
Conversation
BTW, there may be additional tests required as the guide calls out two more functional tests, but since those tests had a lot of snowflake specific comments in them (and so far are only in the snowflake adapter), I'm waiting until they're part of core before implementing here. I wanted to get this code up for people to look at while I'm away. |
dbt/adapters/databricks/relation.py
Outdated
@@ -115,3 +125,13 @@ def matches( | |||
@classproperty | |||
def get_relation_type(cls) -> Type[DatabricksRelationType]: | |||
return DatabricksRelationType | |||
|
|||
def information_schema(self, view_name=None) -> InformationSchema: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mostly copied from core, but I had to override so that it would use an information schema that used "`" for quoting, since InformationSchema inherits from BaseRelation.
project.adapter.drop_schema(relation) | ||
|
||
def test_get_relation_last_modified(self, project, set_env_vars, custom_schema): | ||
project.run_sql( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is currently hand copied from dbt-snowflake. Nothing I do in the test is specific to databricks, so hopefully it gets pulled into core.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Essentially the test just says, were we able to run the metadata commands and not warn/error. We definitely need another test that asserts that something useful happens. Will add after I'm back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the test to ensure the freshness tests passed, as opposed to not having a warning or error. @dataders might be worth considering this form for core; to me at least its more obvious what the test is doing (as opposed to the probe function). Also, this file did not match the format I found in the documentation for sources.json, btw.
@benc-db you da real MVP for posting this. I'll definitely flag this for the adapters eng team to look at once we're back from Coalesce |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with a few comments. Agree we need a test of the true output of the freshness check. But the smoke test is a good start. Nice work
_capabilities = CapabilityDict( | ||
{Capability.TableLastModifiedMetadata: CapabilitySupport(support=Support.Full)} | ||
) | ||
|
||
@available.parse(lambda *a, **k: 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the first time our dialect has needed to use _capabilities
? Or had we just inherited the _capabilities
from dbt-spark? Not a blocker for this PR but I wonder if more capabilities should be reflected explicitly here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is new in 1.7
quote_character: str = "`" | ||
|
||
def is_hive_metastore(self): | ||
return self.database is None or self.database == "hive_metastore" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just confirming: does dbt's database
== Databricks' schema
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
database == catalog.
{%- for relation in relations -%} | ||
select '{{ relation.schema }}' as schema, | ||
'{{ relation.identifier }}' as identifier, | ||
max(timestamp) as last_modified, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just confirming: the output of a Databricks SQL query that includes a max()
aggregation is never more than a single row, yes? That's why there is no GROUP BY
present here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with the absence of group by, it will be a single row.
Co-authored-by: Jesse <[email protected]>
Description
Addresses the 'metadata freshness changes' part of dbt-labs/dbt-core#8307
Doing this in a way that can work for hive and unity was interesting. I'm a little concerned about the accuracy/perf around the hive implementation, but its the best I can do, and I don't have a great way to insert a check for hive metastore earlier in the call chain.
@dataders any concern that if someone has a view as a source (as is the case in my bigger transform jobs), then this mechanism doesn't work? I didn't see if there was any logic that only called down to this path if the relation is a table. Both using information schema in this way and describe history are table-only features. I mean we could look at information_schema.views, but last altered doesn't give the same sort of information for views as it does for tables.
Checklist
CHANGELOG.md
and added information about my change to the "dbt-databricks next" section.