
Releases: datahub-project/datahub

DataHub v0.8.26

08 Feb 23:22
3668de8

This is a bugfix release that addresses the issue in v0.8.25 with adding Glossary Terms to Dataset fields.

Release Highlights

  • Fixes a bug in the previous release (v0.8.25) where Glossary Terms could not be added to Dataset fields.

DataHub v0.8.25

07 Feb 22:32
ec062b6

Known Issues

  • Adding Glossary Terms to schema fields does not work with this version due to a bug. Upgrade to v0.8.26 for the fix.

Release Highlights

Buckle up, folks! v0.8.25 brings some very exciting (and highly-requested!) updates.

Notable UI-Based Features

  • UI-based Ingestion - as demoed in the December Town Hall, we now support creating, configuring, scheduling, & executing batch metadata ingestion from the DataHub user interface. This makes getting metadata into DataHub easier by minimizing the overhead required to operate custom integration pipelines.
  • Data Domains - DataHub now supports grouping data assets into logical collections called Domains. Domains are curated, top-level folders or categories where related assets can be explicitly grouped. Read the guide here!
  • Data Containers are now supported! A Container is a physical grouping of entities, e.g. a Schema is a container of 1 or more Datasets; a Dashboard is a container of 1 or more Charts.

Notable Metadata Model & Ingestion-Based Features

  • Data Quality test results are now supported in the DataHub metadata model. This is the first milestone toward surfacing Dataset & Column-level Data Quality results in the UI (read full scope of work here). Future releases will include a Great Expectations integration & UI support - we’re on track to complete this in Q1 as planned.
  • Avro files are now supported in the Data Lake File ingestion source
  • Ingest metadata from multiple instances of the same platform type. This has been a very common use case within the Community - you can now differentiate multiple instances of the same platform type! If you already have pre-existing entries, use the datahub migrate command to migrate them over to platform instances.
  • Ignore users from Top Users calculation
    • feat(ingestion): Adding ability to ignore users from top users calculation by @treff7es in #3735
  • BigQuery - Data Profiling on only the latest partition/shard
    • feat(ingestion) bigquery: Profiling only the latest partition/shard on bigquery by @treff7es in #3930
  • (feat)(Business Glossary) add tabular schema and new UI for business glossary by @saxo-lalrishav in #3813
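
To make the platform-instance item above concrete, here is a minimal Python sketch of how two instances of the same platform type can end up with distinct dataset identifiers. It assumes the instance name is prefixed to the dataset name inside the URN; the helper `dataset_urn` and the instance name `us_east` are hypothetical and are not part of the DataHub client API.

```python
from typing import Optional


def dataset_urn(
    platform: str,
    name: str,
    env: str = "PROD",
    platform_instance: Optional[str] = None,
) -> str:
    """Build a dataset URN string (illustrative helper, not a DataHub API).

    Assumption: when a platform instance is given, it is prefixed to the
    dataset name so that identical table names on different instances of
    the same platform do not collide.
    """
    qualified = f"{platform_instance}.{name}" if platform_instance else name
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{qualified},{env})"


if __name__ == "__main__":
    # Same table name, with and without a (hypothetical) instance name.
    print(dataset_urn("mysql", "db.orders"))
    print(dataset_urn("mysql", "db.orders", platform_instance="us_east"))
```

Under this assumption, entries ingested before platform instances existed have no instance prefix, which is why the notes point to the datahub migrate command for moving them over.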

Notable Fixes

  • Fix to support View in Looker
    • feat(looker): Adding optional Looker external url base url config by @jjoyce0510 in #3985
  • fix(graphql): support group display name in ownership by @thomasplarsson in #3979
  • fix(profiling): Enabling profiling for low cardinality number columns by @treff7es in #3990
  • fix(ingestion): match default username for Azure OIDC and Azure ingestion source by @iasoon in #3926

DataHub v0.8.24

24 Jan 21:42
f2e2a4d

Release Highlights

  • Adding support for nested Glue schemas
  • Adding Data Lake Files ingestion source to support data profiling for local files and files stored in AWS S3; supported file types are CSV, TSV, Parquet, and JSON
  • Improvements to readability in UI to format large numbers, including: adding thousands separators & rounding large numbers to millions with raw value available via tooltip
  • Miscellaneous bug fixes & improvements
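
The number-formatting behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the UI's actual code; the function name `format_large_number`, the one-decimal rounding, and the "M" suffix are assumptions.

```python
from typing import Tuple


def format_large_number(value: int) -> Tuple[str, str]:
    """Return (display_text, tooltip_text) for a count.

    Values below one million are shown with thousands separators.
    Values of one million or more are rounded to millions, and the
    exact value stays available as the tooltip text.
    """
    raw = f"{value:,}"  # e.g. 2500000 -> "2,500,000"
    if abs(value) >= 1_000_000:
        display = f"{value / 1_000_000:.1f}M"
    else:
        display = raw
    return display, raw


if __name__ == "__main__":
    for n in (12345, 2_500_000):
        print(n, format_large_number(n))
```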

Full Changelog: v0.8.23...v0.8.24

DataHub v0.8.23

14 Jan 23:06
a44b48a

Release Highlights

  • Fix critical Dashboard / Charts bug from 0.8.22, where Chart inputs were not being ingested successfully.
  • Adding currently deployed version to the UI (under top-right dropdown menu). Also available via the GMS /config endpoint.
  • Robustness improvements to DataHub Java Client Package
  • Introducing a new Elasticsearch ingestion connector!
  • Misc bug fixes & improvements.
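
As a rough sketch of reading the deployed version from the GMS /config endpoint mentioned above, the snippet below fetches and parses that endpoint. It assumes GMS is reachable at `http://localhost:8080` without authentication and that the endpoint returns JSON; the function names are illustrative, and no particular response fields are assumed.

```python
import json
from urllib.request import urlopen


def config_url(gms_base: str) -> str:
    """Join the GMS base URL with the /config path."""
    return gms_base.rstrip("/") + "/config"


def fetch_gms_config(gms_base: str = "http://localhost:8080", timeout: float = 5.0) -> dict:
    """Fetch and decode the /config payload (assumed to be JSON)."""
    with urlopen(config_url(gms_base), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Print the whole payload rather than guessing at specific keys.
    print(json.dumps(fetch_gms_config(), indent=2))
```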

Full Changelog: v0.8.22...v0.8.23

DataHub v0.8.22

09 Jan 00:59
bb0943f

Disclaimers!

  • Ingesting Chart Inputs was broken in a PR that got into this release. This will be fixed in v0.8.23. If you plan to ingest Charts / Dashboards, we recommend skipping this version and upgrading to v0.8.23 directly.

Release Highlights:

  • Support for mapping DBT meta properties of a dataset to metadata operations, such as add_owner, add_term, add_tag etc.
  • Java REST emitter library to programmatically generate metadata events from Java-based clients such as from Spark jobs.
  • Data freshness indication via Last Updated Timestamp.
  • Improvements to data profiling performance and lineage extraction
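
The dbt meta-mapping idea above can be pictured as a small rule table: each meta key is paired with a pattern and one of the operations named in the notes (add_owner, add_term, add_tag). The Python sketch below is a conceptual illustration only; the `RULES` shape, the meta keys, and `plan_operations` are hypothetical and do not reflect the actual recipe schema of the dbt source.

```python
import re
from typing import Any, Dict, List, Tuple

# Hypothetical rules: meta key -> (regex the value must fully match, operation).
RULES: Dict[str, Tuple[str, str]] = {
    "business_owner": (r".+", "add_owner"),
    "has_pii": (r"True", "add_tag"),
    "glossary": (r"Finance.*", "add_term"),
}


def plan_operations(meta: Dict[str, Any]) -> List[Tuple[str, str, str]]:
    """Return (operation, meta_key, value) for every rule whose pattern matches."""
    planned = []
    for key, (pattern, operation) in RULES.items():
        if key not in meta:
            continue
        value = str(meta[key])
        if re.fullmatch(pattern, value):
            planned.append((operation, key, value))
    return planned


if __name__ == "__main__":
    # Example dbt meta block for one model (made-up values).
    print(plan_operations({"business_owner": "alice", "has_pii": True}))
```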

Full Changelog: v0.8.21...v0.8.22

v0.8.21

28 Dec 19:37
895af09

This release includes a fix for timeouts when reindexing large indices, which can occur when new fields are added to an index.

Release Highlights

  • Getting Started Modal + Empty State: Improves the empty-state experience by showing a "Getting Started" guide when no data has been ingested into DataHub yet.
  • Provide BigQuery credentials via recipe config: Previously BigQuery credentials were provided via environment variable. Going forward they can be provided directly inside the Recipe config.
  • Increase re-indexing 30s timeout: Previously, Elasticsearch re-indexing was capped at a 30-second synchronous timeout, which caused some GMS upgrades to fail. This release increases that timeout to one hour.

What's Changed

  • fix(lkml): bump lkml version up to 1.1.2 to support sql_preamble expression by @hyunminch in #3757
  • fix(react-ui): fix header min height by @gabe-lyons in #3784
  • docs(auth): add Microsoft Azure as an SSO provider (#3779) by @cccs-eric in #3780
  • Add azure OIDC doc to sidebar by @jjoyce0510 in #3785
  • feat(UI): Add "Getting Started" Modal on fresh deployment by @jjoyce0510 in #3773
  • feat(transform): adds simple add dataset properties transform by @sgomezvillamor in #3778
  • Update troubleshooting steps for local development with docker by @RyanHolstien in #3788
  • docs(redshift): Updating Redshift permission prerequisites in doc by @treff7es in #3777
  • fix(superset): fix Superset chart ingestion with an empty metric label by @cccs-eric in #3793
  • doc(transforms): adds doc for simple_add_dataset_properties transformer by @sgomezvillamor in #3790
  • feat(ingest): Add config option to set Bigquery credential in source config by @treff7es in #3786
  • fix(elastic): allow more time for re-indexing tasks by @gabe-lyons in #3794
  • docs(kafka): add example for ingestion from confluent cloud by @anshbansal in #3789

Full Changelog: v0.8.20...v0.8.21

v0.8.20

20 Dec 22:35
77e3641

This release includes the patch for CVE-2021-44228, pinning log4j to 2.17.0. Otherwise, it contains small bug fixes & improvements.

Release Highlights

  • Configurable aspect retention in application.yml (disabled by default)
  • Metabase Ingestion Source connector
  • Constrain log4j to version 2.17.0
  • Upgrade logback to 1.2.9

What's Changed

  • feat(spark-lineage): add ability to push data lineage from spark to d… by @MugdhaHardikar-GSLab in #3664
  • feat(cli): allow to nuke without deleting data in quickstart by @anshbansal in #3655
  • feat(Dgraph): Make Dgraph a proper Neo4j alternative by @EnricoMi in #3578
  • feat(retention): Add retention to Local DB by @dexter-mh-lee in #3715
  • feat(ingest): cleanup deprecated datahub.integrations.airflow.* imports by @hsheth2 in #3732
  • feat(ingestion) : Add Metabase Source Connector by @jawadqu in #3602
  • fix(ingest): count profiled tables separately in report by @hsheth2 in #3731
  • feat(perf-test): changes for perf testing by @anshbansal in #3728
  • ci(cypress): adding the foundation for cypress integration tests & some starter coverage for login, search & updates by @gabe-lyons in #3672
  • (fix) Elastic search container log4j CVE-2021-44228 vulnerability by @nsbala-tw in #3733
  • Revert "feat(Dgraph): Make Dgraph a proper Neo4j alternative" by @gabe-lyons in #3740
  • fix(CI): Regenerate Docker Quickstart by @jjoyce0510 in #3741
  • fix(DataHubGraph): changing datahub-graph to use underlying session connection. by @varunbharill in #3743
  • fix(ingest): Remove unecessary isalpha check for data platforms + warnings by @jjoyce0510 in #3742
  • feat(snowflake-usage): add knob for direct objects accesssed vs base objects accessed by @gabe-lyons in #3744
  • fix(snowflake): support snowflake allow/deny pattern for lineage and usage by @varunbharill in #3748
  • refactor(gms auth): Remove base64 decoding of token service signing key by @jjoyce0510 in #3747
  • test(ingest): fix pytest warning for class starting with Test by @hsheth2 in #3745
  • feat: enables dbt metadata files to be loaded from URIs by @sgomezvillamor in #3739
  • fix(ingestion): Skipping duplicate tables from ingestion by @treff7es in #3753
  • feat(Stateful Ingestion): 1/3 Stateful ingestion server changes by @rslanka in #3749
  • Fix CVE-2021-44228 continued: log4j constraints to version 2.16.0 by @jjoyce0510 in #3755
  • build(ingest): restrict latest mypy version by @hsheth2 in #3756
  • doc: Add IOMED as a DataHub adopter by @merqurio in #3758
  • docs(spark-lineage): update artifact name and version by @MugdhaHardikar-GSLab in #3760
  • feat(profiler): add upper bound on combined query size by @hsheth2 in #3762
  • feat(ingestion): Mode retry wait logic to avoid hitting Mode API rate limit by @jawadqu in #3761
  • feat(Stateful Ingestion-2/3): Client side changes for checkpointing a source job state. by @rslanka in #3763
  • refactor(test): replace CliRunner with run_datahub_cmd method by @hsheth2 in #3746
  • feat(bigquery): add support for parsing exported bigquery audit logs by @hyunminch in #3680
  • feat(ingest): Adding support for Elasticsearch and Clickhouse by @sudotty in #3227
  • Upgrade to logback 1.2.9 to address CVE-2021-42550 by @jjoyce0510 in #3771
  • fix(profiling): Disabling expensive profilers by default by @treff7es in #3759
  • docs(ingestion): Add details of sensitive info handling by @anshbansal in #3767
  • docs(snowflake): Adding documentation about required Snowflake Privileges by @jjoyce0510 in #3770
  • Upgrade to 3rd Apache patch for log4j by @xiphl in #3772
  • fix(ingestion): Fix for same schema foreign key reference by @treff7es in #3769
  • fix(ingest): fix compatibility with google composer by @anshbansal in #3774

Known Issues

We've been made aware that in large deployments the re-indexing step required at boot-up time exceeds the 30 second timeout. We've since made changes to loosen this timeout limit, with these changes coming in 0.8.21.

Full Changelog: v0.8.19...v0.8.20

v0.8.19

13 Dec 19:13
83207b3

This release is a fast followup to the more substantial 0.8.18 release addressing bugs a few folks are facing in the Community.

Release Highlights

  • Fix an issue on systems where the base64 CLI command is not available.
  • Fix usage user extraction where the email domain was repeated twice.

What's Changed

  • fix(recommendations): don't show a 0 character when there are no suggestions by @gabe-lyons in #3720
  • fix(mode): support definitions in mode query by @gabe-lyons in #3721
  • fix(doc): fixing doc in datahub cli for corpuser urn. by @varunbharill in #3717
  • docs(redshift): Adding svv_table privilege requirement to redshift source doc by @treff7es in #3708
  • fix(profiler): Fixing division by zero in pct_unique calculation by @treff7es in #3727
  • fix(ingest): get mysql geotypes properly by @treff7es in #3726
  • fix(ingest): update trino source error handling in get_table_comment by @mayurinehate in #3712
  • feat(ingest) Trim long sql queries in usage by @treff7es in #3725
  • fix(ingestion): adds missing port to the connection bootstrap by @sgomezvillamor in #3706
  • fix(ingest): add source.config.connection.schema_registry_config to SchemaRegistryClient creation by @lvicentesanchez in #3702
  • fix(docker): Fix issues with base64 not working on some platforms by @dexter-mh-lee in #3723
  • feat(DataHubGraph): Adding utilities methods to DataHubGraph class. by @varunbharill in #3729
  • fix(superset): handle dashboards without charts (#3713) by @grumbler in #3714

Full Changelog: v0.8.18...v0.8.19

v0.8.18

10 Dec 19:45
d651040

DataHub Release 0.8.18 is here!

Release Highlights

  1. Metadata Service Authentication: Make authenticated requests to the Metadata Service APIs (GraphQL + Rest.li)

    1. Video Demo
    2. Technical Deep Dive
  2. Redshift Lineage: Out-of-the-box support for ingesting Dataset->Dataset lineage from Redshift system tables. Includes Tables, Views, and COPY from S3

    1. Video Demo
  3. Apache Nifi Connector (Beta) : Integration with Apache Nifi to extract DataJobs and DataFlows! Read the source docs here. This source is currently incubating in beta.

  4. Mode Connector (Beta): Integration with Mode Analytics to extract reports, charts, and more! Read the source docs here. This source is currently incubating in beta.

  5. Add Aspects without a fork: This is a major milestone towards No-Code UI

    1. Watch the No Code UI Sneak Peek
  6. Glossary Term Transformer: Allows users to add tags or glossary terms to entities based on a regex match filter (Shoutout to Community Member ecooklin!)

  7. Bug Fixes:

    1. [metadata service] Empty search query fails to resolve
    2. [metadata service] Log4j vulnerability addressed!! Highly recommend folks to upgrade to latest.
    3. [metadata ingestion] [bigquery] Fix handling of partitioned & snapshotted tables for lineage usage, and basic table indexing.
    4. [metadata-service] [recommendations] Fix issue where recently viewed and most popular recommendations were not showing up when user urn contains special chars.
    5. [metadata ingestion] Add config to specify ca certificate path for datahub-rest sink
    6. [metadata ingestion][snowflake] Handling for special characters in snowflake databases and schemas.
    7. [ui] Fix Groups page not showing asset ownership correctly
    8. [ui] Fix issue where markdown links were not clickable.
    9. [metadata service] Improve search & recommendations performance by ~50%, homepage load by ~50%.
    10. [cli] Fix delete-by-search not accepting an auth token
    11. [metadata service][policies] Fix invalid Tag creation policy
    12. [metadata service][upgrade] Fix Spring injection of Entity Client inside datahub-upgrade
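
For Metadata Service Authentication (item 1 above), an authenticated call to the GraphQL API might be assembled as in the Python sketch below. It assumes the token is sent as an `Authorization: Bearer` header and that GraphQL is served at `/api/graphql` on the Metadata Service; the base URL, the token value, and the sample query are placeholders.

```python
import json
from urllib.request import Request, urlopen


def build_graphql_request(gms_base: str, token: str, query: str) -> Request:
    """Build (but do not send) an authenticated GraphQL POST request."""
    body = json.dumps({"query": query}).encode("utf-8")
    return Request(
        url=gms_base.rstrip("/") + "/api/graphql",  # assumed endpoint path
        data=body,
        headers={
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values: substitute a real host, token, and query.
    request = build_graphql_request(
        "http://localhost:8080", "<access-token>", "{ me { corpUser { urn } } }"
    )
    with urlopen(request, timeout=10) as response:
        print(response.read().decode("utf-8"))
```

Building the request separately from sending it keeps the header logic easy to check without a running server.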

Backwards Incompatible Changes

  • The standalone Spring GraphQL Service has been removed. (Replaced in full by Metadata Service GraphQL API)

v0.8.17

19 Nov 07:58
f1045f8

Notable Changes

  • Added Recommendations and redesigned the home page!
    • Modular way to add recommendations throughout the application
    • Recommendation modules for top platforms, recently viewed, popular entities, top tags/terms were added to home page
    • Search page also has top tags/terms module on the bottom
  • Ingestion Sources
    • DBT enhancements
      • Creating dbt platform entities to capture dbt node types such as models, tests, sources, seeds, etc., and linking dbt entities with other dbt or underlying platform entities.
    • OpenAPI specs
    • Kafka Connect (Regex based transformers, BigQuery sink)
    • Trino Usage (Starburst)
  • Improved lineage viz performance and lineage viz UX
    • Improved layout logic
    • Nodes can be dragged and dropped
  • Fixes for the delete API not always deleting all of an entity's data
  • Improved documentation for adding a custom Metadata Ingestion Source
    • Fixes description rendering for Charts, Dashboards, Flows, Jobs
  • Add YAML configuration file for Metadata Service
  • Filter search results by Sub-Type (Looker Explore, View, etc)
  • Support proxying DataHub Frontend requests to Metadata Service at /api/gms
  • Multi-platform (x86, arm64) support for Docker images (Apple M1 support)
  • Graph Service: DGraph support (phase 1)
