Docs: Auto Classification Doc Addition (open-metadata#19148)
Co-authored-by: Rounak Dhillon <[email protected]>
1 parent 68a0728 · commit 5297b16
Showing 30 changed files with 477 additions and 28 deletions.
2 changes: 1 addition & 1 deletion
...bservability/profiler/auto-pii-tagging.md → ...n/Auto Classification/auto-pii-tagging.md
129 changes: 129 additions & 0 deletions
...-guides/data-governance/classification/Auto Classification/external-workflow.md
@@ -0,0 +1,129 @@
---
title: External Auto Classification Workflow
slug: /how-to-guides/data-governance/classification/auto/external-workflow
---

# Auto Classification Workflow Configuration

The Auto Classification Workflow enables automatic tagging of sensitive information within databases. Below are the configuration parameters available in the **Service Classification Pipeline JSON**.

## Pipeline Configuration Parameters

| **Parameter** | **Description** | **Type** | **Default Value** |
|---|---|---|---|
| `type` | Specifies the pipeline type. | String | `AutoClassification` |
| `classificationFilterPattern` | Regex to compute metrics for tables matching specific tags, tiers, or glossary patterns. | Object | N/A |
| `schemaFilterPattern` | Regex to fetch schemas matching the specified pattern. | Object | N/A |
| `tableFilterPattern` | Regex to fetch tables matching the specified pattern. | Object | N/A |
| `databaseFilterPattern` | Regex to fetch databases matching the specified pattern. | Object | N/A |
| `includeViews` | Option to include or exclude views during metadata ingestion. | Boolean | `true` |
| `useFqnForFiltering` | Determines whether filtering is applied to the Fully Qualified Name (FQN) instead of raw names. | Boolean | `false` |
| `storeSampleData` | Option to enable or disable storing sample data for each table. | Boolean | `true` |
| `enableAutoClassification` | Enables automatic tagging of columns that might contain sensitive information. | Boolean | `false` |
| `confidence` | Confidence level for tagging columns as sensitive. Value ranges from 0 to 100. | Number | `80` |
| `sampleDataCount` | Number of sample rows to ingest when Store Sample Data is enabled. | Integer | `50` |

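Each of the filter patterns above (`classificationFilterPattern`, `databaseFilterPattern`, `schemaFilterPattern`, `tableFilterPattern`) is an object holding `includes` and/or `excludes` lists of regular expressions. As a minimal sketch (the regexes are illustrative, not defaults), a fragment of `sourceConfig.config` could look like this:

```yaml
# Illustrative filter-pattern fragment for sourceConfig.config;
# pattern values are examples only.
databaseFilterPattern:
  includes:
    - ^analytics_.*        # only databases starting with "analytics_"
schemaFilterPattern:
  excludes:
    - ^tmp_.*              # skip scratch schemas
tableFilterPattern:
  includes:
    - .*_customers$        # keep customer tables
  excludes:
    - .*_backup$           # but never their backup copies
```
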
## Key Parameters Explained

### `enableAutoClassification`
- Set this to `true` to enable automatic detection of sensitive columns (e.g., PII).
- Applies pattern recognition and tagging based on predefined criteria.

### `confidence`
- Confidence level for tagging sensitive columns:
  - A higher confidence value (e.g., `90`) reduces false positives but may miss some sensitive data.
  - A lower confidence value (e.g., `70`) identifies more sensitive columns but may result in false positives.

### `storeSampleData`
- Controls whether sample rows are stored during ingestion.
- If enabled, the specified number of rows (`sampleDataCount`) will be fetched for each table.

### `useFqnForFiltering`
- When set to `true`, filtering patterns will be applied to the Fully Qualified Name of a table (e.g., `service_name.db_name.schema_name.table_name`).
- When set to `false`, filtering applies only to raw table names.

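As a purely illustrative fragment (the values are examples, not recommendations), these tuning options sit alongside the other `sourceConfig.config` keys; the full BigQuery sample in the next section omits some of them:

```yaml
# Illustrative fragment of sourceConfig.config combining the key parameters above;
# values are examples only.
sourceConfig:
  config:
    type: AutoClassification
    enableAutoClassification: true   # detect and tag sensitive columns
    confidence: 85                   # minimum score before a column is tagged
    storeSampleData: true            # keep sample rows for each table
    sampleDataCount: 50              # how many rows to sample per table
    useFqnForFiltering: true         # apply filters to service.db.schema.table
```
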
## Sample Auto Classification Workflow YAML

```yaml
source:
  type: bigquery
  serviceName: local_bigquery
  serviceConnection:
    config:
      type: BigQuery
      credentials:
        gcpConfig:
          type: service_account
          projectId: my-project-id-1234
          privateKeyId: privateKeyID
          privateKey: "-----BEGIN PRIVATE KEY-----\nmySuperSecurePrivateKey==\n-----END PRIVATE KEY-----\n"
          clientEmail: [email protected]
          clientId: "1234567890"
          authUri: https://accounts.google.com/o/oauth2/auth
          tokenUri: https://oauth2.googleapis.com/token
          authProviderX509CertUrl: https://www.googleapis.com/oauth2/v1/certs
          clientX509CertUrl: https://www.googleapis.com/oauth2/v1/certs
  sourceConfig:
    config:
      type: AutoClassification
      storeSampleData: true
      enableAutoClassification: true
      databaseFilterPattern:
        includes:
          - hello-world-1234
      schemaFilterPattern:
        includes:
          - super_schema
      tableFilterPattern:
        includes:
          - abc

processor:
  type: "orm-profiler"
  config:
    tableConfig:
      - fullyQualifiedName: local_bigquery.hello-world-1234.super_schema.abc
        profileSample: 85
        partitionConfig:
          partitionQueryDuration: 180
        columnConfig:
          excludeColumns:
            - a
            - b

sink:
  type: metadata-rest
  config: {}
workflowConfig:
  # loggerLevel: INFO  # DEBUG, INFO, WARN or ERROR
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: openmetadata
    securityConfig:
      jwtToken: "eyJraWQiOiJHYjM4OWEtOWY3Ni1nZGpzLWE5MmotMDI0MmJrOTQzNTYiLCJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJzdWIiOiJhZG1pbiIsImlzQm90IjpmYWxzZSwiaXNzIjoib3Blbi1tZXRhZGF0YS5vcmciLCJpYXQiOjE2NjM5Mzg0NjIsImVtYWlsIjoiYWRtaW5Ab3Blbm1ldGFkYXRhLm9yZyJ9.tS8um_5DKu7HgzGBzS1VTA5uUjKWOCU0B_j08WXBiEC0mr0zNREkqVfwFDD-d24HlNEbrqioLsBuFRiwIWKc1m_ZlVQbG7P36RUxhuv2vbSp80FKyNM-Tj93FDzq91jsyNmsQhyNv_fNr3TXfzzSPjHt8Go0FMMP66weoKMgW2PbXlhVKwEuXUHyakLLzewm9UMeQaEiRzhiTMU3UkLXcKbYEJJvfNFcLwSl9W8JCO_l0Yj3ud-qt_nQYEZwqW6u5nfdQllN133iikV4fM5QZsMCnm8Rq1mvLR0y9bmJiD7fwM1tmJ791TUWqmKaTnP49U493VanKpUAfzIiOiIbhg"
```

## Workflow Execution | ||
### To Execute the Auto Classification Workflow: | ||
1. **Create a Pipeline** | ||
- Configure the Auto Classification JSON as demonstrated in the provided configuration example. | ||
2. **Run the Ingestion Pipeline**
- Use OpenMetadata or an external scheduler such as Argo to trigger the pipeline execution (an illustrative Argo sketch follows these steps).
3. **Validate Results** | ||
- Verify the metadata and tags applied to sensitive columns in the OpenMetadata UI. | ||
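For step 2, the sketch below shows one way an external scheduler such as Argo could trigger the run on a schedule. Everything here is an assumption for illustration: it presumes the workflow YAML above is mounted into the pod at `/config/auto_classification.yaml` via a ConfigMap, that the `openmetadata/ingestion` image tag matches your deployment, and that the `metadata` CLI subcommand shown is the right one for your OpenMetadata version (check `metadata --help`).

```yaml
# Hypothetical Argo CronWorkflow that runs the Auto Classification pipeline nightly.
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: auto-classification-nightly          # hypothetical name
spec:
  schedule: "0 2 * * *"                      # every night at 02:00
  workflowSpec:
    entrypoint: classify
    templates:
      - name: classify
        container:
          image: openmetadata/ingestion:1.6.1   # assumed tag; align with your server version
          command: ["metadata"]
          # Assumption: this subcommand accepts the AutoClassification pipeline YAML on your
          # version; substitute the subcommand your CLI documents.
          args: ["ingest", "-c", "/config/auto_classification.yaml"]
          volumeMounts:
            - name: workflow-config
              mountPath: /config
    volumes:
      - name: workflow-config
        configMap:
          name: auto-classification-config    # hypothetical ConfigMap holding the YAML
```
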
### Expected Outcomes | ||
- **Automatic Tagging:** | ||
Columns containing sensitive information (e.g., names, emails, SSNs) are automatically tagged based on predefined confidence levels. | ||
- **Enhanced Visibility:** | ||
Gain improved visibility and classification of sensitive data within your databases. | ||
- **Sample Data Integration:** | ||
Store sample data to provide better insights during profiling and testing workflows. | ||
70 changes: 70 additions & 0 deletions
....x/how-to-guides/data-governance/classification/Auto Classification/workflow.md
@@ -0,0 +1,70 @@
---
title: Adding Auto Classification Workflow through UI
slug: /how-to-guides/data-governance/classification/auto/workflow
---

# Adding Auto Classification Ingestion through the UI

Follow these steps to configure Auto Classification ingestion via the OpenMetadata UI:

## 1. Navigate to the Database Service
- Go to **Settings > Services > Databases** in the OpenMetadata UI.
- Select the database for which you want to configure Auto Classification ingestion.

{% image
src="/images/v1.6/how-to-guides/governance/ac-1.png"
alt="Settings"
caption="Settings"
/%}

{% image
src="/images/v1.6/how-to-guides/governance/ac-1.1.png"
alt="Services"
caption="Services"
/%}

{% image
src="/images/v1.6/how-to-guides/governance/ac-2.png"
alt="Databases"
caption="Databases"
/%}

## 2. Access the Ingestion Tab
- In the selected database, navigate to the **Ingestion** tab.
- Click on the option to **Add Auto Classification Ingestion**, as shown in the example image.

{% image
src="/images/v1.6/how-to-guides/governance/ac-3.png"
alt="Access the Ingestion Tab"
caption="Access the Ingestion Tab"
/%}

## 3. Configure Auto Classification Details
- Fill in the details for your Auto Classification ingestion workflow.
- Each field's purpose is explained directly in the UI, allowing you to customize the configuration based on your requirements.

{% image
src="/images/v1.6/how-to-guides/governance/ac-4.png"
alt="Configure Auto Classification Details"
caption="Configure Auto Classification Details"
/%}

## 4. Set the Schedule
- Specify the time interval at which the Auto Classification ingestion should run.

{% image
src="/images/v1.6/how-to-guides/governance/ac-5.png"
alt="Set the Schedule"
caption="Set the Schedule"
/%}

## 5. Add the Ingestion Workflow
- Once all details are configured, click **Add Auto Classification Ingestion** to save and activate the workflow.

{% image
src="/images/v1.6/how-to-guides/governance/ac-6.png"
alt="Add the Ingestion Workflow"
caption="Add the Ingestion Workflow"
/%}

By following these steps, you can set up an Auto Classification ingestion workflow to automatically identify and tag sensitive data in your databases.
47 changes: 47 additions & 0 deletions
...o-guides/data-governance/classification/Auto Classification/auto-pii-tagging.md
@@ -0,0 +1,47 @@
---
title: Auto PII Tagging
slug: /how-to-guides/data-governance/classification/auto/auto-pii-tagging
---

# Auto PII Tagging

Auto PII tagging classifies columns as Sensitive or NonSensitive based on the two approaches described below.

{% note %}
PII Tagging is only available during `Profiler Ingestion`.
{% /note %}

## Tagging logic

1. **Column Name Scanner**: We validate the column names of the table against a set of regex rules covering common English patterns for email addresses, SSNs, bank accounts, etc.
2. **Entity Recognition**: If sample data ingestion is enabled, we validate the sample rows against an Entity Recognition engine that surfaces sensitive information from a list of [supported entities](https://microsoft.github.io/presidio/supported_entities/). In that case, the `confidence` parameter lets you tune the minimum score required to tag a column as `PII.Sensitive`.

Note that if a column is already tagged as `PII`, it is skipped.

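For reference, entity recognition is driven by the same pipeline options described in the external workflow configuration page; a rough, illustrative fragment of `sourceConfig.config` might look like this (values are examples, not defaults beyond what the parameter table states):

```yaml
# Illustrative fragment only; see the pipeline configuration page for the full parameter list.
sourceConfig:
  config:
    type: AutoClassification
    storeSampleData: true            # sample data ingestion feeds the entity recognition step
    enableAutoClassification: true   # enable automatic PII tagging
    confidence: 85                   # minimum score to tag a column as PII.Sensitive
    sampleDataCount: 50              # rows sampled per table
```
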
## Troubleshooting

### SSL: CERTIFICATE_VERIFY_FAILED

If you see an error similar to:

```
Unexpected error while processing sample data for auto pii tagging - HTTPSConnectionPool(host='raw.githubusercontent.com', port=443):
Max retries exceeded with url: /explosion/spacy-models/master/compatibility.json
(Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to
get local issuer certificate (_ssl.c:1129)')))
```

We have identified this scenario on some corporate Windows laptops. The bottom line is that the profiler is trying to download the Entity Recognition model but runs into certificate issues when making the request.

A solution is to manually download the model on the ingestion container or Airflow host by running:

```
pip --trusted-host github.com --trusted-host objects.githubusercontent.com install https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.5.0/en_core_web_md-3.5.0.tar.gz
```

If you are using Docker, you might want to customize the `openmetadata-ingestion` image to run this command by default.