Docs: Auto Classification Doc Addition (open-metadata#19148)
Co-authored-by: Rounak Dhillon <[email protected]>
1 parent 68a0728 · commit 5297b16
Showing 30 changed files with 477 additions and 28 deletions.
2 changes: 1 addition & 1 deletion
...bservability/profiler/auto-pii-tagging.md → ...n/Auto Classification/auto-pii-tagging.md
129 changes: 129 additions & 0 deletions
...-guides/data-governance/classification/Auto Classification/external-workflow.md
@@ -0,0 +1,129 @@
---
title: External Auto Classification Workflow
slug: /how-to-guides/data-governance/classification/auto/external-workflow
---

# Auto Classification Workflow Configuration

The Auto Classification Workflow enables automatic tagging of sensitive information within databases. Below are the configuration parameters available in the **Service Classification Pipeline JSON**.

## Pipeline Configuration Parameters

| **Parameter** | **Description** | **Type** | **Default Value** |
|---|---|---|---|
| `type` | Specifies the pipeline type. | String | `AutoClassification` |
| `classificationFilterPattern` | Regex to compute metrics for tables matching specific tags, tiers, or glossary patterns. | Object | N/A |
| `schemaFilterPattern` | Regex to fetch schemas matching the specified pattern. | Object | N/A |
| `tableFilterPattern` | Regex to fetch tables matching the specified pattern. | Object | N/A |
| `databaseFilterPattern` | Regex to fetch databases matching the specified pattern. | Object | N/A |
| `includeViews` | Option to include or exclude views during metadata ingestion. | Boolean | `true` |
| `useFqnForFiltering` | Determines whether filtering is applied to the Fully Qualified Name (FQN) instead of raw names. | Boolean | `false` |
| `storeSampleData` | Option to enable or disable storing sample data for each table. | Boolean | `true` |
| `enableAutoClassification` | Enables automatic tagging of columns that might contain sensitive information. | Boolean | `false` |
| `confidence` | Confidence level for tagging columns as sensitive. Value ranges from 0 to 100. | Number | `80` |
| `sampleDataCount` | Number of sample rows to ingest when Store Sample Data is enabled. | Integer | `50` |

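Each of the filter patterns above (`classificationFilterPattern`, `databaseFilterPattern`, `schemaFilterPattern`, `tableFilterPattern`) is an object holding `includes` and/or `excludes` lists of regular expressions. As a minimal sketch (the regexes are illustrative, not defaults), a fragment of `sourceConfig.config` could look like this:

```yaml
# Illustrative filter-pattern fragment for sourceConfig.config;
# pattern values are examples only.
databaseFilterPattern:
  includes:
    - ^analytics_.*        # only databases starting with "analytics_"
schemaFilterPattern:
  excludes:
    - ^tmp_.*              # skip scratch schemas
tableFilterPattern:
  includes:
    - .*_customers$        # keep customer tables
  excludes:
    - .*_backup$           # but never their backup copies
```
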
## Key Parameters Explained

### `enableAutoClassification`
- Set this to `true` to enable automatic detection of sensitive columns (e.g., PII).
- Applies pattern recognition and tagging based on predefined criteria.

### `confidence`
- Confidence level for tagging sensitive columns:
  - A higher confidence value (e.g., `90`) reduces false positives but may miss some sensitive data.
  - A lower confidence value (e.g., `70`) identifies more sensitive columns but may result in false positives.

### `storeSampleData`
- Controls whether sample rows are stored during ingestion.
- If enabled, the specified number of rows (`sampleDataCount`) will be fetched for each table.

### `useFqnForFiltering`
- When set to `true`, filtering patterns will be applied to the Fully Qualified Name of a table (e.g., `service_name.db_name.schema_name.table_name`).
- When set to `false`, filtering applies only to raw table names.

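As a purely illustrative fragment (the values are examples, not recommendations), these tuning options sit alongside the other `sourceConfig.config` keys; the full BigQuery sample in the next section omits some of them:

```yaml
# Illustrative fragment of sourceConfig.config combining the key parameters above;
# values are examples only.
sourceConfig:
  config:
    type: AutoClassification
    enableAutoClassification: true   # detect and tag sensitive columns
    confidence: 85                   # minimum score before a column is tagged
    storeSampleData: true            # keep sample rows for each table
    sampleDataCount: 50              # how many rows to sample per table
    useFqnForFiltering: true         # apply filters to service.db.schema.table
```
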
## Sample Auto Classification Workflow YAML

```yaml
source:
  type: bigquery
  serviceName: local_bigquery
  serviceConnection:
    config:
      type: BigQuery
      credentials:
        gcpConfig:
          type: service_account
          projectId: my-project-id-1234
          privateKeyId: privateKeyID
          privateKey: "-----BEGIN PRIVATE KEY-----\nmySuperSecurePrivateKey==\n-----END PRIVATE KEY-----\n"
          clientEmail: [email protected]
          clientId: "1234567890"
          authUri: https://accounts.google.com/o/oauth2/auth
          tokenUri: https://oauth2.googleapis.com/token
          authProviderX509CertUrl: https://www.googleapis.com/oauth2/v1/certs
          clientX509CertUrl: https://www.googleapis.com/oauth2/v1/certs
  sourceConfig:
    config:
      type: AutoClassification
      storeSampleData: true
      enableAutoClassification: true
      databaseFilterPattern:
        includes:
          - hello-world-1234
      schemaFilterPattern:
        includes:
          - super_schema
      tableFilterPattern:
        includes:
          - abc

processor:
  type: "orm-profiler"
  config:
    tableConfig:
      - fullyQualifiedName: local_bigquery.hello-world-1234.super_schema.abc
        profileSample: 85
        partitionConfig:
          partitionQueryDuration: 180
        columnConfig:
          excludeColumns:
            - a
            - b

sink:
  type: metadata-rest
  config: {}
workflowConfig:
  # loggerLevel: INFO  # DEBUG, INFO, WARN or ERROR
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: openmetadata
    securityConfig:
      jwtToken: "eyJraWQiOiJHYjM4OWEtOWY3Ni1nZGpzLWE5MmotMDI0MmJrOTQzNTYiLCJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJzdWIiOiJhZG1pbiIsImlzQm90IjpmYWxzZSwiaXNzIjoib3Blbi1tZXRhZGF0YS5vcmciLCJpYXQiOjE2NjM5Mzg0NjIsImVtYWlsIjoiYWRtaW5Ab3Blbm1ldGFkYXRhLm9yZyJ9.tS8um_5DKu7HgzGBzS1VTA5uUjKWOCU0B_j08WXBiEC0mr0zNREkqVfwFDD-d24HlNEbrqioLsBuFRiwIWKc1m_ZlVQbG7P36RUxhuv2vbSp80FKyNM-Tj93FDzq91jsyNmsQhyNv_fNr3TXfzzSPjHt8Go0FMMP66weoKMgW2PbXlhVKwEuXUHyakLLzewm9UMeQaEiRzhiTMU3UkLXcKbYEJJvfNFcLwSl9W8JCO_l0Yj3ud-qt_nQYEZwqW6u5nfdQllN133iikV4fM5QZsMCnm8Rq1mvLR0y9bmJiD7fwM1tmJ791TUWqmKaTnP49U493VanKpUAfzIiOiIbhg"
```

## Workflow Execution | ||
### To Execute the Auto Classification Workflow: | ||
1. **Create a Pipeline** | ||
- Configure the Auto Classification JSON as demonstrated in the provided configuration example. | ||
2. **Run the Ingestion Pipeline**
- Use OpenMetadata or an external scheduler such as Argo to trigger the pipeline execution (an illustrative Argo sketch follows these steps).
3. **Validate Results** | ||
- Verify the metadata and tags applied to sensitive columns in the OpenMetadata UI. | ||
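For step 2, the sketch below shows one way an external scheduler such as Argo could trigger the run on a schedule. Everything here is an assumption for illustration: it presumes the workflow YAML above is mounted into the pod at `/config/auto_classification.yaml` via a ConfigMap, that the `openmetadata/ingestion` image tag matches your deployment, and that the `metadata` CLI subcommand shown is the right one for your OpenMetadata version (check `metadata --help`).

```yaml
# Hypothetical Argo CronWorkflow that runs the Auto Classification pipeline nightly.
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: auto-classification-nightly          # hypothetical name
spec:
  schedule: "0 2 * * *"                      # every night at 02:00
  workflowSpec:
    entrypoint: classify
    templates:
      - name: classify
        container:
          image: openmetadata/ingestion:1.6.1   # assumed tag; align with your server version
          command: ["metadata"]
          # Assumption: this subcommand accepts the AutoClassification pipeline YAML on your
          # version; substitute the subcommand your CLI documents.
          args: ["ingest", "-c", "/config/auto_classification.yaml"]
          volumeMounts:
            - name: workflow-config
              mountPath: /config
    volumes:
      - name: workflow-config
        configMap:
          name: auto-classification-config    # hypothetical ConfigMap holding the YAML
```
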
### Expected Outcomes | ||
- **Automatic Tagging:** | ||
Columns containing sensitive information (e.g., names, emails, SSNs) are automatically tagged based on predefined confidence levels. | ||
- **Enhanced Visibility:** | ||
Gain improved visibility and classification of sensitive data within your databases. | ||
- **Sample Data Integration:** | ||
Store sample data to provide better insights during profiling and testing workflows. | ||
70 changes: 70 additions & 0 deletions
....x/how-to-guides/data-governance/classification/Auto Classification/workflow.md
@@ -0,0 +1,70 @@
---
title: Adding Auto Classification Workflow through UI
slug: /how-to-guides/data-governance/classification/auto/workflow
---

# Adding Auto Classification Ingestion through the UI

Follow these steps to configure Auto Classification ingestion via the OpenMetadata UI:

## 1. Navigate to the Database Service
- Go to **Settings > Services > Databases** in the OpenMetadata UI.
- Select the database for which you want to configure Auto Classification ingestion.

{% image
src="/images/v1.6/how-to-guides/governance/ac-1.png"
alt="Settings"
caption="Settings"
/%}

{% image
src="/images/v1.6/how-to-guides/governance/ac-1.1.png"
alt="Services"
caption="Services"
/%}

{% image
src="/images/v1.6/how-to-guides/governance/ac-2.png"
alt="Databases"
caption="Databases"
/%}

## 2. Access the Ingestion Tab
- In the selected database, navigate to the **Ingestion** tab.
- Click on the option to **Add Auto Classification Ingestion**, as shown in the example image.

{% image
src="/images/v1.6/how-to-guides/governance/ac-3.png"
alt="Access the Ingestion Tab"
caption="Access the Ingestion Tab"
/%}

## 3. Configure Auto Classification Details
- Fill in the details for your Auto Classification ingestion workflow.
- Each field's purpose is explained directly in the UI, allowing you to customize the configuration based on your requirements.

{% image
src="/images/v1.6/how-to-guides/governance/ac-4.png"
alt="Configure Auto Classification Details"
caption="Configure Auto Classification Details"
/%}

## 4. Set the Schedule
- Specify the time interval at which the Auto Classification ingestion should run.

{% image
src="/images/v1.6/how-to-guides/governance/ac-5.png"
alt="Set the Schedule"
caption="Set the Schedule"
/%}

## 5. Add the Ingestion Workflow
- Once all details are configured, click **Add Auto Classification Ingestion** to save and activate the workflow.

{% image
src="/images/v1.6/how-to-guides/governance/ac-6.png"
alt="Add the Ingestion Workflow"
caption="Add the Ingestion Workflow"
/%}

By following these steps, you can set up an Auto Classification ingestion workflow to automatically identify and tag sensitive data in your databases.
47 changes: 47 additions & 0 deletions
...o-guides/data-governance/classification/Auto Classification/auto-pii-tagging.md
@@ -0,0 +1,47 @@
---
title: Auto PII Tagging
slug: /how-to-guides/data-governance/classification/auto/auto-pii-tagging
---

# Auto PII Tagging

Auto PII tagging classifies columns as Sensitive or NonSensitive based on the two approaches described below.

{% note %}
PII Tagging is only available during `Profiler Ingestion`.
{% /note %}

## Tagging logic

1. **Column Name Scanner**: We validate the column names of the table against a set of regex rules covering common English patterns for email addresses, SSNs, bank accounts, etc.
2. **Entity Recognition**: If sample data ingestion is enabled, we validate the sample rows against an Entity Recognition engine that surfaces sensitive information from a list of [supported entities](https://microsoft.github.io/presidio/supported_entities/). In that case, the `confidence` parameter lets you tune the minimum score required to tag a column as `PII.Sensitive`.

Note that if a column is already tagged as `PII`, it is skipped.

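For reference, entity recognition is driven by the same pipeline options described in the external workflow configuration page; a rough, illustrative fragment of `sourceConfig.config` might look like this (values are examples, not defaults beyond what the parameter table states):

```yaml
# Illustrative fragment only; see the pipeline configuration page for the full parameter list.
sourceConfig:
  config:
    type: AutoClassification
    storeSampleData: true            # sample data ingestion feeds the entity recognition step
    enableAutoClassification: true   # enable automatic PII tagging
    confidence: 85                   # minimum score to tag a column as PII.Sensitive
    sampleDataCount: 50              # rows sampled per table
```
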
## Troubleshooting

### SSL: CERTIFICATE_VERIFY_FAILED

If you see an error similar to:

```
Unexpected error while processing sample data for auto pii tagging - HTTPSConnectionPool(host='raw.githubusercontent.com', port=443):
Max retries exceeded with url: /explosion/spacy-models/master/compatibility.json
(Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to
get local issuer certificate (_ssl.c:1129)')))
```

We have identified this scenario on some corporate Windows laptops. The bottom line is that the profiler is trying to download the Entity Recognition model but runs into certificate issues when making the request.

A solution is to manually download the model on the ingestion container or Airflow host by running:

```
pip --trusted-host github.com --trusted-host objects.githubusercontent.com install https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.5.0/en_core_web_md-3.5.0.tar.gz
```

If you are using Docker, you might want to customize the `openmetadata-ingestion` image to run this command by default.