Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add cli command 'mara catalog connect' #3

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

leo-schick
Copy link
Member

@leo-schick leo-schick commented Nov 21, 2023

A command which connects a data lake with a database engine by executing the required SQL commands to tell the db engine where the data is placed on the storage:

(.venv) [add-catalog-connnect-command][~/git/mara/mara-catalog]$ mara catalog connect --help
Usage: mara catalog connect [OPTIONS]

  Connects a data lake or lakehouse catalog to a database

  Args:     catalog: The catalog to connect to. If not set, all configured
  catalogs will be connected.     db_alias: The db alias the catalog shall be
  connected to. If not set, the default db alias is taken.

      disable_colors: If true, don't use escape sequences to make the log
      colorful (default: colorful logging)

Options:
  --catalog TEXT    The catalog to connect. If not given, all catalogs will be
                    connected.
  --db-alias TEXT   The database the catalog(s) shall be connected to. If not
                    given, the default db alias is used.
  --disable-colors  Output logs without coloring them.
  --help            Show this message and exit.

Current supported database engines:

  • SQL Server Polybase
  • Azure Synapse
  • Databricks
  • Snowflake

Example use case:

You have a data lake with a hadoop-like following folder structure:

/<table_name>/<file_part1...partN].[csv|tsv|json|parquet|ocr|avro]  # typical Hadoop folder structure
/<schema_name>/<table_name>/<file_part1...partN].[csv|tsv|json|parquet|ocr|avro]   # folder structure with schema
/<table_name>/_delta_log    # a delta table (requires pypi package deltalake, not in the default requirements)

e.g.
/sales/invoice/part1.parquet
/sales/invoices/part2.parquet
/sales/invoices/part3.parquet
/sales/orders/part1.parquet
/sales/sales_datamart/part1.parquet
/sales/sales_datamart/part2.parquet

Now you want to use the tables in a database engine. It will require some time for you to create all the metadata in the database engine. With the mara catalog connect command, it can create the required metadata objects so that you can run queryies against the tables:

define in your mara_config.py the catalog:

from mara_app.monkey_patch import patch

from mara_storage.storages import LocalStorage
import mara_storage.config

from mara_catalog.catalog import StorageCatalog
import mara_catalog.config

patch(mara_storage.config.storages)(lambda: {
    'datalake': LocalStorage(base_path='/')
})

patch(mara_db.config.databases)(lambda: {
    'query-engine': SqlcmdSQLServerDB(...),
})

patch(mara_catalog.config.catalogs)(lambda: {
    'sales_data': StorageCatalog(storage_alias='datalake', base_path='sales', default_schema='sales'),
    # or 
    #'sales_data': StorageCatalog(storage_alias='datalake', base_path='.', has_schemas=True),
})

Now you just need to run:

mara catalog connect --catalog sales_data --db-alias query-engine

The command will

  • discover all tables in your data lake base_path
  • if required, it detects automatically the schema.
  • create a pipeline with SQL commands to 1. create the schema if it does not exist and then 2. run SQL commands to tell the db engine metadata where to find the tables and with which columns (if required)
  • run the pipeline

Currently, the command runs in a create_or_replace mode which is typical for dbt. A typical mara mode woudl be replace_schema which is planned.

@leo-schick leo-schick force-pushed the add-catalog-connnect-command branch from e991150 to c5290d1 Compare November 21, 2023 13:55
@leo-schick
Copy link
Member Author

@jankatins maybe you cold throw a quick look a this. Final documentation is still missing but from the description and code it should be easy to get an idea about it. Just if you'd like

@leo-schick leo-schick requested a review from jankatins December 7, 2023 16:30
@leo-schick
Copy link
Member Author

This command is extremely helpful when setting up new environments or when you want to query your data lake with different query engines. PostgreSQL could be supported with e.g. parquet_fdw; but is out of scope for me. Maybe someone else whats to implement it...

@jankatins
Copy link
Member

jankatins commented Dec 7, 2023

I tried to understand what this does and I still am a bit clueless here: If I understand the above right, it basically does some magic to discover datasets from files and sets up a pipeline to copy it over?

If that's the case, I feel that this is mixing a few concepts: I know in aws glue, a catalog is something which describes datasets/ables which are (mostly) parquet files in s3 (so basically files are represented as a "DB"), so from "connect" I would expect that it can connect to that (already existing) AWS glue catalog. I would also expect that "connect" does not do any actions.

  • create catalog (creates an empty catalog)
  • catalog: discover from filesystem/ ... (basically the pedant to an AWS glue crawler)
  • connect to (external) catalog (to catalogs which exist in other systems, e.g. AWS glue) -> loads the metadata and makes all table available via the catalog interface, which would be kind of a DB?
  • Setup forwarders in one DB to the catalog tables?

This assumes that a pipeline with SQL commands to 1. create the schema if it does not exist and then 2. run SQL commands to tell the db engine metadata where to find the tables and with which columns (if required) is only about metadata and not about real data getting copied from somewhere to somewhere else. I would find it strange to have this kind of thing as a pipeline (which I can (or cannot?) see in the UI?) and not as a simple task or even just as a cli command.

So all in all, I still do not get what this is about because it doesn't adhere (or at least seems to not adhere) to the concepts I know from AWS glue catalogs.

@leo-schick
Copy link
Member Author

@jankatins
The name connect is indeed confusing but i did not come up yet with another name for this function. And yes, you seem to be a bit confused :-)

First of all, what is a catalog
A catalog is a mara representation of table (or "model") metadata. Currently without columns. This could be:

  1. a StorageCatalog (see above) - repesents the tables from a data lake or lakehouse. (Note that auto discover of tables is optional. You could decide to specify in code all tables you want to have in the catalog, like you have to specify all entities in the mara-schema package)
  2. a DbCatalog - repesenting table metadata in a database
  3. a DbtDbCatalog/DbtStorageCatalog - representing the available models in a dbt project, taken from a manifest file

It is not decided what you want to do with this catalog information at this point. Here some ideas:

  • show the catalog in the UI
  • cache/load a "discovered" catalog from disk, e.g. in a JSON file
  • you want to do a dbt fashion like mara catalog run -s <catalog_name.model> command which can build your models. (woudl require dependency management + a sub pipeline per model).

The catalog itself is - first of all - just a python code representation of table metadata.

What is this function?
It is a helper function to inform a database system about where to find the models (so: creating/updating the metadata in the db engine). It is developed for only storage catalogs.

Just think of the dbt package dbt-external-tables:
There you speicify in yml files how a souce table looks like. Then you have the option to call dbt run-operation stage_external_sources to create these tables in the database engine so that you can use them.

In my case above, I use the "crawler function" to auto-discover the tables from the storage. This is a nice feature since many data bases do not support a crawler like AWS glue. Since I there is not yet any caching, I put this into a single command. But the main feature of this command is to update the metadata of the db engine (e.g. calling CREATE OR REPLACE EXTERNAL TABLE). The table discovery is just a feature on top which is given by this mara module.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants