CyprienRicque/stdflow

stdflow

Data flow tool that transforms your notebooks and Python files into pipeline steps by standardizing data input and output (for data science projects).

Create clean data flow pipelines just by replacing your pd.read_csv() and df.to_csv() calls with sf.load() and sf.save().

Documentation

Install

pip install stdflow

How to use

Pipelines

from stdflow import StepRunner
from stdflow.pipeline import Pipeline

# Pipeline with 2 steps

dm = "../demo_project/notebooks/"

ingestion_ppl = Pipeline([
    StepRunner(dm + "01_ingestion/countries.ipynb"), 
    StepRunner(dm + "01_ingestion/world_happiness.ipynb")
])

# === OR ===
ingestion_ppl = Pipeline(
    StepRunner(dm + "01_ingestion/countries.ipynb"), 
    StepRunner(dm + "01_ingestion/world_happiness.ipynb")
)

# === OR ===
ingestion_ppl = Pipeline()

ingestion_ppl.add_step(StepRunner(dm + "01_ingestion/countries.ipynb"))
# OR
ingestion_ppl.add_step(dm + "01_ingestion/world_happiness.ipynb")


ingestion_ppl
================================
            PIPELINE            
================================

STEP 1
    path: ../demo_project/notebooks/01_ingestion/countries.ipynb
    vars: {}

STEP 2
    path: ../demo_project/notebooks/01_ingestion/world_happiness.ipynb
    vars: {}

================================

Run the pipeline

ingestion_ppl.run(verbose=True, kernel=":any_available")
=================================================================================
    01.                ../demo_project/notebooks/01_ingestion/countries.ipynb
=================================================================================
Variables: {}
using kernel:  python3
    Path: ../demo_project/notebooks/01_ingestion/countries.ipynb
    Duration: 0 days 00:00:00.603051
    Env: {}
Notebook executed successfully.


=================================================================================
    02.          ../demo_project/notebooks/01_ingestion/world_happiness.ipynb
=================================================================================
Variables: {}
using kernel:  python3
    Path: ../demo_project/notebooks/01_ingestion/world_happiness.ipynb
    Duration: 0 days 00:00:00.644909
    Env: {}
Notebook executed successfully.

Load and save data

Option 1: Specify All Parameters

import stdflow as sf
import pandas as pd


# load data from ../demo_project/data/countries/step_created/<last version>/countries.csv
df = sf.load(
   root="../demo_project/data/",
   attrs=['countries'],
   step='created',
   version=':last',  # loads last version in alphanumeric order
   file_name='countries.csv',
   method=pd.read_csv,  # or method='csv'
   verbose=False,
)

# export data to ../demo_project/data/countries/step_loaded/v_2023-03/countries.csv
sf.save(
   df,
   root="../demo_project/data/",
   attrs='countries/',
   step='loaded',
   version='%Y-03',  # strftime pattern: creates v_2023-03 when run in 2023
   file_name='countries.csv',
   method=pd.DataFrame.to_csv,  # or method='csv'  or any function that takes the object to export as first input
)
attrs=countries/::step_name=loaded::version=2023-03::file_name=countries.csv
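The version argument above is a date pattern. As a minimal sketch of how such a pattern could expand into a folder name, assuming standard strftime semantics and the v_ prefix described in this README (version_folder is a hypothetical helper, not part of stdflow):

```python
from datetime import datetime


def version_folder(pattern: str, now: datetime) -> str:
    """Hypothetical helper: expand a strftime pattern into a 'v_' folder name."""
    return "v_" + now.strftime(pattern)


# A fixed timestamp keeps the example reproducible.
now = datetime(2023, 10, 10, 17, 16)

print(version_folder("%Y-03", now))       # v_2023-03
print(version_folder("%Y%m%d%H%M", now))  # v_202310101716
```

Only the %Y part of '%Y-03' depends on the run date, so the same code run in a later year would write to a different folder.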

Each time you perform a save, a metadata.json file is created in the folder. This keeps track of how your data was created and other information.

Option 2: Use default variables

import stdflow as sf
sf.reset()  # used when multiple steps are done with the same Step object (not recommended). see below

# use package level default values
sf.root = "../demo_project/data/"
sf.attrs = 'countries'  # if needed use attrs_in and attrs_out
sf.step_in = 'loaded'
sf.step_out = 'formatted'

df = sf.load()
# ! root / attrs / step : used from default values set above
# ! version : the last version was automatically used. default: ":last"
# ! file_name : the file, alone in the folder, was automatically found
# ! method : was automatically used from the file extension

sf.save(df)
# ! root / attrs / step : used from default values set above
# ! version: used default %Y%m%d%H%M format
# ! file_name: used from the input (because only one file)
# ! method : inferred from file name
attrs=countries::step_name=formatted::version=202310101716::file_name=countries.csv
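The defaults listed in the comments above (last version, single file in the folder, method from the extension) can be pictured with a small standalone sketch. It only illustrates the rules as this README describes them and is not stdflow's actual code. The helper names are hypothetical, and skipping metadata.json when looking for the data file is an assumption.

```python
import tempfile
from pathlib import Path


def last_version(step_dir: Path) -> Path:
    """Return the version folder that sorts last in alphanumeric order."""
    versions = sorted(
        (p for p in step_dir.iterdir() if p.is_dir() and p.name.startswith("v_")),
        key=lambda p: p.name,
    )
    if not versions:
        raise FileNotFoundError(f"no version folder in {step_dir}")
    return versions[-1]


def only_data_file(version_dir: Path) -> Path:
    """Return the single data file in the folder (assumption: metadata.json is skipped)."""
    files = [p for p in version_dir.iterdir() if p.is_file() and p.name != "metadata.json"]
    if len(files) != 1:
        raise ValueError(f"expected exactly one data file, found {len(files)}")
    return files[0]


def method_from_extension(path: Path) -> str:
    """Infer a method name such as 'csv' from the file extension."""
    return path.suffix.lstrip(".").lower()


# Build a throwaway folder tree that mimics the layout used in this README.
with tempfile.TemporaryDirectory() as tmp:
    step_dir = Path(tmp) / "countries" / "step_loaded"
    (step_dir / "v_202309212245").mkdir(parents=True)
    (step_dir / "v_202310101716").mkdir()
    (step_dir / "v_202310101716" / "countries.csv").write_text("name\nFrance\n")
    (step_dir / "v_202310101716" / "metadata.json").write_text("{}")

    version_dir = last_version(step_dir)
    data_file = only_data_file(version_dir)
    print(version_dir.name)                  # v_202310101716
    print(data_file.name)                    # countries.csv
    print(method_from_extension(data_file))  # csv
```

Because the ordering is alphanumeric, timestamp-style versions such as the default %Y%m%d%H%M sort chronologically, which is what makes ":last" pick the most recent export.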

Note that everything done at package level can also be done with the Step class. When you have multiple steps in a notebook, you can create one Step object per step. At package level, stdflow (sf) is a singleton instance of Step.

from stdflow import Step

step = Step(
    root="../demo_project/data/",
    attrs='countries',
    step_in='formatted',
    step_out='pre_processed'
)
# or set after
step.root = "../demo_project/data/"
# ...

df = step.load(version=':last', file_name=":auto", verbose=True)

step.save(df, verbose=True)
INFO:stdflow.step:Loading data from ../demo_project/data/countries/step_formatted/v_202310101716/countries.csv
INFO:stdflow.step:Data loaded from ../demo_project/data/countries/step_formatted/v_202310101716/countries.csv
INFO:stdflow.step:Saving data to ../demo_project/data/countries/step_pre_processed/v_202310101716/countries.csv
INFO:stdflow.step:Data saved to ../demo_project/data/countries/step_pre_processed/v_202310101716/countries.csv
INFO:stdflow.step:Saving metadata to ../demo_project/data/countries/step_pre_processed/v_202310101716/

attrs=countries::step_name=pre_processed::version=202310101716::file_name=countries.csv

Do not

  • Save to the same directory from different steps, because this erases the metadata from the previous step.

Data visualization

import stdflow as sf

step.save(df, verbose=True, export_viz_tool=True)
INFO:stdflow.step:Saving data to ../demo_project/data/countries/step_pre_processed/v_202310101716/countries.csv
INFO:stdflow.step:Data saved to ../demo_project/data/countries/step_pre_processed/v_202310101716/countries.csv
INFO:stdflow.step:Saving metadata to ../demo_project/data/countries/step_pre_processed/v_202310101716/
INFO:stdflow.step:Exporting viz tool to ../demo_project/data/countries/step_pre_processed/v_202310101716/

attrs=countries::step_name=pre_processed::version=202310101716::file_name=countries.csv

This call exports a metadata_viz folder next to the data you saved. The metadata it displays is stored in the metadata.json file.

To display it, you need both the metadata.json file and the metadata_viz folder on your local machine (download them if you are working on a server).

Then open the HTML file from your file explorer. It should open in your browser and let you upload the metadata.json file.

Data Organization

Format

Data folder organization is systematic and is used by the load and save functions. It follows this format: root_data_folder/attrs_1/attrs_2/…/attrs_n/step_name/version/file_name

where:

  • root_data_folder: is the path to the root of your data folder, and is not exported in the metadata
  • attrs: information to classify your dataset (e.g. country, client, …)
  • step_name: name of the step. always starts with step_
  • version: version of the step. always starts with v_
  • file_name: name of the file. can be anything
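The layout above can be reproduced with a few lines of pathlib. This is a sketch for illustration only: data_path is a hypothetical helper, not a stdflow function, and it assumes attrs may be given either as a list or as a slash-separated string, as in the earlier examples.

```python
from pathlib import Path


def data_path(root, attrs, step, version, file_name) -> Path:
    """Hypothetical helper: build root/attrs.../step_<step>/v_<version>/file_name."""
    if isinstance(attrs, str):
        # Accept 'countries/' or 'a/b' and drop empty segments.
        attrs = [a for a in attrs.split("/") if a]
    return Path(root, *attrs, f"step_{step}", f"v_{version}", file_name)


p = data_path("../demo_project/data/", ["countries"], "loaded", "2023-03", "countries.csv")
print(p.as_posix())
# ../demo_project/data/countries/step_loaded/v_2023-03/countries.csv
```

The result has the same shape as the paths in the log output shown earlier, for example ../demo_project/data/countries/step_formatted/v_202310101716/countries.csv.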

Each folder is the output of a step. It contains a metadata.json file with information about all files in the folder and how they were generated. It can also contain an HTML page (if you set export_viz_tool=True in save(), as in the Data visualization example) that lets you visualize the pipeline and your metadata.

Best Practices:

  • Do not use sf.reset() in your final code.
  • In one step, export to only one path (apart from the version), i.e. one combination of attrs and step_name per step.
  • Do not create sub-directories within the export (the version folder is the deepest level). If you need the same operation for different datasets, create pipelines.
