Technical Debt: Pass 1 (#136)
* black + basiccsv
* Fix test_dummy
* Fix drycal
* Tests pass or xfail
* main & subcommands tidy-up
* Reorganise extractors
* purge zoneinfo
* init py
* extractor.filetype is the filetype.
* Un-xfail tests.
* black
* Flake
* Documentation tidy-up.
PeterKraus authored Mar 27, 2024
1 parent 1b2c7f8 commit b19788c
Showing 89 changed files with 807 additions and 503 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -6,5 +6,5 @@ dist/
 *.egg-info
 *.egg
 build/
-public/
+/public/
 docs/source/apidoc/
3 changes: 2 additions & 1 deletion docs/source/conf.py
@@ -20,7 +20,7 @@
 # -- Project information -----------------------------------------------------

 project = "yadg"
-copyright = "2021 - 2023, yadg authors"
+copyright = "2021 - 2024, yadg authors"
 author = "Peter Kraus"
 release = version

@@ -39,6 +39,7 @@
     "sphinx_autodoc_typehints",
     "sphinx_rtd_theme",
     "sphinxcontrib.autodoc_pydantic",
+    "sphinxcontrib.mermaid",
 ]

 # Add any paths that contain templates here, relative to this directory.
16 changes: 10 additions & 6 deletions docs/source/extractors.rst
@@ -2,10 +2,14 @@
    :maxdepth: 1
    :caption: yadg extractors
    :hidden:
+   :glob:

-   apidoc/yadg.extractors.agilentch
-   apidoc/yadg.extractors.agilentdx
-   apidoc/yadg.extractors.eclabmpr
-   apidoc/yadg.extractors.eclabmpt
-   apidoc/yadg.extractors.panalyticalxrdml
-   apidoc/yadg.extractors.phispe
+   apidoc/yadg.extractors.public.*
+
+.. toctree::
+   :maxdepth: 1
+   :caption: yadg custom extractors
+   :hidden:
+   :glob:
+
+   apidoc/yadg.extractors.custom.*
17 changes: 3 additions & 14 deletions docs/source/index.rst
@@ -7,20 +7,11 @@
 .. image:: https://badgen.net/github/tag/dgbowl/yadg/?icon=github
    :target: https://github.com/dgbowl/yadg

-**yadg** is a set of tools and parsers aimed to process raw instrument data. Given an experiment represented by a `dataschema`, **yadg** will process the files and folders specified in this `dataschema`, and produce a `datagram`, which is a unified data structure containing all measured ("raw") data in a given experiment. The `parsers` available in **yadg** are shown in the sidebar. As of ``yadg-5.0``, the `datagram` is stored as a |NetCDF|_ file. The produced `datagram` is associated with full provenance info, and the data within the `datagram` contain instrumental error estimates and are annotated with units. You can read more about **yadg** in our paper: [Kraus2022b]_.
+**yadg** is a set of tools and parsers aimed to :ref:`extract<extractor mode>` and standardise data from raw files generated by scientific instruments. The supported types of files that can be extracted are listed in the sidebar. The data (or metadata) extracted from the supplied file is returned as a :class:`xarray.Dataset` or a |NetCDF|_ file.

+For extracting and combining data from multiple files, **yadg** can be used to :ref:`process<parser mode>` a special configuration file called :mod:`~dgbowl_schemas.yadg.dataschema`. The combined data is returned as a :class:`datatree.DataTree` or a |NetCDF|_ file. This allows reproducible processing of structured experimental data, and takes care of issues such as timezone resolution, unit annotation, uncertainty determination, and keeps track of provenance.

-.. image:: images/schema_yadg_datagram.png
-   :width: 600
-   :alt: yadg is used to process raw data files using a datadchema into a NetCDF datagram.
-
-
-Some of the **yadg** parsers are exposed via an `extractor` interface, allowing the user to extract (meta)-data from individual files without requiring a `dataschema`. Several file formats are supported, as shown in the sidebar. You can read more about this `extractor` interface on the |marda_extractors|_ website, as well as in the :ref:`Usage: Extractor mode<extractor mode>` section of this documentation.
-
-.. warning::
-
-   All of the post-processing features within **yadg** have been removed in ``yadg-5.0``, following their deprecation in ``yadg-4.2``. If you are looking for a post-processing library, have a look at |dgpost|_ instead.
-
+For more details about **yadg** usage, see :ref:`the usage instructions<usage>`. You can read more about **yadg** in our paper: [Kraus2022b]_.

 Contributors
 ````````````
@@ -46,8 +37,6 @@ The project is also part of BATTERY 2030+, the large-scale European research initiative
    features
    citing

-.. include:: parsers.rst
-
 .. include:: extractors.rst

 .. toctree::
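
To make the two usage modes in the new landing-page text concrete, here is a minimal sketch of extractor mode (an editor's illustration, not part of the commit: the yadg.extractors.extract helper, its signature, and the "eclab.mpr" filetype string are assumptions based on the yadg-5.x documentation):

    # Extractor mode: pull (meta)data from a single raw file, no dataschema needed.
    from yadg.extractors import extract

    ds = extract(filetype="eclab.mpr", path="data/experiment.mpr")
    print(ds)  # an xarray.Dataset with units and uncertainty annotations
    ds.to_netcdf("experiment.nc", engine="h5netcdf")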
16 changes: 0 additions & 16 deletions docs/source/parsers.rst

This file was deleted.

2 changes: 2 additions & 0 deletions docs/source/usage.rst
@@ -1,3 +1,5 @@
+.. _usage:
+
 How to use **yadg**
 ===================
 We have prepared an interactive, Binder-compatible Jupyter notebook, showing the installation and example usage of **yadg**. The latest version of the notebook and the direct link to Binder are:
22 changes: 22 additions & 0 deletions docs/source/version.5_1.rst
@@ -0,0 +1,22 @@
**yadg** version 5.1
``````````````````````
.. image:: https://img.shields.io/static/v1?label=yadg&message=v5.1&color=blue&logo=github
:target: https://github.com/PeterKraus/yadg/tree/5.1
.. image:: https://img.shields.io/static/v1?label=yadg&message=v5.1&color=blue&logo=pypi
:target: https://pypi.org/project/yadg/5.1/
.. image:: https://img.shields.io/static/v1?label=release%20date&message=2024-XX-YY&color=red&logo=pypi


Developed in the |concat_lab|_ at Technische Universität Berlin (Berlin, DE).

New features since ``yadg-5.0`` are:

Other changes in ``yadg-5.1`` are:

- The dataschema has been simplified, eliminating parsers in favour of extractors.
- The code has been reorganised to highlight the extractor functionality in favour of parsers.


.. _concat_lab: https://tu.berlin/en/concat

.. |concat_lab| replace:: ConCat Lab
2 changes: 2 additions & 0 deletions docs/source/version.rst
@@ -1,6 +1,8 @@
 **yadg** version history
 ------------------------

+.. include:: version.5_1.rst
+
 .. include:: version.5_0.rst

 .. include:: version.4_2.rst
6 changes: 5 additions & 1 deletion setup.cfg
@@ -4,4 +4,8 @@ style = pep440-pre
 versionfile_source = src/yadg/_version.py
 versionfile_build = yadg/_version.py
 tag_prefix =
-parentdir_prefix = yadg-
+parentdir_prefix = yadg-
+
+[flake8]
+max-line-length = 88
+extend-ignore = E203
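
For context: the new [flake8] section matches black's 88-character line length, and E203 is ignored because black's slice formatting would otherwise trigger it. A minimal illustration (editor's note, not from the commit):

    # black formats complex slices with a space before ":", which flake8
    # reports as E203 ("whitespace before ':'") unless the check is ignored.
    chunk = data[offset + 1 :]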
3 changes: 2 additions & 1 deletion setup.py
@@ -45,7 +45,7 @@
         "openpyxl>=3.0.0",
         "h5netcdf~=1.0",
         "xarray-datatree==0.0.12",
-        "dgbowl-schemas>=116",
+        "dgbowl-schemas @ git+https://github.com/dgbowl/dgbowl-schemas.git@dataschema_5.1",
         "requests",
     ],
     extras_require={
@@ -55,6 +55,7 @@
             "sphinx-rtd-theme~=1.3.0",
             "sphinx-autodoc-typehints < 1.20.0",
             "autodoc-pydantic>=2.0.0",
+            "sphinxcontrib-mermaid~=0.9.2",
         ],
     },
     entry_points={"console_scripts": ["yadg=yadg:run_with_arguments"]},
167 changes: 167 additions & 0 deletions src/yadg/core.py
@@ -0,0 +1,167 @@
from importlib import metadata
import logging
import importlib
import xarray as xr
import numpy as np
from typing import Callable
from datatree import DataTree
from xarray import Dataset
from pydantic import BaseModel

from dgbowl_schemas.yadg.dataschema import DataSchema
from yadg import dgutils

datagram_version = metadata.version("yadg")
logger = logging.getLogger(__name__)


def infer_extractor(extractor: str) -> Callable:
    """
    A function that finds an :func:`extract` function of the supplied ``extractor``.
    """
    modnames = [
        f"yadg.extractors.public.{extractor}",
        f"yadg.extractors.custom.{extractor}",
        f"yadg.extractors.{extractor.replace('.','')}",
    ]
    for modname in modnames:
        try:
            m = importlib.import_module(modname)
            if hasattr(m, "extract"):
                return getattr(m, "extract")
        except ImportError:
            logger.critical(f"could not import module '{modname}'")
    raise RuntimeError
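
# Editor's note (illustrative, not part of the commit): for a filetype such
# as "eclab.mpr", the lookup above would try, in order,
#     yadg.extractors.public.eclab.mpr,
#     yadg.extractors.custom.eclab.mpr,
#     yadg.extractors.eclabmpr   (the legacy flat name, with dots stripped),
# returning the first extract() function it finds.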


def process_schema(dataschema: DataSchema, strict_merge: bool = False) -> DataTree:
    """
    The main processing function of yadg.
    Takes in a :class:`DataSchema` object and returns a single :class:`DataTree` created
    from the :class:`DataSchema`.
    """
    if strict_merge:
        concatmode = "identical"
    else:
        concatmode = "drop_conflicts"

    while hasattr(dataschema, "update"):
        dataschema = dataschema.update()

    root = DataTree()
    root.attrs = {
        "provenance": "yadg process",
        "date": dgutils.now(asstr=True),
        "input_schema": dataschema.model_dump_json(),
        "datagram_version": datagram_version,
    }
    root.attrs.update(dgutils.get_yadg_metadata())

    for si, step in enumerate(dataschema.steps):
        logger.info(f"Processing step {si}.")

        # Backfill default timezone, locale, encoding.
        if step.extractor.timezone is None:
            step.extractor.timezone = dataschema.step_defaults.timezone

        if step.extractor.locale is None:
            step.extractor.locale = dataschema.step_defaults.locale
        if step.extractor.encoding is None:
            step.extractor.encoding = dataschema.step_defaults.encoding

        sattrs = {"extractor_schema": step.extractor.model_dump_json(exclude_none=True)}

        if step.tag is None:
            step.tag = f"{si}"

        handler = infer_extractor(step.extractor.filetype)
        todofiles = step.input.paths()
        vals = None
        if len(todofiles) == 0:
            logger.warning(f"No files processed by step '{step.tag}'.")
            vals = {}
        for tf in todofiles:
            logger.info(f"Processing file '{tf}'.")
            ret = handler(fn=tf, **vars(step.extractor))
            if isinstance(ret, DataTree):
                tasks = ret.to_dict()
            elif isinstance(ret, Dataset):
                tasks = {"/": ret}
            else:
                raise RuntimeError(type(ret))
            fvals = {}
            for name, dset in tasks.items():
                if name == "/" and len(dset.variables) == 0:
                    # The root datatree node may sometimes carry metadata, even if
                    # there are no variables - we don't add 'uts' to those.
                    fvals[name] = dset
                else:
                    fvals[name] = complete_uts(
                        dset, tf, step.externaldate, step.extractor.timezone
                    )
            vals = merge_dicttrees(vals, fvals, concatmode)

        stepdt = DataTree.from_dict({} if vals is None else vals)
        stepdt.name = step.tag
        stepdt.attrs = sattrs
        stepdt.parent = root
    return root


def complete_uts(
    ds: Dataset,
    filename: str,
    externaldate: BaseModel,
    timezone: str,
) -> Dataset:
    """
    A helper function ensuring that the Dataset ``ds`` contains a dimension ``"uts"``,
    and that the timestamps in ``"uts"`` are completed as instructed in the
    ``externaldate`` specification.
    """
    if not hasattr(ds, "uts"):
        ds = ds.expand_dims("uts")
    if len(ds.uts.coords) == 0:
        ds["uts"] = np.zeros(ds.uts.size)
        ds.attrs["fulldate"] = False
    if not ds.attrs.get("fulldate", True) or externaldate is not None:
        ts, fulldate = dgutils.complete_timestamps(
            timesteps=ds.uts.values,
            fn=filename,
            spec=externaldate,
            timezone=timezone,
        )
        ds["uts"] = ts
        if fulldate:
            ds.attrs.pop("fulldate", None)
        else:
            # cannot store booleans in NetCDF files
            ds.attrs["fulldate"] = int(fulldate)

    return ds
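
# Editor's note (illustrative, not part of the commit): extractors flag
# incomplete timestamps (e.g. a file containing only times of day) by setting
# ds.attrs["fulldate"] = False; complete_timestamps() then fills in the
# missing information using the externaldate specification, which may point
# to e.g. a date encoded in the filename.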


def merge_dicttrees(vals: dict, fvals: dict, mode: str) -> dict:
    """
    A helper function that merges two ``DataTree.to_dict()`` objects by concatenating
    the new values in ``fvals`` to the existing ones in ``vals``.
    """
    if vals is None:
        return fvals
    for k in fvals.keys():
        try:
            vals[k] = xr.concat([vals[k], fvals[k]], dim="uts", combine_attrs=mode)
        except xr.MergeError:
            raise RuntimeError(
                "Merging metadata from multiple files has failed, as some of the "
                "values differ between files. This might be caused by trying to "
                "parse data obtained using different techniques/protocols in a "
                "single step. If you are certain this is what you want, try using "
                "yadg with the '--ignore-merge-errors' option."
            )
    return vals
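
The new process_schema() entry point can also be driven from Python rather than via the yadg CLI. A hedged sketch (editor's illustration, not part of the commit; the to_dataschema helper from the dgbowl-schemas package and the "schema.json" file are assumptions):

    import json

    from dgbowl_schemas.yadg import to_dataschema
    from yadg.core import process_schema

    with open("schema.json") as f:
        schema = to_dataschema(**json.load(f))

    tree = process_schema(schema, strict_merge=False)  # returns a DataTree
    tree.to_netcdf("datagram.nc")  # same NetCDF datagram the CLI would write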
