Technical Debt: Pass 1 (#136)
* black + basiccsv
* Fix test_dummy
* Fix drycal
* Tests pass or xfail
* main & subcommands tidy-up
* Reorganise extractors
* purge zoneinfo
* init py
* extractor.filetype is the filetype.
* Un-xfail tests.
* black
* Flake
* Documentation tidy-up.
PeterKraus authored Mar 27, 2024
1 parent 1b2c7f8 commit b19788c
Showing 89 changed files with 807 additions and 503 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -6,5 +6,5 @@ dist/
 *.egg-info
 *.egg
 build/
-public/
+/public/
 docs/source/apidoc/
3 changes: 2 additions & 1 deletion docs/source/conf.py
@@ -20,7 +20,7 @@
 # -- Project information -----------------------------------------------------

 project = "yadg"
-copyright = "2021 - 2023, yadg authors"
+copyright = "2021 - 2024, yadg authors"
 author = "Peter Kraus"
 release = version

@@ -39,6 +39,7 @@
     "sphinx_autodoc_typehints",
     "sphinx_rtd_theme",
     "sphinxcontrib.autodoc_pydantic",
+    "sphinxcontrib.mermaid",
 ]

 # Add any paths that contain templates here, relative to this directory.
16 changes: 10 additions & 6 deletions docs/source/extractors.rst
@@ -2,10 +2,14 @@
    :maxdepth: 1
    :caption: yadg extractors
    :hidden:
+   :glob:

-   apidoc/yadg.extractors.agilentch
-   apidoc/yadg.extractors.agilentdx
-   apidoc/yadg.extractors.eclabmpr
-   apidoc/yadg.extractors.eclabmpt
-   apidoc/yadg.extractors.panalyticalxrdml
-   apidoc/yadg.extractors.phispe
+   apidoc/yadg.extractors.public.*
+
+.. toctree::
+   :maxdepth: 1
+   :caption: yadg custom extractors
+   :hidden:
+   :glob:
+
+   apidoc/yadg.extractors.custom.*
17 changes: 3 additions & 14 deletions docs/source/index.rst
@@ -7,20 +7,11 @@
 .. image:: https://badgen.net/github/tag/dgbowl/yadg/?icon=github
    :target: https://github.com/dgbowl/yadg

-**yadg** is a set of tools and parsers aimed to process raw instrument data. Given an experiment represented by a `dataschema`, **yadg** will process the files and folders specified in this `dataschema`, and produce a `datagram`, which is a unified data structure containing all measured ("raw") data in a given experiment. The `parsers` available in **yadg** are shown in the sidebar. As of ``yadg-5.0``, the `datagram` is stored as a |NetCDF|_ file. The produced `datagram` is associated with full provenance info, and the data within the `datagram` contain instrumental error estimates and are annotated with units. You can read more about **yadg** in our paper: [Kraus2022b]_.
+**yadg** is a set of tools and parsers aimed to :ref:`extract<extractor mode>` and standardise data from raw files generated by scientific instruments. The supported types of files that can be extracted are listed in the sidebar. The data (or metadata) extracted from the supplied file is returned as a :class:`xarray.Dataset` or a |NetCDF|_ file.

+For extracting and combining data from multiple files, **yadg** can be used to :ref:`process<parser mode>` a special configuration file called :mod:`~dgbowl_schemas.yadg.dataschema`. The combined data is returned as a :class:`datatree.DataTree` or a |NetCDF|_ file. This allows reproducible processing of structured experimental data, and takes care of issues such as timezone resolution, unit annotation, uncertainty determination, and keeps track of provenance.

-.. image:: images/schema_yadg_datagram.png
-   :width: 600
-   :alt: yadg is used to process raw data files using a datadchema into a NetCDF datagram.
-
-
-Some of the **yadg** parsers are exposed via an `extractor` interface, allowing the user to extract (meta)-data from individual files without requiring a `dataschema`. Several file formats are supported, as shown in the sidebar. You can read more about this `extractor` interface on the |marda_extractors|_ website, as well as in the :ref:`Usage: Extractor mode<extractor mode>` section of this documentation.
-
-.. warning::
-
-   All of the post-processing features within **yadg** have been removed in ``yadg-5.0``, following their deprecation in ``yadg-4.2``. If you are looking for a post-processing library, have a look at |dgpost|_ instead.
-
+For more details about **yadg** usage, see :ref:`the usage instructions<usage>`. You can read more about **yadg** in our paper: [Kraus2022b]_.

 Contributors
 ````````````
@@ -46,8 +37,6 @@ The project is also part of BATTERY 2030+, the large-scale European research initiative
    features
    citing

-.. include:: parsers.rst
-
 .. include:: extractors.rst

 .. toctree::
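
To make the two usage modes in the new landing-page text concrete, here is a minimal sketch of extractor mode (an editor's illustration, not part of the commit: the yadg.extractors.extract helper, its signature, and the "eclab.mpr" filetype string are assumptions based on the yadg-5.x documentation):

    # Extractor mode: pull (meta)data from a single raw file, no dataschema needed.
    from yadg.extractors import extract

    ds = extract(filetype="eclab.mpr", path="data/experiment.mpr")
    print(ds)  # an xarray.Dataset with units and uncertainty annotations
    ds.to_netcdf("experiment.nc", engine="h5netcdf")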
16 changes: 0 additions & 16 deletions docs/source/parsers.rst

This file was deleted.

2 changes: 2 additions & 0 deletions docs/source/usage.rst
@@ -1,3 +1,5 @@
+.. _usage:
+
 How to use **yadg**
 ===================
 We have prepared an interactive, Binder-compatible Jupyter notebook, showing the installation and example usage of **yadg**. The latest version of the notebook and the direct link to Binder are:
22 changes: 22 additions & 0 deletions docs/source/version.5_1.rst
@@ -0,0 +1,22 @@
**yadg** version 5.1
``````````````````````
.. image:: https://img.shields.io/static/v1?label=yadg&message=v5.1&color=blue&logo=github
:target: https://github.com/PeterKraus/yadg/tree/5.1
.. image:: https://img.shields.io/static/v1?label=yadg&message=v5.1&color=blue&logo=pypi
:target: https://pypi.org/project/yadg/5.1/
.. image:: https://img.shields.io/static/v1?label=release%20date&message=2024-XX-YY&color=red&logo=pypi


Developed in the |concat_lab|_ at Technische Universität Berlin (Berlin, DE).

New features since ``yadg-5.0`` are:

Other changes in ``yadg-5.1`` are:

- The dataschema has been simplified, eliminating parsers in favour of extractors.
- The code has been reorganised to highlight the extractor functionality in favour of parsers.


.. _concat_lab: https://tu.berlin/en/concat

.. |concat_lab| replace:: ConCat Lab
2 changes: 2 additions & 0 deletions docs/source/version.rst
@@ -1,6 +1,8 @@
 **yadg** version history
 ------------------------

+.. include:: version.5_1.rst
+
 .. include:: version.5_0.rst

 .. include:: version.4_2.rst
6 changes: 5 additions & 1 deletion setup.cfg
@@ -4,4 +4,8 @@ style = pep440-pre
 versionfile_source = src/yadg/_version.py
 versionfile_build = yadg/_version.py
 tag_prefix =
-parentdir_prefix = yadg-
+parentdir_prefix = yadg-
+
+[flake8]
+max-line-length = 88
+extend-ignore = E203
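
For context: the new [flake8] section matches black's 88-character line length, and E203 is ignored because black's slice formatting would otherwise trigger it. A minimal illustration (editor's note, not from the commit):

    # black formats complex slices with a space before ":", which flake8
    # reports as E203 ("whitespace before ':'") unless the check is ignored.
    chunk = data[offset + 1 :]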
3 changes: 2 additions & 1 deletion setup.py
@@ -45,7 +45,7 @@
         "openpyxl>=3.0.0",
         "h5netcdf~=1.0",
         "xarray-datatree==0.0.12",
-        "dgbowl-schemas>=116",
+        "dgbowl-schemas @ git+https://github.com/dgbowl/dgbowl-schemas.git@dataschema_5.1",
         "requests",
     ],
     extras_require={
@@ -55,6 +55,7 @@
             "sphinx-rtd-theme~=1.3.0",
             "sphinx-autodoc-typehints < 1.20.0",
             "autodoc-pydantic>=2.0.0",
+            "sphinxcontrib-mermaid~=0.9.2",
         ],
     },
     entry_points={"console_scripts": ["yadg=yadg:run_with_arguments"]},
167 changes: 167 additions & 0 deletions src/yadg/core.py
@@ -0,0 +1,167 @@
from importlib import metadata
import logging
import importlib
import xarray as xr
import numpy as np
from typing import Callable
from datatree import DataTree
from xarray import Dataset
from pydantic import BaseModel

from dgbowl_schemas.yadg.dataschema import DataSchema
from yadg import dgutils

datagram_version = metadata.version("yadg")
logger = logging.getLogger(__name__)


def infer_extractor(extractor: str) -> Callable:
    """
    A function that finds an :func:`extract` function of the supplied ``extractor``.
    """
    modnames = [
        f"yadg.extractors.public.{extractor}",
        f"yadg.extractors.custom.{extractor}",
        f"yadg.extractors.{extractor.replace('.','')}",
    ]
    for modname in modnames:
        try:
            m = importlib.import_module(modname)
            if hasattr(m, "extract"):
                return getattr(m, "extract")
        except ImportError:
            logger.critical(f"could not import module '{modname}'")
    raise RuntimeError
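
# Editor's note (illustrative, not part of the commit): for a filetype such
# as "eclab.mpr", the lookup above would try, in order,
#     yadg.extractors.public.eclab.mpr,
#     yadg.extractors.custom.eclab.mpr,
#     yadg.extractors.eclabmpr   (the legacy flat name, with dots stripped),
# returning the first extract() function it finds.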


def process_schema(dataschema: DataSchema, strict_merge: bool = False) -> DataTree:
    """
    The main processing function of yadg.
    Takes in a :class:`DataSchema` object and returns a single :class:`DataTree` created
    from the :class:`DataSchema`.
    """
    if strict_merge:
        concatmode = "identical"
    else:
        concatmode = "drop_conflicts"

    while hasattr(dataschema, "update"):
        dataschema = dataschema.update()

    root = DataTree()
    root.attrs = {
        "provenance": "yadg process",
        "date": dgutils.now(asstr=True),
        "input_schema": dataschema.model_dump_json(),
        "datagram_version": datagram_version,
    }
    root.attrs.update(dgutils.get_yadg_metadata())

    for si, step in enumerate(dataschema.steps):
        logger.info(f"Processing step {si}.")

        # Backfill default timezone, locale, encoding.
        if step.extractor.timezone is None:
            step.extractor.timezone = dataschema.step_defaults.timezone

        if step.extractor.locale is None:
            step.extractor.locale = dataschema.step_defaults.locale
        if step.extractor.encoding is None:
            step.extractor.encoding = dataschema.step_defaults.encoding

        sattrs = {"extractor_schema": step.extractor.model_dump_json(exclude_none=True)}

        if step.tag is None:
            step.tag = f"{si}"

        handler = infer_extractor(step.extractor.filetype)
        todofiles = step.input.paths()
        vals = None
        if len(todofiles) == 0:
            logger.warning(f"No files processed by step '{step.tag}'.")
            vals = {}
        for tf in todofiles:
            logger.info(f"Processing file '{tf}'.")
            ret = handler(fn=tf, **vars(step.extractor))
            if isinstance(ret, DataTree):
                tasks = ret.to_dict()
            elif isinstance(ret, Dataset):
                tasks = {"/": ret}
            else:
                raise RuntimeError(type(ret))
            fvals = {}
            for name, dset in tasks.items():
                if name == "/" and len(dset.variables) == 0:
                    # The root datatree node may sometimes carry metadata, even if
                    # there are no variables - we don't add 'uts' to those.
                    fvals[name] = dset
                else:
                    fvals[name] = complete_uts(
                        dset, tf, step.externaldate, step.extractor.timezone
                    )
            vals = merge_dicttrees(vals, fvals, concatmode)

        stepdt = DataTree.from_dict({} if vals is None else vals)
        stepdt.name = step.tag
        stepdt.attrs = sattrs
        stepdt.parent = root
    return root


def complete_uts(
    ds: Dataset,
    filename: str,
    externaldate: BaseModel,
    timezone: str,
) -> Dataset:
    """
    A helper function ensuring that the Dataset ``ds`` contains a dimension ``"uts"``,
    and that the timestamps in ``"uts"`` are completed as instructed in the
    ``externaldate`` specification.
    """
    if not hasattr(ds, "uts"):
        ds = ds.expand_dims("uts")
    if len(ds.uts.coords) == 0:
        ds["uts"] = np.zeros(ds.uts.size)
        ds.attrs["fulldate"] = False
    if not ds.attrs.get("fulldate", True) or externaldate is not None:
        ts, fulldate = dgutils.complete_timestamps(
            timesteps=ds.uts.values,
            fn=filename,
            spec=externaldate,
            timezone=timezone,
        )
        ds["uts"] = ts
        if fulldate:
            ds.attrs.pop("fulldate", None)
        else:
            # cannot store booleans in NetCDF files
            ds.attrs["fulldate"] = int(fulldate)

    return ds
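
# Editor's note (illustrative, not part of the commit): extractors flag
# incomplete timestamps (e.g. a file containing only times of day) by setting
# ds.attrs["fulldate"] = False; complete_timestamps() then fills in the
# missing information using the externaldate specification, which may point
# to e.g. a date encoded in the filename.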


def merge_dicttrees(vals: dict, fvals: dict, mode: str) -> dict:
    """
    A helper function that merges two ``DataTree.to_dict()`` objects by concatenating
    the new values in ``fvals`` to the existing ones in ``vals``.
    """
    if vals is None:
        return fvals
    for k in fvals.keys():
        try:
            vals[k] = xr.concat([vals[k], fvals[k]], dim="uts", combine_attrs=mode)
        except xr.MergeError:
            raise RuntimeError(
                "Merging metadata from multiple files has failed, as some of the "
                "values differ between files. This might be caused by trying to "
                "parse data obtained using different techniques/protocols in a "
                "single step. If you are certain this is what you want, try using "
                "yadg with the '--ignore-merge-errors' option."
            )
    return vals
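
The new process_schema() entry point can also be driven from Python rather than via the yadg CLI. A hedged sketch (editor's illustration, not part of the commit; the to_dataschema helper from the dgbowl-schemas package and the "schema.json" file are assumptions):

    import json

    from dgbowl_schemas.yadg import to_dataschema
    from yadg.core import process_schema

    with open("schema.json") as f:
        schema = to_dataschema(**json.load(f))

    tree = process_schema(schema, strict_merge=False)  # returns a DataTree
    tree.to_netcdf("datagram.nc")  # same NetCDF datagram the CLI would write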
