
Commit

Add datatree, example from proper json.
PeterKraus committed Oct 15, 2023
1 parent 483cf3c commit 777162c
Showing 8 changed files with 61 additions and 45 deletions.
1 change: 1 addition & 0 deletions docs/source/conf.py
@@ -86,4 +86,5 @@
intersphinx_mapping = {
    "dgbowl_schemas": ("https://dgbowl.github.io/dgbowl-schemas/master", None),
    "xarray": ("https://docs.xarray.dev/en/stable", None),
    "datatree": ("https://xarray-datatree.readthedocs.io/en/latest/", None),
}
26 changes: 26 additions & 0 deletions docs/source/dataschema.json
@@ -0,0 +1,26 @@
{
  "metadata": {
    "version": "5.0",
    "provenance": {"type": "manual"}
  },
  "steps": [
    {
      "tag": "flow",
      "parser": "basiccsv",
      "input": {"files": ["foo.csv"]},
      "extractor": {"filetype": "None"},
      "parameters": {"sep": ","}
    },
    {
      "parser": "basiccsv",
      "input": {"files": ["bar.csv"]},
      "extractor": {"filetype": "None"},
      "parameters": {"sep": ","}
    },
    {
      "parser": "chromtrace",
      "input": {"files": ["./GC/"]},
      "extractor": {"filetype": "fusion.json"}
    }
  ]
}
12 changes: 6 additions & 6 deletions docs/source/index.rst
@@ -7,13 +7,15 @@
.. image:: https://badgen.net/github/tag/dgbowl/yadg/?icon=github
:target: https://github.com/dgbowl/yadg

**yadg** is a set of tools and parsers aimed to process raw instrument data. Given an experiment represented by a `dataschema`, **yadg** will process the files and folders specified in this `dataschema`, and produce a `datagram` -- a unified data structure containing all measured ("raw") data in a given experiment. As of ``yadg-5.0``, the `datagram` is stored as a |NetCDF|_ file. The produced `datagram` is associated with full provenance info, and the data within the `datagram` contain instrumental error estimates and are annotated with units. You can read more about **yadg** in our paper: [Kraus2022b]_.
**yadg** is a set of tools and parsers aimed to process raw instrument data. Given an experiment represented by a `dataschema`, **yadg** will process the files and folders specified in this `dataschema`, and produce a `datagram`, which is a unified data structure containing all measured ("raw") data in a given experiment. The `parsers` available in **yadg** are shown in the sidebar. As of ``yadg-5.0``, the `datagram` is stored as a |NetCDF|_ file. The produced `datagram` is associated with full provenance info, and the data within the `datagram` contain instrumental error estimates and are annotated with units. You can read more about **yadg** in our paper: [Kraus2022b]_.


.. image:: images/schema_yadg_datagram.png
:width: 600
:alt: yadg is used to process raw data files using a dataschema into a NetCDF datagram.

Several of the **yadg** parsers are exposed via an `extractor` interface, allowing the user to extract (meta)-data from individual files without requiring a `dataschema`. Several file formats are supported. You can read more about this `extractor` interface on the |marda_extractors|_ website, as well as in the :ref:`Usage: Extractor mode<extractor mode>` section of this documentation.

Some of the **yadg** parsers are exposed via an `extractor` interface, allowing the user to extract (meta)-data from individual files without requiring a `dataschema`. Several file formats are supported, as shown in the sidebar. You can read more about this `extractor` interface on the |marda_extractors|_ website, as well as in the :ref:`Usage: Extractor mode<extractor mode>` section of this documentation.

.. warning::

@@ -22,13 +24,11 @@ Several of the **yadg** parsers are exposed via an `extractor` interface, allowi

Contributors
````````````

- Peter Kraus
- Nicolas Vetsch
- `Peter Kraus <https://github.com/PeterKraus>`_
- `Nicolas Vetsch <https://github.com/vetschn>`_

Acknowledgements
````````````````

This project has received funding from the following sources:

- European Union’s Horizon 2020 programme under grant agreement ID `957189 <https://cordis.europa.eu/project/id/957189>`_.
12 changes: 9 additions & 3 deletions docs/source/object.datagram.rst
@@ -15,7 +15,7 @@ Additionally, the `datagram` is annotated by relevant metadata, including:
- clear provenance of the data;
- uniform data timestamping within and between all `datagrams`.

As of ``yadg-5.0``, the `datagram` is exported as a ``NetCDF`` file. In memory, it is represented by a :class:`datatree.DataTree`, with individual `steps` as nodes of that :class:`datatree.DataTree` containing a :class:`xarray.Dataset`.
As of ``yadg-5.0``, the `datagram` is exported as a |NetCDF|_ file. In memory, it is represented by a :class:`datatree.DataTree`, with individual `steps` as nodes of that :class:`datatree.DataTree` containing a :class:`xarray.Dataset`.

The top level :class:`datatree.DataTree` contains the following metadata stored in its attributes:

@@ -27,10 +27,16 @@
The contents of the attribute fields for each `step` will vary depending on the parser used to create the corresponding :class:`xarray.Dataset`. The following conventions are used:

- a `coord` field ``uts`` contains a Unix timestamp (:class:`float`),
- uncertainties for entries are stored using separate entries with names composed as ``f"{entry}_std_err``
- uncertainties for `data_vars` are stored using separate entries with names composed as ``f"{entry}_std_err"``

  - the parent ``f"{entry}"`` points to its uncertainty via the ``ancillary_variables`` field,
  - the uncertainty links back to ``f"{entry}"`` via the ``standard_name`` field.

- the use of spaces (and other whitespace characters) in the names of entries is to be avoided,
- the use of forward slashes (``/``) in the names of entries is not allowed.

This follows the `NetCDF CF Metadata Conventions <https://cfconventions.org/Data/cf-conventions/cf-conventions-1.10/cf-conventions.html>`_; see in particular `Section 3.4 on Ancillary Data <https://cfconventions.org/Data/cf-conventions/cf-conventions-1.10/cf-conventions.html#ancillary-data>`_.
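
A minimal sketch of navigating such a `datagram` in Python, assuming a file ``datagram.nc`` that contains a step tagged ``flow`` with a data variable ``T`` (all three names are hypothetical placeholders):

.. code-block:: python

   from datatree import open_datatree

   dt = open_datatree("datagram.nc")              # top-level datatree.DataTree
   print(dt.attrs)                                # datagram-level metadata

   step = dt["flow"].ds                           # the xarray.Dataset of one step
   temp = step["T"]                               # a data variable, annotated with units
   print(temp.attrs.get("units"))
   print(temp.attrs.get("ancillary_variables"))   # -> "T_std_err"

   sigma = step["T_std_err"]                      # the matching uncertainty entry
   print(sigma.attrs.get("standard_name"))        # links back to "T"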

.. _NetCDF: https://www.unidata.ucar.edu/software/netcdf/

.. |NetCDF| replace:: ``NetCDF``
24 changes: 2 additions & 22 deletions docs/source/object.dataschema.rst
@@ -18,28 +18,8 @@ An example is a simple catalytic test with a temperature ramp. The goal of such

Despite these three devices measuring concurrently, we would have to specify three separate `steps` in the schema to process all relevant output files:

.. code-block:: json

   {
       "metadata": {
           "provenance": {
               "type": "manual"
           },
           "version": "4.1"
       },
       "steps": [{
           "parser": "basiccsv",
           "input": {"files": ["foo.csv"]},
           "tag": "flow",
       },{
           "parser": "basiccsv",
           "input": {"files": ["bar.csv"]}
       },{
           "parser": "chromtrace",
           "input": {"folders": ["./GC/"]},
           "parameters": {"filetype": "fusion.json"}
       }]
   }

.. literalinclude:: dataschema.json
   :language: json
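
Since the `dataschema` is plain ``json``, it can be loaded and inspected with standard tools before handing it over to **yadg**; a minimal sketch in Python:

.. code-block:: python

   import json

   with open("dataschema.json") as inf:
       ds = json.load(inf)

   print(ds["metadata"]["version"])       # -> "5.0"
   for step in ds["steps"]:
       # the "tag" is optional, hence .get(); each step declares a parser and its input files
       print(step.get("tag"), step["parser"], step["input"]["files"])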

.. note::

27 changes: 15 additions & 12 deletions docs/source/usage.rst
@@ -16,16 +16,13 @@ There are two main ways of using **yadg**:

`Extractor` mode
----------------



The option to use **yadg** as an `extractor` comes as a consequence of the `MaRDA Metadata Extractors WG <https://github.com/marda-alliance/metadata_extractors>`_. In this mode, **yadg** can be invoked by providing just the `FileType` and the path to the input file:
In this mode, **yadg** can be invoked by providing just the `FileType` and the path to the input file:

.. code-block:: bash

   yadg extract filetype infile [outfile]

The ``infile`` will then be parsed using **yadg** and, if successful, saved as a |NetCDF|_ file, optionally using the specified ``outfile`` location. The resulting |NetCDF|_ files will contain annotation of provenance (i.e. ``yadg extract``), `filetype` information, and the resolved defaults of `timezone`, `locale`, and `encoding` used to create the NetCDF file.
The ``infile`` will then be parsed using **yadg** and, if successful, saved as a |NetCDF|_ file, optionally using the specified ``outfile`` location. The resulting |NetCDF|_ files will contain annotation of provenance (i.e. ``yadg extract``), `filetype` information, and the resolved defaults of `timezone`, `locale`, and `encoding` used to create the file.

.. warning::

@@ -44,19 +41,21 @@

Metadata-only extraction
````````````````````````
To use **yadg** to extract and retrieve just the metadata contained in the input file, pass the ``-m / --meta-only`` argument:
To use **yadg** to extract and retrieve just the metadata contained in the input file, pass the ``--meta-only`` argument:

.. code-block:: bash

   yadg extract -m filetype infile
   yadg extract --meta-only filetype infile

The metadata are returned as a ``.json`` file, and are generated using the :func:`~xarray.Dataset.to_dict` function of :class:`xarray.Dataset`. They contain a description of the data coordinates (``coords``), dimensions (``dims``), and variables (``data_vars``), and include their names, attributes, dtypes, and shapes.
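
A minimal sketch of inspecting such a metadata file in Python (the file name ``extracted.json`` is a hypothetical placeholder; the layout follows :func:`~xarray.Dataset.to_dict`):

.. code-block:: python

   import json

   with open("extracted.json") as inf:
       meta = json.load(inf)

   print(sorted(meta))                    # e.g. "attrs", "coords", "data_vars", "dims"
   for name, var in meta["data_vars"].items():
       # each variable lists its dims, dtype, shape, and attributes (e.g. units)
       print(name, var["dims"], var["dtype"], var["shape"])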

The list of supported `filetypes` that can be extracted using **yadg** can be found in the left sidebar.
The list of supported `filetypes` that can be extracted using **yadg** can be found in the left sidebar. For more information about the `extractor` concept, see |marda_extractors|_.

.. _parser mode:

`Parser` mode
-------------
The main purpose of yadg is to process a bunch of raw data files according to a provided `dataschema` into a well-defined, annotated, FAIR-data file called `datagram`. As of ``yadg-5.0``, the `datagram` is stored in |NetCDF|_ files. To use **yadg** like this, it should be invoked as follows:
The main purpose of **yadg** is to process a set of raw data files according to a provided `dataschema` into a well-defined, annotated, FAIR-data file called a `datagram`. As of ``yadg-5.0``, the `datagram` is a |NetCDF|_ file. To use **yadg** like this, it should be invoked as follows:

.. code-block:: bash
@@ -67,7 +66,7 @@ Where ``infile`` corresponds to the ``json`` or ``yaml`` file containing the `da
In this fully-featured usage pattern via `dataschema`, **yadg** offloads the responsibility of data extraction and normalisation to its modules, called `parsers`. The currently implemented `parsers` are documented in the sidebar.

`Dataschema` from presets
+++++++++++++++++++++++++
`````````````````````````
This alternative form of using **yadg** in `parser` mode is especially useful when processing data organised in a consistent folder structure between several experimental runs. The user should prepare a `preset` file, which is then patched into a `dataschema` file using the provided folder path:

.. code-block:: bash
@@ -93,8 +92,8 @@ Finally, the raw data files in the processed ``folder`` can be archived, checksu
This will create a `datagram` in ``outfile.json`` as well as a ``outfile.zip`` archive from the whole contents of the specified ``folder``.

`Dataschema` version updater
++++++++++++++++++++++++++++
If you'd like to update a `dataschema` from a previous version of yadg to the current latest one, use the following syntax:
````````````````````````````
If you'd like to update a `dataschema` from a previous version of **yadg** to the latest one, use the following syntax:

.. code-block:: bash
@@ -105,4 +104,8 @@

.. _NetCDF: https://www.unidata.ucar.edu/software/netcdf/

.. _marda_extractors: https://github.com/marda-alliance/metadata_extractors

.. |NetCDF| replace:: ``NetCDF``

.. |marda_extractors| replace:: MaRDA Metadata Extractors WG
2 changes: 1 addition & 1 deletion docs/source/version.3_1.rst
@@ -22,6 +22,6 @@ Major features are:
- ``qftrace``: support for reflection trace measurements, including:

- fitting of quality factor using Lorentzian and naive methods
- fitting of quality factor using Kajfez's circle fitting method [Kajfez1994]_
- fitting of quality factor using Kajfez's circle fitting method

- ``meascsv``: support for in-house MCPT logger for flow and temperature data
2 changes: 1 addition & 1 deletion src/yadg/parsers/chromtrace/__init__.py
@@ -41,7 +41,7 @@
Schema
``````
The data is returned as a :class:`datatree.Datatree`, containing a :class:`xarray.Dataset`
The data is returned as a :class:`datatree.DataTree`, containing a :class:`xarray.Dataset`
for each trace / detector name:
.. code-block:: yaml
