
Commit

Add datatree, example from proper json.
PeterKraus committed Oct 15, 2023
1 parent 483cf3c commit 777162c
Showing 8 changed files with 61 additions and 45 deletions.
1 change: 1 addition & 0 deletions docs/source/conf.py
@@ -86,4 +86,5 @@
intersphinx_mapping = {
    "dgbowl_schemas": ("https://dgbowl.github.io/dgbowl-schemas/master", None),
    "xarray": ("https://docs.xarray.dev/en/stable", None),
    "datatree": ("https://xarray-datatree.readthedocs.io/en/latest/", None),
}
26 changes: 26 additions & 0 deletions docs/source/dataschema.json
@@ -0,0 +1,26 @@
{
  "metadata": {
    "version": "5.0",
    "provenance": {"type": "manual"}
  },
  "steps": [
    {
      "tag": "flow",
      "parser": "basiccsv",
      "input": {"files": ["foo.csv"]},
      "extractor": {"filetype": "None"},
      "parameters": {"sep": ","}
    },
    {
      "parser": "basiccsv",
      "input": {"files": ["bar.csv"]},
      "extractor": {"filetype": "None"},
      "parameters": {"sep": ","}
    },
    {
      "parser": "chromtrace",
      "input": {"files": ["./GC/"]},
      "extractor": {"filetype": "fusion.json"}
    }
  ]
}
12 changes: 6 additions & 6 deletions docs/source/index.rst
@@ -7,13 +7,15 @@
.. image:: https://badgen.net/github/tag/dgbowl/yadg/?icon=github
:target: https://github.com/dgbowl/yadg

**yadg** is a set of tools and parsers aimed to process raw instrument data. Given an experiment represented by a `dataschema`, **yadg** will process the files and folders specified in this `dataschema`, and produce a `datagram` -- a unified data structure containing all measured ("raw") data in a given experiment. As of ``yadg-5.0``, the `datagram` is stored as a |NetCDF|_ file. The produced `datagram` is associated with full provenance info, and the data within the `datagram` contain instrumental error estimates and are annotated with units. You can read more about **yadg** in our paper: [Kraus2022b]_.
**yadg** is a set of tools and parsers aimed to process raw instrument data. Given an experiment represented by a `dataschema`, **yadg** will process the files and folders specified in this `dataschema`, and produce a `datagram`, which is a unified data structure containing all measured ("raw") data in a given experiment. The `parsers` available in **yadg** are shown in the sidebar. As of ``yadg-5.0``, the `datagram` is stored as a |NetCDF|_ file. The produced `datagram` is associated with full provenance info, and the data within the `datagram` contain instrumental error estimates and are annotated with units. You can read more about **yadg** in our paper: [Kraus2022b]_.


.. image:: images/schema_yadg_datagram.png
:width: 600
:alt: yadg is used to process raw data files using a dataschema into a NetCDF datagram.

Several of the **yadg** parsers are exposed via an `extractor` interface, allowing the user to extract (meta)-data from individual files without requiring a `dataschema`. Several file formats are supported. You can read more about this `extractor` interface on the |marda_extractors|_ website, as well as in the :ref:`Usage: Extractor mode<extractor mode>` section of this documentation.

Some of the **yadg** parsers are exposed via an `extractor` interface, allowing the user to extract (meta)-data from individual files without requiring a `dataschema`. Several file formats are supported, as shown in the sidebar. You can read more about this `extractor` interface on the |marda_extractors|_ website, as well as in the :ref:`Usage: Extractor mode<extractor mode>` section of this documentation.

.. warning::

@@ -22,13 +24,11 @@ Several of the **yadg** parsers are exposed via an `extractor` interface, allowi

Contributors
````````````

- Peter Kraus
- Nicolas Vetsch
- `Peter Kraus <https://github.com/PeterKraus>`_
- `Nicolas Vetsch <https://github.com/vetschn>`_

Acknowledgements
````````````````

This project has received funding from the following sources:

- European Union’s Horizon 2020 programme under grant agreement ID `957189 <https://cordis.europa.eu/project/id/957189>`_.
12 changes: 9 additions & 3 deletions docs/source/object.datagram.rst
@@ -15,7 +15,7 @@ Additionally, the `datagram` is annotated by relevant metadata, including:
- clear provenance of the data;
- uniform data timestamping within and between all `datagrams`.

As of ``yadg-5.0``, the `datagram` is exported as a ``NetCDF`` file. In memory, it is represented by a :class:`datatree.DataTree`, with individual `steps` as nodes of that :class:`datatree.DataTree` containing a :class:`xarray.Dataset`.
As of ``yadg-5.0``, the `datagram` is exported as a |NetCDF|_ file. In memory, it is represented by a :class:`datatree.DataTree`, with individual `steps` as nodes of that :class:`datatree.DataTree` containing a :class:`xarray.Dataset`.

The top level :class:`datatree.DataTree` contains the following metadata stored in its attributes:

@@ -27,10 +27,16 @@
The contents of the attribute fields for each `step` will vary depending on the parser used to create the corresponding :class:`xarray.Dataset`. The following conventions are used:

- a `coord` field ``uts`` contains a Unix timestamp (:class:`float`),
- uncertainties for entries are stored using separate entries with names composed as ``f"{entry}_std_err``
- uncertainties for `data_vars` are stored using separate entries with names composed as ``f"{entry}_std_err"``

  - the parent ``f"{entry}"`` points to its uncertainty via the ``ancillary_variables`` field,
  - the uncertainty links back to ``f"{entry}"`` via the ``standard_name`` field.

- the use of spaces (and other whitespace characters) in the names of entries is to be avoided,
- the use of forward slashes (``/``) in the names of entries is not allowed.

This follows the `NetCDF CF Metadata Conventions <https://cfconventions.org/Data/cf-conventions/cf-conventions-1.10/cf-conventions.html>`_; see in particular `Section 3.4 on Ancillary Data <https://cfconventions.org/Data/cf-conventions/cf-conventions-1.10/cf-conventions.html#ancillary-data>`_.
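
A minimal sketch of navigating such a `datagram` in Python, assuming a file ``datagram.nc`` that contains a step tagged ``flow`` with a data variable ``T`` (all three names are hypothetical placeholders):

.. code-block:: python

   from datatree import open_datatree

   dt = open_datatree("datagram.nc")              # top-level datatree.DataTree
   print(dt.attrs)                                # datagram-level metadata

   step = dt["flow"].ds                           # the xarray.Dataset of one step
   temp = step["T"]                               # a data variable, annotated with units
   print(temp.attrs.get("units"))
   print(temp.attrs.get("ancillary_variables"))   # -> "T_std_err"

   sigma = step["T_std_err"]                      # the matching uncertainty entry
   print(sigma.attrs.get("standard_name"))        # links back to "T"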

.. _NetCDF: https://www.unidata.ucar.edu/software/netcdf/

.. |NetCDF| replace:: ``NetCDF``
24 changes: 2 additions & 22 deletions docs/source/object.dataschema.rst
@@ -18,28 +18,8 @@ An example is a simple catalytic test with a temperature ramp. The goal of such

Despite these three devices measuring concurrently, we would have to specify three separate `steps` in the schema to process all relevant output files:

.. code-block:: json

   {
       "metadata": {
           "provenance": {
               "type": "manual"
           },
           "version": "4.1"
       },
       "steps": [{
           "parser": "basiccsv",
           "input": {"files": ["foo.csv"]},
           "tag": "flow",
       },{
           "parser": "basiccsv",
           "input": {"files": ["bar.csv"]}
       },{
           "parser": "chromtrace",
           "input": {"folders": ["./GC/"]},
           "parameters": {"filetype": "fusion.json"}
       }]
   }

.. literalinclude:: dataschema.json
   :language: json
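
Since the `dataschema` is plain ``json``, it can be loaded and inspected with standard tools before handing it over to **yadg**; a minimal sketch in Python:

.. code-block:: python

   import json

   with open("dataschema.json") as inf:
       ds = json.load(inf)

   print(ds["metadata"]["version"])       # -> "5.0"
   for step in ds["steps"]:
       # the "tag" is optional, hence .get(); each step declares a parser and its input files
       print(step.get("tag"), step["parser"], step["input"]["files"])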

.. note::

27 changes: 15 additions & 12 deletions docs/source/usage.rst
@@ -16,16 +16,13 @@ There are two main ways of using **yadg**:

`Extractor` mode
----------------



The option to use **yadg** as an `extractor` comes as a consequence of the `MaRDA Metadata Extractors WG <https://github.com/marda-alliance/metadata_extractors>`_. In this mode, **yadg** can be invoked by providing just the `FileType` and the path to the input file:
In this mode, **yadg** can be invoked by providing just the `FileType` and the path to the input file:

.. code-block:: bash

   yadg extract filetype infile [outfile]

The ``infile`` will then be parsed using **yadg** and, if successful, saved as a |NetCDF|_ file, optionally using the specified ``outfile`` location. The resulting |NetCDF|_ files will contain annotation of provenance (i.e. ``yadg extract``), `filetype` information, and the resolved defaults of `timezone`, `locale`, and `encoding` used to create the NetCDF file.
The ``infile`` will then be parsed using **yadg** and, if successful, saved as a |NetCDF|_ file, optionally using the specified ``outfile`` location. The resulting |NetCDF|_ files will contain annotation of provenance (i.e. ``yadg extract``), `filetype` information, and the resolved defaults of `timezone`, `locale`, and `encoding` used to create the file.

.. warning::

@@ -44,19 +41,21 @@

Metadata-only extraction
````````````````````````
To use **yadg** to extract and retrieve just the metadata contained in the input file, pass the ``-m / --meta-only`` argument:
To use **yadg** to extract and retrieve just the metadata contained in the input file, pass the ``--meta-only`` argument:

.. code-block:: bash

   yadg extract -m filetype infile
   yadg extract --meta-only filetype infile

The metadata are returned as a ``.json`` file, and are generated using the :func:`~xarray.Dataset.to_dict` function of :class:`xarray.Dataset`. They contain a description of the data coordinates (``coords``), dimensions (``dims``), and variables (``data_vars``), and include their names, attributes, dtypes, and shapes.
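
A minimal sketch of inspecting such a metadata file in Python (the file name ``extracted.json`` is a hypothetical placeholder; the layout follows :func:`~xarray.Dataset.to_dict`):

.. code-block:: python

   import json

   with open("extracted.json") as inf:
       meta = json.load(inf)

   print(sorted(meta))                    # e.g. "attrs", "coords", "data_vars", "dims"
   for name, var in meta["data_vars"].items():
       # each variable lists its dims, dtype, shape, and attributes (e.g. units)
       print(name, var["dims"], var["dtype"], var["shape"])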

The list of supported `filetypes` that can be extracted using **yadg** can be found in the left sidebar.
The list of supported `filetypes` that can be extracted using **yadg** can be found in the left sidebar. For more information about the `extractor` concept, see |marda_extractors|_.

.. _parser mode:

`Parser` mode
-------------
The main purpose of yadg is to process a bunch of raw data files according to a provided `dataschema` into a well-defined, annotated, FAIR-data file called `datagram`. As of ``yadg-5.0``, the `datagram` is stored in |NetCDF|_ files. To use **yadg** like this, it should be invoked as follows:
The main purpose of **yadg** is to process a set of raw data files according to a provided `dataschema` into a well-defined, annotated, FAIR-data file called a `datagram`. As of ``yadg-5.0``, the `datagram` is a |NetCDF|_ file. To use **yadg** like this, it should be invoked as follows:

.. code-block:: bash
@@ -67,7 +66,7 @@ Where ``infile`` corresponds to the ``json`` or ``yaml`` file containing the `da
In this fully-featured usage pattern via `dataschema`, **yadg** offloads the responsibility of data extraction and normalisation to its modules, called `parsers`. The currently implemented `parsers` are documented in the sidebar.

`Dataschema` from presets
+++++++++++++++++++++++++
`````````````````````````
This alternative form of using **yadg** in `parser` mode is especially useful when processing data organised in a consistent folder structure between several experimental runs. The user should prepare a `preset` file, which is then patched into a `dataschema` file using the provided folder path:

.. code-block:: bash
@@ -93,8 +92,8 @@ Finally, the raw data files in the processed ``folder`` can be archived, checksu
This will create a `datagram` in ``outfile.json`` as well as a ``outfile.zip`` archive from the whole contents of the specified ``folder``.

`Dataschema` version updater
++++++++++++++++++++++++++++
If you'd like to update a `dataschema` from a previous version of yadg to the current latest one, use the following syntax:
````````````````````````````
If you'd like to update a `dataschema` from a previous version of **yadg** to the latest one, use the following syntax:

.. code-block:: bash
@@ -105,4 +104,8 @@

.. _NetCDF: https://www.unidata.ucar.edu/software/netcdf/

.. _marda_extractors: https://github.com/marda-alliance/metadata_extractors

.. |NetCDF| replace:: ``NetCDF``

.. |marda_extractors| replace:: MaRDA Metadata Extractors WG
2 changes: 1 addition & 1 deletion docs/source/version.3_1.rst
@@ -22,6 +22,6 @@ Major features are:
- ``qftrace``: support for reflection trace measurements, including:

- fitting of quality factor using Lorentzian and naive methods
- fitting of quality factor using Kajfez's circle fitting method [Kajfez1994]_
- fitting of quality factor using Kajfez's circle fitting method

- ``meascsv``: support for in-house MCPT logger for flow and temperature data
2 changes: 1 addition & 1 deletion src/yadg/parsers/chromtrace/__init__.py
@@ -41,7 +41,7 @@
Schema
``````
The data is returned as a :class:`datatree.Datatree`, containing a :class:`xarray.Dataset`
The data is returned as a :class:`datatree.DataTree`, containing a :class:`xarray.Dataset`
for each trace / detector name:
.. code-block:: yaml
