Merge pull request #172 from European-XFEL/data-format-docs

Add some (very) high-level docs for the data format
European-XFEL · Jan 18, 2024 · 71ca9aa
2 parents 524c7f6 + 7736a04
commit 71ca9aa

Showing 2 changed files with 109 additions and 0 deletions.
diff --git a/docs/data-storage.md b/docs/data-storage.md
@@ -0,0 +1,108 @@
+# Data storage
+
+This page describes how DAMNIT saves the data it creates. This will probably not
+be of interest to you unless you're developing DAMNIT or writing tools that read
+data created by DAMNIT.
+
+---
+
+## Overview
+
+There are two types of storage used, both in the `usr/Shared/amore` directory of
+a proposal by default:
+
+- HDF5 files, one per run, saved in the `extracted_data/` subdirectory. These
+  hold the values returned by the variable functions in the context file.
+- A SQLite database named `runs.sqlite`, which contains:
+    - Data entered by the user through the GUI, such as comments or editable
+      variable values.
+    - Summary data for all the variables. The summaries are also stored in the
+      HDF5 files, but they're cached in the database too so that the GUI
+      doesn't have to open every HDF5 file to display the table.
+    - General DAMNIT settings, such as the Slurm partition to use.
+
+The DAMNIT data format specifies the exact structure of the data in the database
+and HDF5 files.
+
+## v1 (current)
+
+v1 changed the way we store images and `xarray` types in a few ways:
+
+- Thumbnails are stored as PNG byte arrays.
+- Support for `xarray.Dataset` objects was added.
+- `Dataset` and `DataArray` objects are stored inline in the NetCDF 4 format.
+  This means we get support for saving all of their properties (e.g.
+  attributes) for free.
+
+The most important change was to the database schema, which moved to a 'long
+narrow' format so that we don't need to change the schema whenever a new
+variable is added. It should also allow for versioning variables in the future.
+
+## v0
+
+v0 refers to the first format, created before we started versioning at
+all. Here's an example v0 HDF5 file for a single run:
+```bash
+p1234_r100.h5
+
+# Start off with a group to hold the summary values
+
+├.reduced
+│ ├scalar       [float64: scalar]
+│ ├ndarray      [float64: scalar]
+│ ├string       [UTF-8 string: 1]
+│ ├dataarray    [float64: scalar]
+│ ├2d_array     [uint8: 100 × 100 × 4] # 2D arrays are treated as images
+│ └rgba_image   [uint8: 75 × 150 × 4]  # Image summaries are downscaled RGBA images
+
+# After the .reduced group we have groups for each variable
+
+├scalar
+│ └data   [int64: scalar]
+├ndarray
+│ └data   [float64: 100]
+├string
+│ └data   [UTF-8 string: 1]
+├dataarray   # The coordinates are saved, but not the dimension-coord mapping
+│ ├data      [float64: 100 × 16 × 512 × 128]
+│ ├dim_0     [int64: 512]
+│ ├dim_1     [int64: 128]
+│ ├module    [int64: 16]
+│ └pulseId   [int64: 300]
+├2d_array
+│ └data   [float64: 1352 × 1196]
+├rgba_image
+│ └data   [uint8: 600 × 1200 × 4]
+```
+
+TL;DR:
+
+- There's a group for all the summaries, and then a group per variable for the
+  object returned from the variable function.
+- The 'main value' of each variable is always stored in the `data` dataset in
+  the variable's group. This only matters for `DataArray` objects, whose groups
+  can hold multiple datasets.
+- 2D arrays are treated as images, and their summary is an RGBA thumbnail.
+
+When it comes to the SQLite database schema, the most important thing to know is
+that there was one big `runs` table:
+
+```
+proposal | runnr | start_time        | added_at         | comment            | var1               | var2               | var3     | var4            |
+---------|-------|-------------------|------------------|--------------------|--------------------|--------------------|----------|-----------------|
+3422     | 1     | 1683091429.382994 | 1683098801.769   | agipd dark         | 300.5509948730469  | 1.666236494202166… | 608      | 0.161871538018  |
+3422     | 2     | 1683091513.36196  | 1683098860.364   | agipd dark         | 300.5509948730469  | 1.672673170105554… | 607      | 0.160357959558  |
+3422     | 3     | 1683091596.528593 | 1683098974.7     | agipd dark         | 300.5509948730469  | 1.676370447967201… | 605      | 0.160361199298  |
+3422     | 4     | 1683096460.784313 | 1683103844.035   | agipd dark         | 300.5509948730469  | 1.72180516528897e… | 603      | 0.159418800086  |
+3422     | 5     | 1683096542.744437 | 1683103901.837   | agipd dark         | 300.5509948730469  | 1.722106935631018… | 601      | 0.159602915718  |
+3422     | 6     | 1683096624.703127 | 1683103959.994   | agipd dark         | 300.5509948730469  | 1.722659908409696… | 607      | 0.161150277002  |
+3422     | 7     | 1683099077.949262 | 1683106416.386   | dark               | 300.5509948730469  | 1.696372237347532… | 596      | 0.167543028426  |
+3422     | 8     | 1683099158.289963 | 1683106530.607   | dark               | 300.5509948730469  | 1.693211743258871… | 600      | 0.169153788618  |
+3422     | 9     | 1683099239.020509 | 1683106603.581   | dark               | 300.5509948730469  | 1.694534876151010… | 596      | 0.167274568394  |
+3422     | 10    | 1683109651.413693 | 1683117096.313   |                    | 300.551025390625   | 1.726133086776826… | 1578     | 0.447119904602  |
+3422     | 11    | 1683109907.527204 | 1683117352.791   | Knife edge scan    | 300.551025390625   | 1.722175329632591… | 1564     | 0.459057441431  |
+```
+
+Every time a variable was added, the table's schema was changed to add another
+column. There are a few other tables in the database, but they're less
+important.
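
The difference between this wide `runs` table and the 'long narrow' layout that v1 moved to can be sketched with Python's built-in `sqlite3` module. This is only an illustration: the trimmed-down `runs` columns and the `run_variables` table with its `name` and `value` columns are hypothetical and are not DAMNIT's actual schema.

```python
import sqlite3

# Hypothetical schemas for illustration only; these are NOT DAMNIT's real tables.
conn = sqlite3.connect(":memory:")

# v0-style "wide" table: one column per variable.
conn.execute(
    "CREATE TABLE runs (proposal INTEGER, runnr INTEGER, comment TEXT, var1 REAL)"
)
conn.execute("INSERT INTO runs VALUES (3422, 1, 'agipd dark', 300.55)")

# Adding a variable to the wide table means changing its schema.
conn.execute("ALTER TABLE runs ADD COLUMN var2 REAL")
conn.execute("UPDATE runs SET var2 = 608 WHERE proposal = 3422 AND runnr = 1")

# "Long narrow" table: one row per (run, variable) pair.
conn.execute(
    """
    CREATE TABLE run_variables (
        proposal INTEGER,
        runnr INTEGER,
        name TEXT,
        value REAL,
        PRIMARY KEY (proposal, runnr, name)
    )
    """
)
conn.executemany(
    "INSERT INTO run_variables VALUES (?, ?, ?, ?)",
    [(3422, 1, "var1", 300.55), (3422, 1, "var2", 608.0)],
)

# A new variable is just another row; the schema stays untouched.
conn.execute("INSERT INTO run_variables VALUES (3422, 1, 'var3', 0.16)")

# Column names of each table (index 1 of each PRAGMA table_info row).
wide_columns = [row[1] for row in conn.execute("PRAGMA table_info(runs)")]
long_columns = [row[1] for row in conn.execute("PRAGMA table_info(run_variables)")]

# Variables recorded for run 1 in the long narrow table.
long_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM run_variables "
        "WHERE proposal = 3422 AND runnr = 1 ORDER BY name"
    )
]
```

In this sketch the wide table gains a column for each new variable, while the long narrow table keeps the same four columns however many variables are stored.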
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -33,4 +33,5 @@ nav:
   - index.md
   - gui.md
   - backend.md
+  - data-storage.md
   - contact.md