Merge pull request #172 from European-XFEL/data-format-docs
Add some (very) high-level docs for the data format
JamesWrigley authored Jan 18, 2024
2 parents 524c7f6 + 7736a04 commit 71ca9aa
Showing 2 changed files with 109 additions and 0 deletions.
108 changes: 108 additions & 0 deletions docs/data-storage.md
@@ -0,0 +1,108 @@
# Data storage

This page describes how DAMNIT saves the data it creates. This will probably not
be of interest to you unless you're developing DAMNIT or writing tools that read
data created by DAMNIT.

---

## Overview

There are two types of storage used, both in the `usr/Shared/amore` directory of
a proposal by default:

- HDF5 files, one for each run, are saved in the `extracted_data/` subdirectory.
  These hold the values returned by the variable functions in the context file.
- A SQLite database named `runs.sqlite` that contains things like:
    - Data entered by the user through the GUI, such as comments or editable
      variable values.
    - Summary data for all the variables. The summaries are also stored in the
      HDF5 files, but they're cached in the database too so that the GUI doesn't
      have to open many HDF5 files to display the table.
    - General DAMNIT settings, such as the Slurm partition to use.

The DAMNIT data format details the exact structure of the data in the database
and HDF5 files.
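
As a rough illustration of the default layout described above, the two storage locations can be derived from a proposal directory. This is only a sketch: the function name `damnit_storage_paths` and the placeholder proposal path are made up for the example, and a deployment that overrides the default location would differ.

```python
from pathlib import Path


def damnit_storage_paths(proposal_dir):
    """Return the default database path and HDF5 directory for a proposal.

    Hypothetical helper, not part of DAMNIT itself.
    """
    root = Path(proposal_dir) / "usr" / "Shared" / "amore"
    return {
        "database": root / "runs.sqlite",       # the SQLite database
        "hdf5_dir": root / "extracted_data",    # per-run HDF5 files live here
    }


# Placeholder path, substitute a real proposal directory.
paths = damnit_storage_paths("/path/to/proposal")
print(paths["database"])
print(paths["hdf5_dir"])
```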

## v1 (current)

In v1 there were a few changes to the way we store images and `xarray` types:

- Thumbnails are stored as PNG byte arrays.
- Support for `xarray.Dataset` objects was added.
- `Dataset` and `DataArray` objects are stored inline in the NetCDF 4 format.
  This means we get support for saving all of their properties (e.g. attributes)
  for free.

The most important change was to the database schema, which moved to a 'long
narrow' format so that we don't need to change the schema whenever a new
variable is added. It should also allow for versioning variables in the future.
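
To make the 'long narrow' idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `run_variables` and its columns are assumptions chosen for illustration, not the actual v1 schema. The point is only that each (run, variable) pair is a row, so a new variable needs no schema change.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Hypothetical table and column names, for illustration only.
con.execute("""
    CREATE TABLE run_variables (
        proposal INTEGER,
        run      INTEGER,
        name     TEXT,
        value,
        PRIMARY KEY (proposal, run, name)
    )
""")
con.executemany(
    "INSERT INTO run_variables VALUES (?, ?, ?, ?)",
    [(3422, 1, "var1", 300.55), (3422, 1, "var3", 608), (3422, 2, "var3", 607)],
)

# A brand-new variable is just another row: the schema stays the same.
con.execute(
    "INSERT INTO run_variables VALUES (?, ?, ?, ?)", (3422, 1, "new_var", "hello")
)

columns = [row[1] for row in con.execute("PRAGMA table_info(run_variables)")]
run1 = dict(
    con.execute(
        "SELECT name, value FROM run_variables WHERE proposal = ? AND run = ?",
        (3422, 1),
    )
)
print(columns)
print(run1)
```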

## v0

v0 refers to the first format, created before we started versioning at
all. Here's an example v0 HDF5 file for a single run:
```bash
p1234_r100.h5

# Start off with a group to hold the summary values

├.reduced
│ ├scalar [float64: scalar]
│ ├ndarray [float64: scalar]
│ ├string [UTF-8 string: 1]
│ ├dataarray [float64: scalar]
│ ├2d_array [uint8: 100 × 100 × 4] # 2D arrays are treated as images
│ └rgba_image [uint8: 75 × 150 × 4] # Image summaries are downscaled RGBA images

# After the .reduced group we have groups for each variable

├scalar
│ └data [int64: scalar]
├ndarray
│ └data [float64: 100]
├string
│ └data [UTF-8 string: 1]
├dataarray # The coordinates are saved, but not the dimension-coord mapping
│ ├data [float64: 100 × 16 × 512 × 128]
│ ├dim_0 [int64: 512]
│ ├dim_1 [int64: 128]
│ ├module [int64: 16]
│ └pulseId [int64: 300]
├2d_array
│ └data [float64: 1352 × 1196]
├rgba_image
│ └data [uint8: 600 × 1200 × 4]
```

TL;DR:

- There's a group for all the summaries, and then one group per variable for the
  object returned from the variable function.
- The 'main value' of each variable is always stored in the `data` dataset in
  the variable's group. This is only relevant for `DataArray` objects, whose
  groups can hold multiple datasets.
- 2D arrays are treated as images, and their summary is an RGBA thumbnail.
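
The layout rules above translate into two dataset paths per variable. The sketch below builds them and reads both values from any object that supports indexing by path; the helper names are hypothetical, and a plain dictionary stands in for an open HDF5 file so that the example has no third-party dependencies.

```python
def summary_path(variable):
    """Path of a variable's summary value in the `.reduced` group."""
    return f".reduced/{variable}"


def data_path(variable):
    """Path of a variable's 'main value' in its own group."""
    return f"{variable}/data"


def read_variable(h5file, variable):
    """Return (summary, main value) from a mapping indexed by dataset path."""
    return h5file[summary_path(variable)], h5file[data_path(variable)]


# Stand-in for an open v0 file: a dict keyed by dataset path.
fake_file = {".reduced/scalar": 42.0, "scalar/data": 42}
summary, value = read_variable(fake_file, "scalar")
print(summary, value)
```

An open `h5py.File` accepts the same slash-separated paths, in which case the function would return dataset objects rather than plain values.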

When it comes to the SQLite database schema, the most important thing to know is
that there was one big `runs` table:

```
proposal | runnr | start_time | added_at | comment | var1 | var2 | var3 | var4 |
---------|-------|-------------------|------------------|--------------------|--------------------|--------------------|----------|-----------------|
3422 | 1 | 1683091429.382994 | 1683098801.769 | agipd dark | 300.5509948730469 | 1.666236494202166… | 608 | 0.161871538018 |
3422 | 2 | 1683091513.36196 | 1683098860.364 | agipd dark | 300.5509948730469 | 1.672673170105554… | 607 | 0.160357959558 |
3422 | 3 | 1683091596.528593 | 1683098974.7 | agipd dark | 300.5509948730469 | 1.676370447967201… | 605 | 0.160361199298 |
3422 | 4 | 1683096460.784313 | 1683103844.035 | agipd dark | 300.5509948730469 | 1.72180516528897e… | 603 | 0.159418800086 |
3422 | 5 | 1683096542.744437 | 1683103901.837 | agipd dark | 300.5509948730469 | 1.722106935631018… | 601 | 0.159602915718 |
3422 | 6 | 1683096624.703127 | 1683103959.994 | agipd dark | 300.5509948730469 | 1.722659908409696… | 607 | 0.161150277002 |
3422 | 7 | 1683099077.949262 | 1683106416.386 | dark | 300.5509948730469 | 1.696372237347532… | 596 | 0.167543028426 |
3422 | 8 | 1683099158.289963 | 1683106530.607 | dark | 300.5509948730469 | 1.693211743258871… | 600 | 0.169153788618 |
3422 | 9 | 1683099239.020509 | 1683106603.581 | dark | 300.5509948730469 | 1.694534876151010… | 596 | 0.167274568394 |
3422 | 10 | 1683109651.413693 | 1683117096.313 | | 300.551025390625 | 1.726133086776826… | 1578 | 0.447119904602 |
3422 | 11 | 1683109907.527204 | 1683117352.791 | Knife edge scan | 300.551025390625 | 1.722175329632591… | 1564 | 0.459057441431 |
```

Every time a variable was added, another column was added to the table by
altering its schema. There are a few other tables in the database, but they're
less important.
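
The cost of the wide `runs` table can be sketched with Python's built-in `sqlite3` module. The helper `add_variable_column` is hypothetical and only approximates what adding a variable involved; DAMNIT's actual v0 code is not shown here.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE runs (proposal, runnr, start_time, added_at, comment)")
con.execute(
    "INSERT INTO runs VALUES (?, ?, ?, ?, ?)",
    (3422, 1, 1683091429.382994, 1683098801.769, "agipd dark"),
)


def add_variable_column(con, name):
    """Add a column for a new variable if it doesn't exist yet (illustrative)."""
    existing = [row[1] for row in con.execute("PRAGMA table_info(runs)")]
    if name not in existing:
        # Identifiers can't be bound as SQL parameters, so quote the name.
        quoted = '"' + name.replace('"', '""') + '"'
        con.execute(f"ALTER TABLE runs ADD COLUMN {quoted}")


# Every new variable forces a schema change on the whole table.
add_variable_column(con, "var1")
con.execute(
    "UPDATE runs SET var1 = ? WHERE proposal = ? AND runnr = ?",
    (300.5509948730469, 3422, 1),
)

columns = [row[1] for row in con.execute("PRAGMA table_info(runs)")]
var1 = con.execute("SELECT var1 FROM runs").fetchone()[0]
print(columns)
print(var1)
```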
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -33,4 +33,5 @@ nav:
- index.md
- gui.md
- backend.md
- data-storage.md
- contact.md
