Helpers to fetch and build the databases required for the LABE API server. Specifically, we need:
- checks for new OCI/COCI releases, from https://opencitations.net/download
- regular dumps of internal SOLR indices and conversion into sqlite3 databases, using solrdump
The luigi task orchestrator is already used by SLUB Dresden, so we'll use it for modelling the dependency graph of tasks.
Everything lives in files.
We build a single executable with shiv (see LI ENG blog post).
shiv allows us to create a single binary artifact from a Python project that includes all of its dependencies. The only thing required to run a full-fledged Python application is an interpreter.
- check for new OCI dump
- download OCI dump
- turn OCI dump into a more manageable format (single zstd file)
- fetch SOLR index copy, e.g. via solrdump
- turn SOLR JSON files into (id, doc) TSV
- create sqlite3 database from TSV with makta
- move databases into place
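The "SOLR JSON files into (id, doc) TSV" step above can be sketched in a few lines of Python. This is only an illustration, not the project's actual code: it assumes the dump holds one JSON document per line and that each document carries an `id` field; the function name `docs_to_tsv` is hypothetical.

```python
import io
import json


def docs_to_tsv(lines, out, id_field="id"):
    """Write one "id<TAB>doc" row per JSON document; return the row count."""
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        # Re-serialize compactly. json.dumps escapes tabs and newlines inside
        # strings, so each document stays on a single TSV row.
        blob = json.dumps(doc, separators=(",", ":"))
        out.write("{}\t{}\n".format(doc[id_field], blob))
        count += 1
    return count


if __name__ == "__main__":
    # Two made-up documents, separated by a blank line.
    sample = io.StringIO('{"id": "x1", "title": "A"}\n\n{"id": "x2", "title": "B"}\n')
    sink = io.StringIO()
    docs_to_tsv(sample, sink)
    print(sink.getvalue(), end="")
```

A two-column file of this shape is what the following step loads into sqlite3.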
Additionally, we want:
- monitoring: a notification to a service email address if a task fails
- a delta report
- cache warmup, if necessary (yes, it is; currently squashed into the "cron" line)
- cleanup of obsolete task outputs
Constraints: We will only have disk space for a single update. We may want to
reduce the index data size, e.g. reduce in a pipe while dumping from SOLR or
select a number of fields (solrdump -fl ...). Update: we abridge the SOLR
documents ("short") to save disk space; one complete update with intermediate
artifacts occupies about 310G as of 01/2022.
- run "labe.pyz -r CombinedUpdate", e.g. daily
- run "rm -f $(labe.pyz --list-deletable)", e.g. daily
Example cron: roles/labe/tasks/main.yml
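How `--list-deletable` decides what is obsolete is not spelled out here. As a rough sketch under one plausible assumption (a file is deletable when no symlink in the same directory, such as `current`, points to it), the idea could look like this; `list_deletable` is a hypothetical helper, not the project's function.

```python
import os
import tempfile


def list_deletable(directory):
    """Return regular files in directory that no symlink in it points to."""
    entries = [os.path.join(directory, name) for name in sorted(os.listdir(directory))]
    # Resolved targets of all symlinks, e.g. "current" or "current-main".
    keep = {os.path.realpath(p) for p in entries if os.path.islink(p)}
    return [
        p for p in entries
        if not os.path.islink(p)
        and os.path.isfile(p)
        and os.path.realpath(p) not in keep
    ]


if __name__ == "__main__":
    # Build a throwaway directory with an old and a new output file.
    with tempfile.TemporaryDirectory() as d:
        for name in ("date-2022-01-09.db", "date-2022-01-10.db"):
            open(os.path.join(d, name), "w").close()
        os.symlink(os.path.join(d, "date-2022-01-10.db"), os.path.join(d, "current"))
        for path in list_deletable(d):
            print(os.path.basename(path))
```

Under this assumption, only outputs that are no longer referenced by any "current" link would be handed to `rm -f`.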
- 13G + 150G + 42G + 5.5G + 2.5G = 213G in databases
- a current set of outputs and intermediate files: 308G
- total disk space: 1007G
Currently (01/2022) we can accommodate two full copies at the same time. We have headroom for a 30% increase in data size before we need to take additional measures (e.g. also deleting intermediate artifacts).
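The disk space figures above can be checked with a few lines of arithmetic. The numbers are taken from the list; the variable names are only for illustration.

```python
databases_gb = [13, 150, 42, 5.5, 2.5]  # database sizes listed above
full_copy_gb = 308                      # outputs plus intermediate files
disk_gb = 1007                          # total disk space

total_db_gb = sum(databases_gb)
two_copies_gb = 2 * full_copy_gb
after_growth_gb = two_copies_gb * 1.30  # two copies after a 30% increase

print(total_db_gb)                 # 213.0
print(two_copies_gb)               # 616
print(round(after_growth_gb, 1))   # 800.8
print(after_growth_gb < disk_gb)   # True
```

Two full copies at 30% growth would take roughly 801G, which still fits on the 1007G disk.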
$ tree -sh
.
├── [4.0K] IdMappingDatabase
│ ├── [ 67] current -> /usr/local/share/labe/IdMappingDatabase/date-2022-01-10.db
│ └── [ 13G] date-2022-01-10.db
├── [4.0K] IdMappingTable
│ ├── [ 69] current -> /usr/local/share/labe/IdMappingTable/date-2022-01-10.tsv.zst
│ └── [452M] date-2022-01-10.tsv.zst
├── [4.0K] OpenCitationsDatabase
│ ├── [150G] c90e82e35c9d02c00f81bee6d1f34b132953398c.db
│ └── [ 96] current -> /usr/local/share/labe/OpenCitationsDatabase/c90e82e35c9d02c00f81bee6d1f34b132953398c.db
├── [4.0K] OpenCitationsDownload
│ ├── [ 31G] c90e82e35c9d02c00f81bee6d1f34b132953398c.zip
│ └── [ 97] current -> /usr/local/share/labe/OpenCitationsDownload/c90e82e35c9d02c00f81bee6d1f34b132953398c.zip
├── [4.0K] OpenCitationsSingleFile
│ ├── [ 31G] c90e82e35c9d02c00f81bee6d1f34b132953398c.zst
│ └── [ 99] current -> /usr/local/share/labe/OpenCitationsSingleFile/c90e82e35c9d02c00f81bee6d1f34b132953398c.zst
├── [4.0K] SolrDatabase
│ ├── [ 76] current-ai-short -> /usr/local/share/labe/SolrDatabase/date-2022-01-10-name-ai-short.db
│ ├── [ 78] current-main -> /usr/local/share/labe/SolrDatabase/date-2022-01-09-name-main-short.db
│ ├── [ 78] current-main-short -> /usr/local/share/labe/SolrDatabase/date-2022-01-10-name-main-short.db
│ ├── [ 83] current-slub-production -> /usr/local/share/labe/SolrDatabase/date-2022-01-10-name-slub-production.db
│ ├── [5.5G] date-2022-01-09-name-main-short.db
│ ├── [ 42G] date-2022-01-10-name-ai-short.db
│ ├── [5.5G] date-2022-01-10-name-main-short.db
│ └── [2.5G] date-2022-01-10-name-slub-production.db
└── [4.0K] SolrFetchDocs
  ├── [ 78] current-ai-short -> /usr/local/share/labe/SolrFetchDocs/date-2022-01-10-name-ai-short.zst
  ├── [ 74] current-main -> /usr/local/share/labe/SolrFetchDocs/date-2022-01-10-name-main.zst
  ├── [ 80] current-main-short -> /usr/local/share/labe/SolrFetchDocs/date-2022-01-10-name-main-short.zst
  ├── [ 85] current-slub-production -> /usr/local/share/labe/SolrFetchDocs/date-2022-01-10-name-slub-production.zst
  ├── [5.7G] date-2022-01-10-name-ai-short.zst
  ├── [968M] date-2022-01-10-name-main-short.zst
  ├── [ 20G] date-2022-01-10-name-main.zst
  └── [231M] date-2022-01-10-name-slub-production.zst
7 directories, 26 files
Most tasks are templated shell commands.
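To illustrate what a templated shell command can look like, here is a minimal sketch using only the Python standard library. The helper `render` and the example command are hypothetical and not taken from the project; the point is merely that parameters are quoted before they are substituted into the command string.

```python
import shlex
from string import Template


def render(template, **params):
    """Fill a shell command template, quoting every parameter value."""
    quoted = {key: shlex.quote(str(value)) for key, value in params.items()}
    return Template(template).substitute(quoted)


if __name__ == "__main__":
    # Hypothetical command: repack a zip archive as a single zstd stream.
    cmd = render(
        "unzip -p $input | zstd -c > $output",
        input="/tmp/oci dump.zip",
        output="/tmp/oci.zst",
    )
    print(cmd)
```

The rendered string could then be passed to a shell, e.g. via `subprocess.run(cmd, shell=True, check=True)`. Quoting each value keeps paths with spaces from breaking the command.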