This document describes how to generate formatted metadata records from data spreadsheet or CSV (comma separated values) formats. Given metadata as a spreadsheet, we need to convert it into valid xml files that conform with the Dublin Core specifications. These files can then be be mapped to facets in B2FIND.
All functionality needed to convert spreadsheet data into well-formatted xml files is provided by the mdmanager.py script. This functionality can be accessed by using the processing mode g, i.e. with the otption --mode g
and by specifying a source file (the raw spreadsheet) using the option -s SOURCE
. We will use the script in this manner in this part of the tutorial.
The usage and dependancies are described in detail in the corresponding paragraph in 00.mdmanager-script.md.
Some examples of 'raw metadata', i.e. comma- and tab- seperated lists, exist in the directory samples/RAW_data.
ls samples/RAW_data
annotations.tsv locations.tsv sample.csv smallData.tsv
Up to this moment, this part of the training has been fully tested with the comma separated lists in the file sample.csv
.
In the following, we show how you can convert comma- or tab- seperated files into valid XML files in the DublinCore metadata schema. Bellow we will elaborate on how several options that were mentioned previously can be used.
To create multiple XML files in the 'Dublincore' metadata format, given a list of datasets as comma-separated values.
For this part we use the python script mdmanager.py
in the g
mode and specify a source file using the -s SOURCE
. option.
In the first step, we define the mapping of each field in the header line of the sample file
$ head -1 samples/RAW_data/sample.csv
Common name,Scientific name,Location,Temperament,Diet,Water,Size,Region of the Aquarium,Breeding
to one of the DublinCore elements (see mapfiles/dcelements.txt
) or the DublinCore terms (see mapfiles/dcterms.txt
).
You are free to assign the original fields to any one of the DublinCore fields. The only fields that are mandatory are dc:title
and dc:identifier
. The value of the latter is used to build the filename of the generated XML record.
Beside the mode (--mode g
stands for generation) and the source (-s FILE
, the path to the csv file) the following options are mandatory:
-c COMMUNITY
: a project or community name. Here, as in the rest of out training modules, we use the project name fishproject as an example.--mdprefix MDFORMAT
: This is the target metadata schema of the generation mode. Here we choose oai_dc, i.e. the Dublincore metadata schema.
Optionally, a subset and the delimiter of the original source file can be specified.
--mdsubset MDSUBSET
: The subset that determines the sub directory in which the generated XML files reside. Default is the subsetSET_1
, here we choosesample
, resulting in the sub directorysample_1
of the output path.--verb DELIMITER
: specifies the delimiter by which the sample data are separated. By default, and in our example, the deliminator is set to a comma.
From these specific parameters, the name for the mapfile is built as mapfiles/<community>-<mdprefix>.csv
, and the output path is generated as oaidata/<community>-<mdprefix>/<mdsubset>_1
.
The first time that the script is called, there will be no mapfile available. So you asked to enter a valid DublinCore element or term for each original header field:
./mdmanager.py --mode g -c fishproject -s samples/RAW_data/sample.csv --mdprefix oai_dc --mdsubset sample --verb comma
...
|- Generation started : 2016-10-18 17:51:34
|- Original fields:
['Common name', 'Scientific name', 'Location', 'Temperament', 'Diet', 'Water', 'Size', 'Region of the Aquarium', 'Breeding']
|- Generate mapfile mapfiles/fishproject-oai_dc.csv
Target field for Common name : dc:identifier
Target field for Scientific name : dc:title
Target field for Location : dcterms:spatial
Target field for Temperament : dc:type
Target field for Diet : dcterms:requires
Target field for Water : dcterms:medium
Target field for Size : dcterms:extent
Target field for Region of the Aquarium : dcterms:provenance
Target field for Breeding : dcterms:created
|- Generate XML files in oaidata/fishproject-oai_dc/sample_1/xml
|--> Compressiceps.xml
|--> Frontosa.xml
|--> GoldenJulie.xml
|--> ConvictJulie.xml
|--> PrincessCichlid.xml
|--> Cylindricus.xml
|--> Multi.xml
|--> FiveBarCichlid.xml
Note : The community name is given as fishproject. In all subsequent steps, the community name should not be changed. It denotes the owner of the data and is used to tie all subsequent steps together. This command does not only generate the mapping file
fishproject-oai_dc.csv
, but the mapping is excecuted and the associated XML files are written to the output directoryoaidata/fishproject-oai_dc/sample_1/xml/
.
Note : An issue that we deal with here is that of semantic mapping and homogenisation of a research-specific vocabulary to a common, restricted vocabulary or schema, such as DublinCore. The mapping in the example above is quite "formal". For instance
Diet
is mapped todcterms:requires
, even though the intention of the DublinCore consortium by introducing the fieldrequires
was most probably not in the sense thata plant requires a specific diet
. So, if we want to cover these specific biological properties we have to add an extented shema, which covers the propertyDiet
orTemparament
in this context.
For this use case we search for metadata specification of fish in lakes
. A good starting point to find an appropriate metadata specification is the overview over Disciplinary Metadata provided by DCC. Just search in the area Biology
for a standard, which describes properties of fish in lakes. (To be honest it is hard to find a schema that covers properties as Diet
or Temperament
. But the standards DarwinCore
or EML
may at least cover some of the Biology specific
properties.)
Repeat the generation and do your own mapping by adapting the mapping rules in the mapfile to your needs. Below we demonstrate the contents of the mapfile. The seperator >
is used to seperate the original field name and the target Dublincore element or term :
$ vi mapfiles/fishproject-oai_dc.csv
Common name>dc:identifier
Scientific name>dc:title
Location>dcterms:spatial
Temperament>dc:type
Diet>dcterms:requires
Water>dcterms:medium
Size>dcterms:extent
Region of the Aquarium>dcterms:provenance
Breeding>dcterms:created
Check the results, the generated XML files in the output directory oaidata/fishproject-oai_dc/sample_1/xml/
. For example, the contents of one of the files that is generated using the above maping can be seen bellow:
less oaidata/fishproject-oai_dc/sample_1/xml/Compressiceps.xml
<?xml version="1.0"?>
<metadata
xmlns="http://example.org/myapp/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://purl.org/dc/elements/1.1/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:dcterms="http://purl.org/dc/terms/">
<dc:identifier>Compressiceps</dc:identifier>
<dc:title>Haplochromis compressiceps</dc:title>
<dc:type>Territorial</dc:type>
<dcterms:created>Hard</dcterms:created>
<dcterms:extent>5 inches</dcterms:extent>
<dcterms:medium>PH 7.0 - 8.0, Temp. 73 - 77 F</dcterms:medium>
<dcterms:provenance>Bottom</dcterms:provenance>
<dcterms:requires>Omnivore</dcterms:requires>
<dcterms:spatial>Lake Tanganyika</dcterms:spatial>
</metadata>
You can check the files using an XML validator software, for instance http://www.xmlvalidation.com/.
Here the sample metadata are given in a tab separated tsv
file, locations.tsv
.
./mdmanager.py --mode g -c seisnet -s samples/RAW_data/locations.tsv --mdprefix oai_dc --mdsubset locations --verb tab
Version: 2.0
Run mode: Generation
Start : 2016-12-15 10:05:05
|- Generation started : 2016-12-15 10:05:05
|- From tab-separated spreadsheet
samples/RAW_data/locations.tsv
|- with original fields (headline)
['Project', 'Creators', 'Publisher', 'Subject', 'Created', 'Location-Coord', 'Location-Name']
|- using existing mapfile mapfiles/seisnet-oai_dc.csv
|- generate XML files in oaidata/seisnet-oai_dc/locations_1/xml
|--> TienshanPamirGeodynamicsProjectTIPAGE.xml
|--> LakeTobaseismicnetwork-Sumatra-Indonesia.xml
........
As you can see, we already provide a mapfile mapfiles/seisnet-oai_dc.csv
with rules to map the original fields to Dublincore elements and terms. Feel free to change this mapping as you please, and check the effect on the 25 resulting XML files in oaidata/seisnet-oai_dc/locations_1/xml
.