Skip to content

Latest commit

 

History

History
162 lines (128 loc) · 9.52 KB

01.b-generate-metadata.md

File metadata and controls

162 lines (128 loc) · 9.52 KB

Generating metadata

This document describes how to generate formatted metadata records from data spreadsheet or CSV (comma separated values) formats. Given metadata as a spreadsheet, we need to convert it into valid xml files that conform with the Dublin Core specifications. These files can then be be mapped to facets in B2FIND.

Prerequisites

1. Using the mdmanager.py Python script in Generation mode

All functionality needed to convert spreadsheet data into well-formatted xml files is provided by the mdmanager.py script. This functionality can be accessed by using the processing mode g, i.e. with the otption --mode g and by specifying a source file (the raw spreadsheet) using the option -s SOURCE. We will use the script in this manner in this part of the tutorial. The usage and dependancies are described in detail in the corresponding paragraph in 00.mdmanager-script.md.

2. Use Cases: Some samples of raw data

Some examples of 'raw metadata', i.e. comma- and tab- seperated lists, exist in the directory samples/RAW_data.

ls samples/RAW_data
annotations.tsv  locations.tsv  sample.csv  smallData.tsv

Up to this moment, this part of the training has been fully tested with the comma separated lists in the file sample.csv.

In the following, we show how you can convert comma- or tab- seperated files into valid XML files in the DublinCore metadata schema. Bellow we will elaborate on how several options that were mentioned previously can be used.

1. Generating multiple XML files in batch mode

Goal

To create multiple XML files in the 'Dublincore' metadata format, given a list of datasets as comma-separated values.

Solution

For this part we use the python script mdmanager.py in the g mode and specify a source file using the -s SOURCE. option.

1.2. Creating a mapfile

In the first step, we define the mapping of each field in the header line of the sample file

$ head -1 samples/RAW_data/sample.csv 
Common name,Scientific name,Location,Temperament,Diet,Water,Size,Region of the Aquarium,Breeding

to one of the DublinCore elements (see mapfiles/dcelements.txt) or the DublinCore terms (see mapfiles/dcterms.txt).

You are free to assign the original fields to any one of the DublinCore fields. The only fields that are mandatory are dc:title and dc:identifier. The value of the latter is used to build the filename of the generated XML record.

Beside the mode (--mode g stands for generation) and the source (-s FILE, the path to the csv file) the following options are mandatory:

  • -c COMMUNITY : a project or community name. Here, as in the rest of out training modules, we use the project name fishproject as an example.
  • --mdprefix MDFORMAT : This is the target metadata schema of the generation mode. Here we choose oai_dc, i.e. the Dublincore metadata schema.

Optionally, a subset and the delimiter of the original source file can be specified.

  • --mdsubset MDSUBSET : The subset that determines the sub directory in which the generated XML files reside. Default is the subset SET_1, here we choose sample, resulting in the sub directory sample_1 of the output path.
  • --verb DELIMITER: specifies the delimiter by which the sample data are separated. By default, and in our example, the deliminator is set to a comma.

From these specific parameters, the name for the mapfile is built as mapfiles/<community>-<mdprefix>.csv, and the output path is generated as oaidata/<community>-<mdprefix>/<mdsubset>_1.

The first time that the script is called, there will be no mapfile available. So you asked to enter a valid DublinCore element or term for each original header field:

./mdmanager.py --mode g -c fishproject -s samples/RAW_data/sample.csv --mdprefix oai_dc --mdsubset sample --verb comma
...
|- Generation started : 2016-10-18 17:51:34
 |- Original fields:
	['Common name', 'Scientific name', 'Location', 'Temperament', 'Diet', 'Water', 'Size', 'Region of the Aquarium', 'Breeding']
 |- Generate mapfile	mapfiles/fishproject-oai_dc.csv
Target field for Common name : dc:identifier
Target field for Scientific name : dc:title
Target field for Location : dcterms:spatial
Target field for Temperament : dc:type
Target field for Diet : dcterms:requires
Target field for Water : dcterms:medium
Target field for Size : dcterms:extent
Target field for Region of the Aquarium : dcterms:provenance
Target field for Breeding : dcterms:created
 |- Generate XML files in oaidata/fishproject-oai_dc/sample_1/xml
  |--> Compressiceps.xml
  |--> Frontosa.xml
  |--> GoldenJulie.xml
  |--> ConvictJulie.xml
  |--> PrincessCichlid.xml
  |--> Cylindricus.xml
  |--> Multi.xml
  |--> FiveBarCichlid.xml

Note : The community name is given as fishproject. In all subsequent steps, the community name should not be changed. It denotes the owner of the data and is used to tie all subsequent steps together. This command does not only generate the mapping file fishproject-oai_dc.csv, but the mapping is excecuted and the associated XML files are written to the output directory oaidata/fishproject-oai_dc/sample_1/xml/.

Excercise 2.1: Configure your own mapping

Note : An issue that we deal with here is that of semantic mapping and homogenisation of a research-specific vocabulary to a common, restricted vocabulary or schema, such as DublinCore. The mapping in the example above is quite "formal". For instance Diet is mapped to dcterms:requires, even though the intention of the DublinCore consortium by introducing the field requires was most probably not in the sense that a plant requires a specific diet. So, if we want to cover these specific biological properties we have to add an extented shema, which covers the property Diet or Temparament in this context.

For this use case we search for metadata specification of fish in lakes. A good starting point to find an appropriate metadata specification is the overview over Disciplinary Metadata provided by DCC. Just search in the area Biology for a standard, which describes properties of fish in lakes. (To be honest it is hard to find a schema that covers properties as Diet or Temperament. But the standards DarwinCore or EML may at least cover some of the Biology specific properties.)

Repeat the generation and do your own mapping by adapting the mapping rules in the mapfile to your needs. Below we demonstrate the contents of the mapfile. The seperator > is used to seperate the original field name and the target Dublincore element or term :

$ vi mapfiles/fishproject-oai_dc.csv
Common name>dc:identifier
Scientific name>dc:title
Location>dcterms:spatial
Temperament>dc:type
Diet>dcterms:requires
Water>dcterms:medium
Size>dcterms:extent
Region of the Aquarium>dcterms:provenance
Breeding>dcterms:created

Excercise 2.2: Check and validate the XML file

Check the results, the generated XML files in the output directory oaidata/fishproject-oai_dc/sample_1/xml/. For example, the contents of one of the files that is generated using the above maping can be seen bellow:

less oaidata/fishproject-oai_dc/sample_1/xml/Compressiceps.xml
<?xml version="1.0"?>

<metadata
    xmlns="http://example.org/myapp/"	
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://purl.org/dc/elements/1.1/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/">

        <dc:identifier>Compressiceps</dc:identifier>
        <dc:title>Haplochromis compressiceps</dc:title>
        <dc:type>Territorial</dc:type>
        <dcterms:created>Hard</dcterms:created>
        <dcterms:extent>5 inches</dcterms:extent>
        <dcterms:medium>PH 7.0 - 8.0, Temp. 73 - 77 F</dcterms:medium>
        <dcterms:provenance>Bottom</dcterms:provenance>
        <dcterms:requires>Omnivore</dcterms:requires>
        <dcterms:spatial>Lake Tanganyika</dcterms:spatial>
</metadata>

You can check the files using an XML validator software, for instance http://www.xmlvalidation.com/.

Excercise 2.3: Generate metadata records for seismic metadata

Here the sample metadata are given in a tab separated tsv file, locations.tsv.

./mdmanager.py --mode g -c seisnet -s samples/RAW_data/locations.tsv --mdprefix oai_dc --mdsubset locations --verb tab

Version:  	2.0
Run mode:   	Generation
Start : 	2016-12-15 10:05:05


|- Generation started : 2016-12-15 10:05:05
 |- From tab-separated spreadsheet
	samples/RAW_data/locations.tsv
 |- with original fields (headline)
	['Project', 'Creators', 'Publisher', 'Subject', 'Created', 'Location-Coord', 'Location-Name']
 |- using existing mapfile	mapfiles/seisnet-oai_dc.csv
 |- generate XML files in oaidata/seisnet-oai_dc/locations_1/xml
  |--> TienshanPamirGeodynamicsProjectTIPAGE.xml
  |--> LakeTobaseismicnetwork-Sumatra-Indonesia.xml
........

As you can see, we already provide a mapfile mapfiles/seisnet-oai_dc.csv with rules to map the original fields to Dublincore elements and terms. Feel free to change this mapping as you please, and check the effect on the 25 resulting XML files in oaidata/seisnet-oai_dc/locations_1/xml.