
Provided Software

Marlin Schäfer edited this page Dec 14, 2021 · 13 revisions

This page describes how to use the two scripts provided in this repository, generate_data.py and evaluate.py. It does not describe auxiliary code or implementation details. If you encounter problems, find bugs, or need help, please contact us by email at [email protected] or via Slack. For more details see our support page.

Requirements

To run the code in this repository, a working installation of Python 3.7 or higher and a sufficiently recent version of PyCBC are required. If you need to install Python 3.7, make sure to also install the matching Python development libraries. On Ubuntu the commands are

sudo apt-get install python3.7
sudo apt-get install python3.7-dev

We recommend using a virtual environment for this mock data challenge. To create and activate one you can use virtualenv

virtualenv -p python3.7 <env-name>
source <env-name>/bin/activate

To install an appropriate version of PyCBC, install the requirements from inside the cloned repository (the git clone command is given below):

pip install --upgrade pip setuptools
pip install -r requirements.txt

To download the code, clone this repository to a suitable location on your machine by executing the command below in the desired directory.

git clone https://github.com/gwastro/ml-mock-data-challenge-1.git

generate_data.py

This script contains the code to generate mock data for testing. Its behavior is controlled by several command-line options. An example call specifying the most common options is

./generate_data.py \
--dataset 1 \
--output-injection-file injections.hdf \
--output-foreground-file foreground.hdf \
--output-background-file background.hdf \
--seed 42 \
--start-offset 0 \
--duration 32000 \
--verbose
  • The --dataset option specifies how the noise is generated and which injections are made. For details please refer to this page.
  • All options prefixed by output specify where files generated by the code will be stored. The --output-injection-file contains the parameters of the injected signals. --output-background-file contains the pure detector noise. --output-foreground-file contains the same noise with signals injected into it. For details on the structure of the foreground and background files please refer to this page.
  • The --seed is used to make the noise and signal generation reproducible. Two calls to the script with the same --dataset, --seed, --start-offset, and --duration will yield identical results. If the seed is not specified, it defaults to 0 and is therefore not random. To use a random seed on each invocation of the program, pass a negative number as the seed.
  • The --start-offset specifies at which time to start generating noise. It must be greater than or equal to zero. All noise is generated relative to the same reference time, so this option tells the code how much data to skip at the beginning before it starts generating. You may want to alter this value if you want to produce a large amount of noise on multiple machines in parallel. In that case the --start-offset for the second call would be the value given to --duration in the first call.
  • The --duration specifies how much data is generated (in seconds). Note that in total only 7111579 seconds of data are available and, for technical reasons, no more than 7024699 seconds should be requested. We recommend staying well below these limits.
  • The option --verbose prints status updates to the screen.
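
The parallel use of --start-offset and --duration described above can be sketched in Python. This is a minimal sketch and not part of the repository: the helper names plan_chunks and build_command and the per-chunk output file names are hypothetical, reusing the same seed for every chunk is an assumption, and treating the 7024699 second limit as a bound on the total span is a conservative choice. The script only prints the commands, so each one can be copied to a different machine.

```python
import shlex

# Upper limit on the number of seconds that should be requested (quoted above).
MAX_DURATION = 7024699


def plan_chunks(total_duration, n_chunks):
    """Split total_duration seconds into contiguous (start_offset, duration) pairs.

    Each chunk starts where the previous one ended, which mirrors the rule
    that the second call uses the first call's duration as its start offset.
    """
    if total_duration <= 0 or n_chunks <= 0:
        raise ValueError("total_duration and n_chunks must be positive")
    if total_duration > MAX_DURATION:
        raise ValueError("requested more data than recommended")
    base, remainder = divmod(total_duration, n_chunks)
    chunks = []
    offset = 0
    for i in range(n_chunks):
        # Spread any remainder over the first chunks, one second each.
        duration = base + (1 if i < remainder else 0)
        if duration == 0:
            continue
        chunks.append((offset, duration))
        offset += duration
    return chunks


def build_command(index, start_offset, duration, dataset=1, seed=42):
    """Build the argument list for one chunk (output file names are hypothetical)."""
    return [
        "./generate_data.py",
        "--dataset", str(dataset),
        "--output-injection-file", f"injections_{index}.hdf",
        "--output-foreground-file", f"foreground_{index}.hdf",
        "--output-background-file", f"background_{index}.hdf",
        "--seed", str(seed),  # assumption: same seed for every chunk
        "--start-offset", str(start_offset),
        "--duration", str(duration),
        "--verbose",
    ]


if __name__ == "__main__":
    for idx, (start, length) in enumerate(plan_chunks(32000, 4)):
        cmd = build_command(idx, start, length)
        print(" ".join(shlex.quote(part) for part in cmd))
```

With 32000 seconds split four ways, the printed commands use start offsets 0, 8000, 16000, and 24000, each with a duration of 8000.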

Additionally, you may want to generate injections only once and use them for multiple data sets. In this case you can omit the option --output-injection-file and instead set the option --injection-file to the path of the injection file you want to use. The file passed to --injection-file is expected to be in the format written by --output-injection-file.
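
As a rough illustration of reusing one injection file, the sketch below builds one command per run that passes --injection-file instead of --output-injection-file. It is a sketch under assumptions: the helper name reuse_command, the run labels, and the output file names are hypothetical, and whether a given injection file is suitable for a particular --dataset is not checked here. The commands are only printed, not executed.

```python
import shlex


def reuse_command(label, dataset, injection_file, seed=42, duration=32000):
    """Build a generate_data.py call that reads existing injections.

    label is a hypothetical tag used only to name the output files.
    """
    return [
        "./generate_data.py",
        "--dataset", str(dataset),
        "--injection-file", injection_file,  # read injections, do not write them
        "--output-foreground-file", f"foreground_{label}.hdf",
        "--output-background-file", f"background_{label}.hdf",
        "--seed", str(seed),
        "--start-offset", "0",
        "--duration", str(duration),
        "--verbose",
    ]


if __name__ == "__main__":
    # Hypothetical runs that share the same injections.hdf.
    for label, dataset in [("run_a", 1), ("run_b", 2)]:
        cmd = reuse_command(label, dataset, "injections.hdf")
        print(" ".join(shlex.quote(part) for part in cmd))
```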

For further options and a description of them please refer to

./generate_data.py -h

Note that the code will download a file called segments.csv. This file contains information on which GPS times to use for data generation. Irrespective of the data set specified by --dataset, data will be generated in these segments.

ATTENTION! If you specify --dataset 4, the code will start to download a large (~94 GB) file containing real noise downsampled to 2 kHz. You can interrupt this download at any time and it will resume where it left off. However, the code cannot generate any data for data set 4 until this file has been downloaded completely. You can also download the file directly via

python -c "from generate_data import download_data; download_data()"

or from the URL https://www.atlas.aei.uni-hannover.de/work/marlin.schaefer/MDC/real_noise_file.hdf.

For more control over the data generating process the functions from the script can be called directly. We consider this advanced usage and do not document it beyond the comments in the code.

evaluate.py

This script computes the false-alarm rate (FAR) and the sensitivity of the search algorithm. As input it requires the file containing the injections, the file containing the foreground input data, and the event files returned by the search algorithm when applied to the foreground and background data. It writes an HDF5 file containing many different datasets. The most important of these are labeled far and sensitive-distance. They have the same length, and each value of sensitive-distance corresponds to the far value at the same index. To plot them, they have to be sorted by the far values. An example call to the script is

./evaluate.py \
--injection-file injections.hdf \
--foreground-events <path to output of algorithm on foreground data> \
--foreground-files foreground.hdf \
--background-events <path to output of algorithm on background data> \
--output-file eval-output.hdf \
--verbose

The options mean the following:

  • The option --injection-file specifies the injections that were used to create the foreground data. It corresponds to the output of generate_data.py --output-injection-file or the path given to generate_data.py --injection-file.
  • The option --foreground-events specifies the output of the search algorithm that was obtained using the foreground file returned by generate_data.py --output-foreground-file. For details on the structure of these files please refer to this page. Multiple paths may be provided if the input data was split into multiple parts.
  • The option --foreground-files specifies the foreground data that was used as input to the algorithm. This file is only used to determine which injections were actually contained in the foreground data and how much data was analyzed. It has to be the file created by generate_data.py --output-foreground-file. Multiple paths may be provided if the input data was split into multiple parts.
  • The option --background-events specifies the output of the search algorithm that was obtained using the background file returned by generate_data.py --output-background-file. For details on the structure of these files please refer to this page. Multiple paths may be provided if the input data was split into multiple parts.
  • The option --output-file specifies where the analysis output should be stored.
  • The option --verbose tells the script to print status updates.
  • An option --force exists to allow the code to overwrite existing files.
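
The sorting step needed before plotting far against sensitive-distance can be sketched as follows. This is a minimal sketch, not the repository's own code: the helper names are hypothetical, it assumes both datasets sit at the top level of the output file under exactly the names far and sensitive-distance, and it assumes h5py is installed (it is imported only when a file is actually read).

```python
def sort_by_far(far, sensitive_distance):
    """Return both sequences reordered by ascending far value."""
    if len(far) != len(sensitive_distance):
        raise ValueError("far and sensitive-distance must have the same length")
    # Indices that put the far values into ascending order.
    order = sorted(range(len(far)), key=lambda i: far[i])
    sorted_far = [far[i] for i in order]
    sorted_distance = [sensitive_distance[i] for i in order]
    return sorted_far, sorted_distance


def load_far_and_distance(path):
    """Read the two datasets from an evaluate.py output file.

    Assumes top-level datasets named 'far' and 'sensitive-distance'.
    """
    import h5py  # imported lazily so sort_by_far works without h5py

    with h5py.File(path, "r") as fp:
        far = fp["far"][()].tolist()
        distance = fp["sensitive-distance"][()].tolist()
    return far, distance


if __name__ == "__main__":
    # Small made-up values, used only to show the reordering.
    far, distance = sort_by_far([3.0, 1.0, 2.0], [10.0, 30.0, 20.0])
    print(far, distance)
```

For a real file, the two helpers would be combined as sort_by_far(*load_far_and_distance("eval-output.hdf")), and the sorted lists can then be passed to any plotting library.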