Adding a Subset or Combined Cohort to the daily MSK DMP cron job

Last update: Angelica Ochoa [email protected], 06/29/2018

Portal Importer Configuration Spreadsheet Setup

  1. Add a cancer study row and a portal name column to the portal importer configuration spreadsheet cancer_studies tab for the new subset cohort.
  2. Add a row to the portal importer configuration spreadsheet portals tab for the new portal column added in Step 1. Feel free to copy and paste the values from one of the other subset or clinical cohorts; the values are not used for our specific purposes, but they must be populated for the sake of the Admin tool GData/Config bean setup.
  3. (OPTIONAL) Update the priority_studies property with the new cancer study identifier in:
  • $PORTAL_CONFIG_HOME/properties/mskcc/portal.properties
  • $PORTAL_CONFIG_HOME/properties/private-beta/portal.properties

Adding the new portal column to the importer code is necessary if an email is to be sent after the subset or combined cohort study is updated: add the portal column name defined in Step 1 above to the CancerStudyMetadata class, and be sure to add the new portal column key variable to the set of column keys called MSK_PORTAL_COLUMN_KEY_SET.

Setting up

Some steps to complete before making any code changes (an illustrative sketch follows the list):

  1. Add subset/combined cohort study and import trigger path to automation-environment.sh
  2. Add a new notification filename/import status flag to import-dmp-impact-data.sh
  3. Add flags for indicating subset or merge status to fetch-dmp-data-for-import.sh
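
A minimal sketch of what these additions might look like, assuming a hypothetical "kings" affiliate cohort; the variable names and paths below are illustrative only and should follow the conventions already used in each script:
# automation-environment.sh -- cohort data home and import trigger path (illustrative)
export MSK_KINGS_DATA_HOME="$PORTAL_DATA_HOME/msk-impact/kingscounty"
export MSK_KINGS_IMPORT_TRIGGER="/data/portal-cron/msk-kings-import-trigger"

# import-dmp-impact-data.sh -- notification file and import status flag (illustrative)
kings_notification_file="$(mktemp $JAVA_TMPDIR/kings-portal-update-notification.XXXXXX)"
IMPORT_FAIL_KINGS=0

# fetch-dmp-data-for-import.sh -- flag indicating subset/merge status (illustrative)
KINGS_SUBSET_FAIL=0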

Subsetting & Merging basics

For the merge/subset script to work, meta files must exist for each datatype. As of December 18, 2017 we do not support the import of data_SV.txt and therefore do not have corresponding meta_SV.txt files checked into Mercurial. However, we do update this datafile daily with the CVR pipeline and want to include this datatype when subsetting data. This requires a simple touch of the missing meta_SV.txt files in every source data directory needed for the subset or merge. Once we officially start supporting data_SV.txt data in the portal, we will no longer need to touch these files before subsetting or merging data, as they will already be checked into Mercurial.

There are several examples of this in import-dmp-impact-data.sh. To reduce redundancy, we have grouped together MIXEDPACT subsets by affiliate institutes such that the missing meta files we need are touched only once and then removed after the subsets complete.
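
As a rough sketch (the source-directory variables below are illustrative; touch the missing meta file in whichever source data directories the subset or merge actually reads from):
# create placeholder meta_SV.txt files so every datatype has a meta file (illustrative paths)
touch $MSK_IMPACT_DATA_HOME/meta_SV.txt
touch $MSK_HEMEPACT_DATA_HOME/meta_SV.txt

# ... run the subset/merge calls that read from these source directories ...

# clean up the placeholder meta files once all subsets/merges complete
rm -f $MSK_IMPACT_DATA_HOME/meta_SV.txt
rm -f $MSK_HEMEPACT_DATA_HOME/meta_SV.txt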

Subsetting a single study by a clinical attribute

  1. Call subset-impact-data.sh
bash $PORTAL_HOME/scripts/subset-impact-data.sh -i=<COHORT_STUDY_ID> -o=$<COHORT_DATA_HOME> -d=$<PATH_TO_SOURCE_STUDY_DATA> -f=<FILTER_CRITERIA> -s=$<PATH_TO_TEMP_SUBSET_FILENAME>

Example filter criteria: "INSTITUTE=Kings County Cancer Center"
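
For example, a call for a hypothetical Kings County affiliate cohort might look like the following (the study id, environment variables, and temp filename are illustrative only):
bash $PORTAL_HOME/scripts/subset-impact-data.sh -i=msk_kingscounty -o=$MSK_KINGS_DATA_HOME -d=$MSK_IMPACT_DATA_HOME -f="INSTITUTE=Kings County Cancer Center" -s=$JAVA_TMPDIR/kings_subset_samples.txt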

  2. Check the exit status of the subset script and touch an import trigger file if it was successful (fetch-dmp-data-for-import.sh)
if [ $? -gt 0 ]; then
    echo "<COHORT_NAME> subset failed! Study will not be updated in the portal."
    sendFailureMessageMskPipelineLogsSlack "<COHORT_NAME> subset"    
    <COHORT_NAME>_SUBSET_FAIL=1
else
    echo "<COHORT_NAME> subset successful!"
    addCancerTypeCaseLists $<COHORT_DATA_HOME> "<COHORT_STUDY_ID>" "data_clinical_sample.txt" "data_clinical_patient.txt"
    touch $<COHORT_IMPORT_TRIGGER>
fi

**Note:** Remove the touched meta file(s) after all subsets complete.

  3. Make a call to import the subset cohort as a temp study if the cohort import trigger exists. Check the exit status afterwards.
if [ $DB_VERSION_FAIL -eq 0 ] && [ -f $<COHORT_IMPORT_TRIGGER> ]; then
    echo "Importing <COHORT_STUDY_ID> study..."
    echo $(date)
    bash $PORTAL_HOME/scripts/import-temp-study.sh --study-id="<COHORT_STUDY_ID>" --temp-study-id="temporary_<COHORT_STUDY_ID>" --backup-study-id="yesterday_<COHORT_STUDY_ID>" --portal-name="<COHORT_PORTAL_COLUMN_NAME>" --study-path="$<COHORT_DATA_HOME>" --notification-file="$<COHORT_NOTIFICATION_FILENAME>" --tmp-directory="$JAVA_TMPDIR" --email-list="$email_list" --oncotree-version="${ONCOTREE_VERSION_TO_USE}" --importer-jar="$PORTAL_HOME/lib/msk-dmp-importer.jar" --transcript-overrides-source="mskcc"
    if [ $? -eq 0 ]; then
        <RESTART_AFTER_IMPORT_FLAG>=1 # update the appropriate flag for restarting the right Tomcat (ex: RESTART_AFTER_MSK_AFFILIATE_IMPORT=1)
        IMPORT_FAIL_<COHORT_NAME>=0
    fi
    rm $<COHORT_IMPORT_TRIGGER>
else
    if [ $DB_VERSION_FAIL -gt 0 ] ; then
        echo "Not importing <COHORT_NAME> - database version is not compatible"
    else
        echo "Not importing <COHORT_NAME> - something went wrong with subsetting clinical studies for <COHORT NAME>."
    fi
fi
  4. Commit or revert changes to the Mercurial repository
if [ $IMPORT_FAIL_<COHORT_NAME> -gt 0 ]; then
    sendFailureMessageMskPipelineLogsSlack "<COHORT_NAME> import"
    echo "<COHORT_NAME> subset and/or updates failed! Reverting data to last commit."
    cd $<COHORT_DATA_HOME> ; $HG_BINARY update -C ; find . -name "*.orig" -delete
else
    sendSuccessMessageMskPipelineLogsSlack "<COHORT_NAME>"
    echo "Committing <COHORT_NAME> data"
    cd $<COHORT_DATA_HOME> ; find . -name "*.orig" -delete ; $HG_BINARY add * ; $HG_BINARY commit -m "Latest <COHORT_NAME> dataset"
fi
  5. If a subset fails, send the appropriate email.
EMAIL_BODY="Failed to subset <COHORT_NAME> data. Subset study will not be updated."
if [ $<COHORT_NAME>_SUBSET_FAIL -gt 0 ]; then 
    echo -e "Sending email $EMAIL_BODY"
    echo -e "$EMAIL_BODY" | mail -s "<COHORT_NAME> Subset Failure: Study will not be updated." $email_list
fi

**Note:** An import failure email is already sent by the import-temp-study.sh script if the import fails, so there is no need to send another email here about import failures.

Subsetting multiple studies by a clinical attribute

  1. If subsetting multiple studies by a clinical attribute, then individual calls to generate-clinical-subset.py need to be made to generate a subset of sample ids for each study that we are subsetting from. Normally this step is executed by the subset-impact-data.sh script.

After each subset of sample ids is generated, the subsets must be merged together into a single line-delimited file of sample ids. For QC purposes, each file should be checked to confirm it is not empty before its sample ids are appended to the main subset file used when calling the merge script. If any of the subset files are empty, set <COHORT_NAME>_SUBSET_FAIL=1, as in the sketch below.
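
A minimal sketch of this QC step, assuming one sample-id subset file per source study (the filenames are illustrative; see subset-impact-data.sh for how generate-clinical-subset.py is normally invoked):
MAIN_SUBSET_FILE=$JAVA_TMPDIR/<cohort_name>_subset_samples.txt
rm -f $MAIN_SUBSET_FILE
# append each per-study subset of sample ids to the main subset file, failing if any file is empty
for subset_file in $JAVA_TMPDIR/<cohort_name>_impact_subset.txt $JAVA_TMPDIR/<cohort_name>_heme_subset.txt; do
    if [ ! -s $subset_file ]; then
        echo "Subset file $subset_file is empty! <COHORT_NAME> study will not be updated in the portal."
        <COHORT_NAME>_SUBSET_FAIL=1
    else
        cat $subset_file >> $MAIN_SUBSET_FILE
    fi
done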

  2. Call merge.py and check the exit status
if [ $<COHORT_NAME>_SUBSET_FAIL -eq 0 ]; then
    $PYTHON_BINARY $PORTAL_HOME/scripts/merge.py  -d $<COHORT_NAME>_DATA_HOME -i "<COHORT_STUDY_ID>" -m "true" -s $JAVA_TMPDIR/<cohort_name>_subset_samples.txt $<SOURCE_STUDY_DATA_HOME1> $<SOURCE_STUDY_DATA_HOME2> ... 
    if [ $? -gt 0 ]; then
        echo "<COHORT_NAME> subset failed! <COHORT_NAME> study will not be updated in the portal."
        sendFailureMessageMskPipelineLogsSlack "<COHORT_NAME> merge"
        <COHORT_NAME>_SUBSET_FAIL=1
    fi
fi
  3. Repeat steps 3-5 of the single-study section above (import, commit/revert, and failure email).