Adding a Subset or Combined Cohort to the daily MSK DMP cron job
Last update: Angelica Ochoa [email protected], 06/29/2018
- Add a cancer study row to the cancer_studies tab of the portal importer configuration spreadsheet for the new subset cohort, along with a portal name column for it.
- Add a row to the portals tab of the portal importer configuration spreadsheet for the new portal column added. Feel free to copy and paste the values from one of the other subset or clinical cohorts: the values will not be used for our specific purposes, but they are required to be populated for the sake of the Admin tool GData/Config bean setup.
- (OPTIONAL) Update the priority_studies property with the new cancer study identifier in:
$PORTAL_CONFIG_HOME/properties/mskcc/portal.properties
$PORTAL_CONFIG_HOME/properties/private-beta/portal.properties
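For illustration only, the updated property line might look like the following; the group name and delimiter syntax here are assumptions, so match whatever format the existing priority_studies value in those files already uses:
priority_studies=MSK Cohorts#mskimpact,<COHORT_STUDY_ID>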
Changes to knowledgesystems/pipelines
This step is necessary if an email is to be sent after the subset or combined cohort study is updated. Simply add the portal column name defined in Step 1 above to the CancerStudyMetadata class. Be sure to add the new portal column key variable to the set of column keys called MSK_PORTAL_COLUMN_KEY_SET.
Changes to knowledgesystems/cmo-pipelines
Some steps to complete before making any code changes (a sketch of these additions follows the list):
- Add subset/combined cohort study and import trigger path to automation-environment.sh
- Add a new notification filename/import status flag to import-dmp-impact-data.sh
- Add flags for indicating subset or merge status to fetch-dmp-data-for-import.sh
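A minimal sketch of what these additions might look like, using a hypothetical KINGSCOUNTY cohort; every variable name and path below is an assumption and should follow the naming conventions already present in each script:
# automation-environment.sh: data home and import trigger path for the new cohort (hypothetical names/paths)
export KINGSCOUNTY_DATA_HOME="$PORTAL_DATA_HOME/dmp/msk-kingscounty"
export KINGSCOUNTY_IMPORT_TRIGGER="/data/portal-cron/kingscounty-import-trigger"
# import-dmp-impact-data.sh: notification filename and import status flag (hypothetical names)
kingscounty_notification_file="$JAVA_TMPDIR/kingscounty-update-notification.txt"
IMPORT_FAIL_KINGSCOUNTY=0
# fetch-dmp-data-for-import.sh: flag indicating subset (or merge) status (hypothetical name)
KINGSCOUNTY_SUBSET_FAIL=0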
For the merge/subset script to work, meta files must exist for each datatype. As of December 18, 2017 we do not support the import of data_SV.txt and therefore do not have corresponding meta_SV.txt files checked into mercurial. However, we do update this datafile daily with the CVR pipeline and want to include this datatype while subsetting data. This requires a simple touch of the missing meta_SV.txt files in every source data directory we need for the subset or merge. When we officially start supporting data_SV.txt data in the portal, we will no longer require touching these files before subsetting or merging data as they will already be checked into mercurial.
There are several examples of this in import-dmp-impact-data.sh. To reduce redundancy, we have grouped together MIXEDPACT subsets by affiliate institutes such that the missing meta files we need are touched only once and then removed after the subsets complete.
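As an illustration of the touch-before/remove-after pattern, a hedged sketch follows; the source data directory placeholders use the same style as the rest of this page and do not name the actual directories in import-dmp-impact-data.sh:
# touch missing meta_SV.txt files so every datatype has a meta file for the subset/merge
touch $<SOURCE_STUDY_DATA_HOME1>/meta_SV.txt
touch $<SOURCE_STUDY_DATA_HOME2>/meta_SV.txt
# ... run the subset/merge calls for this group of cohorts ...
# remove the touched meta files after all subsets complete
rm $<SOURCE_STUDY_DATA_HOME1>/meta_SV.txt
rm $<SOURCE_STUDY_DATA_HOME2>/meta_SV.txt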
bash $PORTAL_HOME/scripts/subset-impact-data.sh -i=<COHORT_STUDY_ID> -o=$<COHORT_DATA_HOME> -d=$<PATH_TO_SOURCE_STUDY_DATA> -f=<FILTER_CRITERIA> -s=$<PATH_TO_TEMP_SUBSET_FILENAME>
Example filter criteria: "INSTITUTE=Kings County Cancer Center"
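For example, a hypothetical Kings County affiliate subset of MIXEDPACT might be invoked as follows; the study id, data home variables, and subset filename are illustrative assumptions only:
bash $PORTAL_HOME/scripts/subset-impact-data.sh -i=mixedpact_kingscounty -o=$KINGSCOUNTY_DATA_HOME -d=$MSK_MIXEDPACT_DATA_HOME -f="INSTITUTE=Kings County Cancer Center" -s=$JAVA_TMPDIR/kingscounty_subset_samples.txt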
- Check the exit status of the subset script and touch an import trigger file if successful (fetch-dmp-data-for-import.sh)
if [ $? -gt 0 ]; then
    echo "<COHORT_NAME> subset failed! Study will not be updated in the portal."
    sendFailureMessageMskPipelineLogsSlack "<COHORT_NAME> subset"
    <COHORT_NAME>_SUBSET_FAIL=1
else
    echo "<COHORT_NAME> subset successful!"
    addCancerTypeCaseLists $<COHORT_DATA_HOME> "<COHORT_STUDY_ID>" "data_clinical_sample.txt" "data_clinical_patient.txt"
    touch $<COHORT_IMPORT_TRIGGER>
fi
** Note ** Remove the touched meta file(s) after all subsets complete.
- Make a call to import the subset cohort as a temp study if the cohort import trigger exists. Check the exit status afterward.
if [ $DB_VERSION_FAIL -eq 0 ] && [ -f $<COHORT_IMPORT_TRIGGER> ]; then
    echo "Importing <COHORT_STUDY_ID> study..."
    echo $(date)
    bash $PORTAL_HOME/scripts/import-temp-study.sh --study-id="<COHORT_STUDY_ID>" --temp-study-id="temporary_<COHORT_STUDY_ID>" --backup-study-id="yesterday_<COHORT_STUDY_ID>" --portal-name="<COHORT_PORTAL_COLUMN_NAME>" --study-path="$<COHORT_DATA_HOME>" --notification-file="$<COHORT_NOTIFICATION_FILENAME>" --tmp-directory="$JAVA_TMPDIR" --email-list="$email_list" --oncotree-version="${ONCOTREE_VERSION_TO_USE}" --importer-jar="$PORTAL_HOME/lib/msk-dmp-importer.jar" --transcript-overrides-source="mskcc"
    if [ $? -eq 0 ]; then
        # UPDATE THE APPROPRIATE FLAG FOR RESTARTING THE RIGHT TOMCAT (ex: RESTART_AFTER_MSK_AFFILIATE_IMPORT=1)
        <RESTART_AFTER_IMPORT_FLAG>=1
        IMPORT_FAIL_<COHORT_NAME>=0
    fi
    rm $<COHORT_IMPORT_TRIGGER>
else
    if [ $DB_VERSION_FAIL -gt 0 ] ; then
        echo "Not importing <COHORT_NAME> - database version is not compatible"
    else
        echo "Not importing <COHORT_NAME> - something went wrong with subsetting clinical studies for <COHORT_NAME>."
    fi
fi
- Commit or revert changes to the mercurial repository
if [ $IMPORT_FAIL_<COHORT_NAME> -gt 0 ]; then
    sendFailureMessageMskPipelineLogsSlack "<COHORT_NAME> import"
    echo "<COHORT_NAME> subset and/or updates failed! Reverting data to last commit."
    cd $<COHORT_DATA_HOME> ; $HG_BINARY update -C ; find . -name "*.orig" -delete
else
    sendSuccessMessageMskPipelineLogsSlack "<COHORT_NAME>"
    echo "Committing <COHORT_NAME> data"
    cd $<COHORT_DATA_HOME> ; find . -name "*.orig" -delete ; $HG_BINARY add * ; $HG_BINARY commit -m "Latest <COHORT_NAME> dataset"
fi
- If a subset fails, send the appropriate email.
EMAIL_BODY="Failed to subset <COHORT_NAME> data. Subset study will not be updated."
if [ $<COHORT_NAME>_SUBSET_FAIL -gt 0 ]; then
    echo -e "Sending email $EMAIL_BODY"
    echo -e "$EMAIL_BODY" | mail -s "<COHORT_NAME> Subset Failure: Study will not be updated." $email_list
fi
** Note ** The import-temp-study.sh script already sends a failure email if the import fails, so there is no need to send another email here about import failures.
- If subsetting multiple studies by a clinical attribute, individual calls to generate-clinical-subset.py will need to be made to generate a subset of sample ids for each study that we are subsetting from. Normally this step is executed by the subset-impact-data.sh script.
After each subset of sample ids is generated, they must be merged together into a single line-delimited file of sample ids. For QC purposes each file should be checked for emptiness before appending its list of sample ids to the main subset file that will be used when calling the merge script. If any of the subset files are empty then set <COHORT_NAME>_SUBSET_FAIL=1 (see the sketch below).
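A hedged sketch of that loop, with the generate-clinical-subset.py arguments elided (they mirror what subset-impact-data.sh normally passes) and the per-study subset filenames assumed for illustration:
MAIN_SUBSET_FILE=$JAVA_TMPDIR/<cohort_name>_subset_samples.txt
rm -f $MAIN_SUBSET_FILE
for source_study in $<SOURCE_STUDY_DATA_HOME1> $<SOURCE_STUDY_DATA_HOME2>; do
    study_subset_file=$JAVA_TMPDIR/$(basename $source_study)_subset_samples.txt
    # generate the subset of sample ids for this source study
    # (arguments elided - pass the same arguments subset-impact-data.sh would normally pass)
    $PYTHON_BINARY $PORTAL_HOME/scripts/generate-clinical-subset.py ...
    # QC: only append non-empty per-study subsets to the main subset file
    if [ -s $study_subset_file ]; then
        cat $study_subset_file >> $MAIN_SUBSET_FILE
    else
        echo "Empty subset file for $source_study - <COHORT_NAME> study will not be updated."
        <COHORT_NAME>_SUBSET_FAIL=1
    fi
done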
- Call merge.py and check exit status
if [ $<COHORT_NAME>_SUBSET_FAIL -eq 0 ]; then
    $PYTHON_BINARY $PORTAL_HOME/scripts/merge.py -d $<COHORT_NAME>_DATA_HOME -i "<COHORT_STUDY_ID>" -m "true" -s $JAVA_TMPDIR/<cohort_name>_subset_samples.txt $<SOURCE_STUDY_DATA_HOME1> $<SOURCE_STUDY_DATA_HOME2> ...
    if [ $? -gt 0 ]; then
        echo "<COHORT_NAME> subset failed! <COHORT_NAME> study will not be updated in the portal."
        sendFailureMessageMskPipelineLogsSlack "<COHORT_NAME> merge"
        <COHORT_NAME>_SUBSET_FAIL=1
    fi
fi
- Repeat steps 3-5 above.