
Feature hpf training api #11

Open

IntegralProgrammer wants to merge 5 commits into develop

Conversation

IntegralProgrammer (Member)

This PR gives the NCR API the ability to receive training jobs submitted via HTTP (with a sample web interface) and execute them on HPF. This works through the following steps (illustrative sketches for each step follow the list):

  1. Training jobs are submitted to the NCR API. Each submission consists of:

    • An OBO format ontology file
    • A chosen root for the OBO file
    • The name assigned to the NCR model upon training completion
  2. Once the user submits a training job, it is sent to a WebDAV server specified by environment variables. Specifically, the WebDAV server receives:

    • The OBO ontology file, under OBO_WEBDAV_URL/<jobID>.obo
    • A JSON object describing the training arguments, job ID, and chosen name for the training session, under QSUB_WEBDAV_URL/<jobID>.json
    • A file containing the word READY, under READY_WEBDAV_URL/<jobID>
  3. The HPF node periodically executes (possibly via crontab) python generate_qsub_job.py ~/qsub ~/uploaded_obo. This checks READY_WEBDAV_URL for training jobs that need to be executed. If any job IDs are available, the corresponding OBO and JSON files are downloaded from OBO_WEBDAV_URL/<jobID>.obo and QSUB_WEBDAV_URL/<jobID>.json respectively, and a job is submitted via qsub to run the training task.

  4. As the training task executes on HPF, progress log messages can be sent to LOGGING_WEBDAV_URL/<jobID>_<messageID> and then viewed by the submitting client by visiting the /log/<jobID> API endpoint.

  5. When training completes, the trained model is uploaded to the WebDAV server under OUTPUT_WEBDAV_URL/<jobID>_config.json, OUTPUT_WEBDAV_URL/<jobID>_ncr_weights.h5, and OUTPUT_WEBDAV_URL/<jobID>_onto.json.

  6. If all goes well, the name assigned to the model at the start of the training is written to COMPLETE_WEBDAV_URL/JOBCOMPLETE_<jobID>. If there is a failure, the string FAILED is written to FAILED_WEBDAV_URL/JOBFAIL_<jobID>.

  7. When the /models API endpoint is hit, the WebDAV server is queried for completed model training jobs. Any completed models are downloaded and added to the NCR API's model list for use with the /match/ and /annotate/ text analysis methods.
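As a concrete illustration of step 1, here is a minimal sketch of a client submitting a training job over HTTP. The endpoint path and the form field names (`/train/`, `obofile`, `root`, `name`) are assumptions for illustration, not the API's confirmed interface:

```python
import requests

# Hypothetical training job submission (step 1). Endpoint path and field
# names are assumptions; only the three submission components (OBO file,
# root, model name) come from the PR description.
with open("hp.obo", "rb") as obo_file:
    response = requests.post(
        "http://localhost:5000/train/",   # assumed NCR API base URL + endpoint
        files={"obofile": obo_file},      # the OBO format ontology file
        data={
            "root": "HP:0000118",         # chosen root for the OBO file
            "name": "my_hpo_model",       # name assigned on training completion
        },
    )
print(response.status_code, response.text)
```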
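Step 2 then amounts to three WebDAV PUTs. A sketch, assuming basic-auth credentials and the JSON field names shown; only the URL environment variables come from this PR:

```python
import json
import os

import requests

job_id = "1234"  # assumed job ID format
auth = (os.environ["WEBDAV_USER"], os.environ["WEBDAV_PASS"])  # assumed auth scheme

# The OBO ontology file, under OBO_WEBDAV_URL/<jobID>.obo
with open("hp.obo", "rb") as f:
    requests.put(f"{os.environ['OBO_WEBDAV_URL']}/{job_id}.obo", data=f, auth=auth)

# A JSON object describing the training arguments, job ID, and chosen name,
# under QSUB_WEBDAV_URL/<jobID>.json (the field names here are assumptions)
job_spec = {"job_id": job_id, "root": "HP:0000118", "name": "my_hpo_model"}
requests.put(
    f"{os.environ['QSUB_WEBDAV_URL']}/{job_id}.json",
    data=json.dumps(job_spec),
    auth=auth,
)

# A file containing the word READY, under READY_WEBDAV_URL/<jobID>
requests.put(f"{os.environ['READY_WEBDAV_URL']}/{job_id}", data=b"READY", auth=auth)
```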
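For step 3, the polling command could be wired up with an ordinary crontab entry on the HPF node; the five-minute interval is an arbitrary choice:

```
# Assumed crontab entry: check for READY training jobs every 5 minutes
*/5 * * * * python generate_qsub_job.py ~/qsub ~/uploaded_obo
```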
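A submitting client can follow progress (step 4) by polling the log endpoint; the base URL, iteration count, and 30-second interval below are assumptions:

```python
import time

import requests

# Poll /log/<jobID> (step 4) for progress log messages that the HPF job
# has written to LOGGING_WEBDAV_URL/<jobID>_<messageID>.
job_id = "1234"
for _ in range(10):
    response = requests.get(f"http://localhost:5000/log/{job_id}")
    print(response.text)
    time.sleep(30)
```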
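The three output artifacts from step 5 can be fetched back with plain GETs; the local directory layout and the unauthenticated requests are assumptions:

```python
import os

import requests

# Download the trained model's three files (step 5) from OUTPUT_WEBDAV_URL.
job_id = "1234"
base = os.environ["OUTPUT_WEBDAV_URL"]
os.makedirs(f"models/{job_id}", exist_ok=True)  # assumed local layout
for suffix in ("config.json", "ncr_weights.h5", "onto.json"):
    response = requests.get(f"{base}/{job_id}_{suffix}")
    with open(f"models/{job_id}/{job_id}_{suffix}", "wb") as f:
        f.write(response.content)
```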
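Finally, the discovery in steps 6 and 7 can be done with a WebDAV PROPFIND listing of the completion collection. A sketch; the regex-based parsing is a simplification and authentication is omitted:

```python
import os
import re

import requests

# List COMPLETE_WEBDAV_URL (Depth: 1 = immediate children only) and pull job
# IDs out of the JOBCOMPLETE_<jobID> marker names written in step 6.
response = requests.request(
    "PROPFIND",
    os.environ["COMPLETE_WEBDAV_URL"],
    headers={"Depth": "1"},
)
completed = set(re.findall(r"JOBCOMPLETE_(\w+)", response.text))
for job_id in completed:
    print("completed training job:", job_id)  # now fetch its artifacts (step 5)
```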

…erver

These training jobs (running on HPF) write their trained models to the WebDAV server

The NCR Web Application checks the WebDAV server for completed model training jobs, then saves and imports them

Due to memory constraints, NCR models are only loaded from disk when needed. The models are deleted from memory once they are no longer needed
…e are in AUTOTEST mode. Instead, use only the provided models
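The memory-constrained loading described above might look roughly like this; load_model is a hypothetical stand-in for the actual NCR loading code, and annotate is an assumed method name:

```python
import gc

def load_model(model_dir):
    """Hypothetical stand-in for the actual NCR model loader."""
    raise NotImplementedError

def annotate_text(model_dir, text):
    # Load the model from disk only when it is needed...
    model = load_model(model_dir)
    try:
        return model.annotate(text)  # assumed method name
    finally:
        # ...and release it as soon as the request is served.
        del model
        gc.collect()
```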