
Feature hpf training api #11

Open

IntegralProgrammer wants to merge 5 commits into develop

Conversation

IntegralProgrammer (Member)

This PR gives the NCR API the ability to receive training jobs submitted via HTTP (with a sample web interface) and execute them on HPF. This works through the following steps (illustrative sketches for each step follow the list):

  1. Training jobs are submitted to the NCR API. Each submission consists of:

    • An OBO format ontology file
    • A chosen root for the OBO file
    • The name assigned to the NCR model upon training completion
  2. Once the user submits a training job, it is sent to a WebDAV server specified by environment variables. Specifically, the WebDAV server receives:

    • The OBO ontology file, under OBO_WEBDAV_URL/<jobID>.obo
    • A JSON object describing the training arguments, job ID, and chosen name for the training session, under QSUB_WEBDAV_URL/<jobID>.json
    • A file containing the word READY, under READY_WEBDAV_URL/<jobID>
  3. The HPF node periodically executes (possibly via crontab) python generate_qsub_job.py ~/qsub ~/uploaded_obo. This checks READY_WEBDAV_URL for training jobs that need to be executed. If any job IDs are available, the corresponding OBO and JSON files are downloaded from OBO_WEBDAV_URL/<jobID>.obo and QSUB_WEBDAV_URL/<jobID>.json respectively, and a job is submitted via qsub to run the training task.

  4. As the training task executes on HPF, progress log messages can be sent to LOGGING_WEBDAV_URL/<jobID>_<messageID> and then viewed by the submitting client by visiting the /log/<jobID> API endpoint.

  5. When training completes, the trained model is uploaded to the WebDAV server under OUTPUT_WEBDAV_URL/<jobID>_config.json, OUTPUT_WEBDAV_URL/<jobID>_ncr_weights.h5, and OUTPUT_WEBDAV_URL/<jobID>_onto.json.

  6. If all goes well, the name assigned to the model at the start of the training is written to COMPLETE_WEBDAV_URL/JOBCOMPLETE_<jobID>. If there is a failure, the string FAILED is written to FAILED_WEBDAV_URL/JOBFAIL_<jobID>.

  7. When the /models API endpoint is hit, the WebDAV server is queried for completed model training jobs. Any completed models are downloaded and added to the NCR API's model list for use with the /match/ and /annotate/ text analysis methods.
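As a concrete illustration of step 1, here is a minimal sketch of a client submitting a training job over HTTP. The endpoint path and the form field names (`/train/`, `obofile`, `root`, `name`) are assumptions for illustration, not the API's confirmed interface:

```python
import requests

# Hypothetical training job submission (step 1). Endpoint path and field
# names are assumptions; only the three submission components (OBO file,
# root, model name) come from the PR description.
with open("hp.obo", "rb") as obo_file:
    response = requests.post(
        "http://localhost:5000/train/",   # assumed NCR API base URL + endpoint
        files={"obofile": obo_file},      # the OBO format ontology file
        data={
            "root": "HP:0000118",         # chosen root for the OBO file
            "name": "my_hpo_model",       # name assigned on training completion
        },
    )
print(response.status_code, response.text)
```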
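Step 2 then amounts to three WebDAV PUTs. A sketch, assuming basic-auth credentials and the JSON field names shown; only the URL environment variables come from this PR:

```python
import json
import os

import requests

job_id = "1234"  # assumed job ID format
auth = (os.environ["WEBDAV_USER"], os.environ["WEBDAV_PASS"])  # assumed auth scheme

# The OBO ontology file, under OBO_WEBDAV_URL/<jobID>.obo
with open("hp.obo", "rb") as f:
    requests.put(f"{os.environ['OBO_WEBDAV_URL']}/{job_id}.obo", data=f, auth=auth)

# A JSON object describing the training arguments, job ID, and chosen name,
# under QSUB_WEBDAV_URL/<jobID>.json (the field names here are assumptions)
job_spec = {"job_id": job_id, "root": "HP:0000118", "name": "my_hpo_model"}
requests.put(
    f"{os.environ['QSUB_WEBDAV_URL']}/{job_id}.json",
    data=json.dumps(job_spec),
    auth=auth,
)

# A file containing the word READY, under READY_WEBDAV_URL/<jobID>
requests.put(f"{os.environ['READY_WEBDAV_URL']}/{job_id}", data=b"READY", auth=auth)
```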
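For step 3, the polling command could be wired up with an ordinary crontab entry on the HPF node; the five-minute interval is an arbitrary choice:

```
# Assumed crontab entry: check for READY training jobs every 5 minutes
*/5 * * * * python generate_qsub_job.py ~/qsub ~/uploaded_obo
```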
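A submitting client can follow progress (step 4) by polling the log endpoint; the base URL, iteration count, and 30-second interval below are assumptions:

```python
import time

import requests

# Poll /log/<jobID> (step 4) for progress log messages that the HPF job
# has written to LOGGING_WEBDAV_URL/<jobID>_<messageID>.
job_id = "1234"
for _ in range(10):
    response = requests.get(f"http://localhost:5000/log/{job_id}")
    print(response.text)
    time.sleep(30)
```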
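The three output artifacts from step 5 can be fetched back with plain GETs; the local directory layout and the unauthenticated requests are assumptions:

```python
import os

import requests

# Download the trained model's three files (step 5) from OUTPUT_WEBDAV_URL.
job_id = "1234"
base = os.environ["OUTPUT_WEBDAV_URL"]
os.makedirs(f"models/{job_id}", exist_ok=True)  # assumed local layout
for suffix in ("config.json", "ncr_weights.h5", "onto.json"):
    response = requests.get(f"{base}/{job_id}_{suffix}")
    with open(f"models/{job_id}/{job_id}_{suffix}", "wb") as f:
        f.write(response.content)
```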
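Finally, the discovery in steps 6 and 7 can be done with a WebDAV PROPFIND listing of the completion collection. A sketch; the regex-based parsing is a simplification and authentication is omitted:

```python
import os
import re

import requests

# List COMPLETE_WEBDAV_URL (Depth: 1 = immediate children only) and pull job
# IDs out of the JOBCOMPLETE_<jobID> marker names written in step 6.
response = requests.request(
    "PROPFIND",
    os.environ["COMPLETE_WEBDAV_URL"],
    headers={"Depth": "1"},
)
completed = set(re.findall(r"JOBCOMPLETE_(\w+)", response.text))
for job_id in completed:
    print("completed training job:", job_id)  # now fetch its artifacts (step 5)
```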

…erver

These training jobs (running on HPF) write their trained models to the WebDAV server

The NCR Web Application checks the WebDAV server for completed model training jobs, then saves and imports them

Due to memory constraints, NCR models are only loaded from disk when needed. The models are deleted from memory once they are no longer needed
…e are in AUTOTEST mode. Instead, use only the provided models
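The memory-constrained loading described above might look roughly like this; load_model is a hypothetical stand-in for the actual NCR loading code, and annotate is an assumed method name:

```python
import gc

def load_model(model_dir):
    """Hypothetical stand-in for the actual NCR model loader."""
    raise NotImplementedError

def annotate_text(model_dir, text):
    # Load the model from disk only when it is needed...
    model = load_model(model_dir)
    try:
        return model.annotate(text)  # assumed method name
    finally:
        # ...and release it as soon as the request is served.
        del model
        gc.collect()
```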