Named entity recognition (NER) is an important technique that promises to improve information classification and retrieval in biomedical natural language processing (NLP). However, existing approaches rely primarily on either laborious manual curation or feature engineering. Here we adopt deep learning techniques from NLP and repurpose the vast number of entity/free-text pairs available in BioSample to train a scalable NER model.
Key notebooks
Code | Usage |
---|---|
mergeEntities | Merge all the highly similar BioSample entities using cosine similarities |
deep_sra_train_and_test | Train an entity recognition model using SRA metadata with entity groupings from mergeEntities |
deep_sra_predict | Classify text entity using the trained NER model |
Parsing and merging of BioSample data | --- |
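As a rough illustration of what merging highly similar entities by cosine similarity can look like, here is a minimal sketch. It is not the code in mergeEntities: the function names, the toy vectors, the 0.9 threshold, and the single-linkage grouping are all assumptions made for this example.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def merge_entities(vectors, threshold=0.9):
    """Group entity names whose vectors have cosine similarity >= threshold.

    Groups are built by single linkage with a small union-find structure.
    The threshold is a hypothetical value, not one taken from the notebooks.
    """
    parent = {name: name for name in vectors}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(vectors, 2):
        if cosine(vectors[a], vectors[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in vectors:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Toy entity vectors, invented for illustration only.
toy_vectors = {
    "cell type": [1.0, 0.1, 0.0],
    "cell_type": [0.9, 0.2, 0.0],
    "tissue": [0.0, 1.0, 0.1],
    "age": [0.0, 0.0, 1.0],
}
merged = merge_entities(toy_vectors, threshold=0.9)
print(merged)
```

In this toy input only "cell type" and "cell_type" are similar enough to be merged; the other entities stay in groups of their own.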
Independent validation
Code | Usage |
---|---|
validationDataGenration.ipynb | Generate validation data for comparison against manual curation and MetaMap |
NER in batch | Predict entities for all possible sentence segments |
scoreAgainstManualCuration_entity_membership | Score predictions against manual curation |
Parse metamap data | |
Score against metamap | |
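To make "all possible sentence segments" concrete, the sketch below enumerates every contiguous run of whitespace-separated tokens and passes each one to a classifier. It is an illustration under assumptions, not the code in NER in batch: `sentence_segments`, `predict_in_batch`, the `max_len` option, and `toy_classify` (a lookup standing in for the trained model) are hypothetical names.

```python
def sentence_segments(text, max_len=None):
    """Return every contiguous token span of `text`, joined back into a string.

    `max_len` optionally caps the number of tokens per segment.
    """
    tokens = text.split()
    n = len(tokens)
    segments = []
    for start in range(n):
        stop = n if max_len is None else min(n, start + max_len)
        for end in range(start + 1, stop + 1):
            segments.append(" ".join(tokens[start:end]))
    return segments


def predict_in_batch(text, classify, max_len=None):
    """Pair each segment with the label returned by `classify`."""
    return [(seg, classify(seg)) for seg in sentence_segments(text, max_len)]


def toy_classify(segment):
    # Hypothetical stand-in for the trained NER model: a fixed lookup table.
    lookup = {"liver": "tissue", "HeLa cells": "cell line"}
    return lookup.get(segment)


segments = sentence_segments("HeLa cells from liver")
hits = [
    (seg, label)
    for seg, label in predict_in_batch("HeLa cells from liver", toy_classify)
    if label is not None
]
print(len(segments), hits)
```

A sentence of n tokens yields n(n+1)/2 segments, so capping the segment length with something like `max_len` keeps the batch size manageable for long text.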
Auxiliary notebooks that are probably unused or not critical to understanding the manuscript
Code | Usage |
---|---|
downloadFromPMC | Download the PubMed text |
train_pmc_word2vec.ipynb | Train a word2vec model on PubMed text; the manuscript ultimately used the pretrained model instead |
uploadToSynapse |
Please download the data from the following websites:
File name | Usage |
---|---|
allSRS.pickle.gz | all BioSample SRS annotations |
word vectors | spaCy word vector model |
meta data | BioSample-to-study mapping table |
pretrained models | Pretrained LSTM and spaCy word models |
Unused data location | Usage |
---|---|
https://www.synapse.org/#!Synapse:syn15661258 | all SRX annotations |
https://www.synapse.org/#!Synapse:syn16805240 | PubMed ID conversions |
Manuscript auxiliary data | Description |
---|---|
Machine annotated validation data | Example output of deep NER annotations from NER in batch |
Data curated using dataturk |
If you have Anaconda, install the relevant packages with the following commands:
conda env create -f environment.yml
source activate deep_nlp_cpu
This work is released under the MIT license.