MoleRec

Official implementation for our paper:

MoleRec: Combinatorial Drug Recommendation with Substructure-Aware Molecular Representation Learning

Nianzu Yang, Kaipeng Zeng, Qitian Wu, Junchi Yan* (* denotes correspondence)

Proceedings of the ACM Web Conference 2023 (TheWebConf (a.k.a. WWW) 2023)

News 🎉

MoleRec has been incorporated into the PyHealth package as a benchmark method for the combinatorial drug recommendation task! 👏

Folder Specification

  • data/ folder contains necessary data or scripts for generating data.
    • drug-atc.csv, ndc2atc_level4.csv, ndc2rxnorm_mapping.txt: mapping files for drug code transformation
    • atc2rxnorm.pkl: Maps ATC-4 codes to RxNorm codes, which are then used to query DrugBank.
    • idx2SMILES.pkl: A dictionary mapping each drug ID (we use the ATC-4 level code as the drug ID) to its SMILES string.
    • drug-DDI.csv: A file containing the drug-drug interaction (DDI) information, with drugs coded by CID. This file is large; you can download it from https://drive.google.com/file/d/1s3sHmz9ueVA8YAGTARY8jwrhRdRvVaXs/view?usp=sharing.
    • ddi_mask_H.pkl: A mask matrix encoding the relations between molecules and substructures. If drug molecule $i$ contains substructure $j$, the entry in the $j$-th column of the $i$-th row of the matrix is set to 1.
    • substructure_smiles.pkl: A list containing the SMILES strings of all the substructures.
    • ddi_mask_H.py: The Python script that generates ddi_mask_H.pkl and substructure_smiles.pkl.
    • processing.py: The Python script that generates voc_final.pkl, records_final.pkl, data_final.pkl and ddi_A_final.pkl.
  • src/ folder contains all the source code.
    • modules/: Code for model definition.
    • utils.py: Code for metric calculations and some data preparation.
    • training.py: Code for the functions used in training and evaluation.
    • main.py: Train or evaluate our MoleRec Model.

Remark: data/ only contains part of the data. See the Data Generation section for more details.
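To make the layout of the mask matrix concrete, here is a minimal sketch of how such a binary matrix could be built. It is not the code in ddi_mask_H.py: the function name `build_mask` and the toy SMILES strings are assumptions made for illustration, and the real file may store the matrix in a different container type.

```python
def build_mask(drug_substructures, substructures):
    """Return a binary matrix where row i is a drug and column j a substructure.

    Hypothetical helper for illustration only; not part of the repository.
    """
    # Column index of each substructure follows its position in the list.
    column = {smiles: j for j, smiles in enumerate(substructures)}
    mask = [[0] * len(substructures) for _ in drug_substructures]
    for i, parts in enumerate(drug_substructures):
        for smiles in parts:
            mask[i][column[smiles]] = 1
    return mask


# Toy data (made up, not taken from substructure_smiles.pkl).
substructures = ["CO", "c1ccccc1", "CC(=O)O"]
drug_substructures = [
    ["CO", "c1ccccc1"],       # drug 0 contains substructures 0 and 1
    ["CC(=O)O"],              # drug 1 contains substructure 2
    ["c1ccccc1", "CC(=O)O"],  # drug 2 contains substructures 1 and 2
]

mask = build_mask(drug_substructures, substructures)
for row in mask:
    print(row)
```

Each row has one entry per substructure, so the column order is tied to the order of the substructure list.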

Dependency

MoleRec.yml lists all the dependencies of MoleRec. To quickly set up an environment for our model, use the following command:

conda env create -f MoleRec.yml

Data Generation

Using the MIMIC-III dataset requires certification, so we are not permitted to provide the raw data here. To access MIMIC-III, first obtain the certification and then download the dataset from https://physionet.org/content/mimiciii/.

After downloading MIMIC-III, put the three CSV files PRESCRIPTIONS.csv, DIAGNOSES_ICD.csv and PROCEDURES_ICD.csv from the raw data into the data/ folder. Then generate the remaining files needed for training and evaluation (those not already provided in the data/ folder) with the following commands:

cd data
python processing.py
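Before running processing.py, it can help to confirm that the three raw files are in place. The following is a small optional check, not something the repository ships: the helper name `missing_raw_files` is hypothetical, and only the three file names come from this README.

```python
from pathlib import Path

# The three MIMIC-III files this README asks you to copy into data/.
REQUIRED_CSVS = ("PRESCRIPTIONS.csv", "DIAGNOSES_ICD.csv", "PROCEDURES_ICD.csv")


def missing_raw_files(data_dir):
    """Return the required CSV names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in REQUIRED_CSVS if not (data_dir / name).is_file()]


# Run from the repository root; adjust the path if your layout differs.
missing = missing_raw_files("data")
if missing:
    print("Missing raw files:", ", ".join(missing))
else:
    print("All raw MIMIC-III files found.")
```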

For an explanation of each output file, please refer to the SafeDrug repository. Note that in our paper we follow the same data processing procedure as SafeDrug after commit c7218d0.

If you want to re-generate ddi_mask_H.pkl and substructure_smiles.pkl, use the following command:

cd data
python ddi_mask_H.py

Note that the BRICS decomposition method generates substructures in a random order. Because ddi_mask_H.pkl and substructure_smiles.pkl are affected by this order, please re-train the model if you re-generate these two files. For convenience, we already provide our generated results in the data/ folder, which can be used directly for training and evaluation.
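Since a trained model is tied to one particular substructure order, one possible way to notice that a regenerated list no longer matches is to record an order-sensitive fingerprint of it. This is only a sketch under that assumption: `order_fingerprint` is a hypothetical helper, the repository does not perform this check, and the SMILES strings are toy values.

```python
import hashlib


def order_fingerprint(substructures):
    """Hash the list in order, so any reordering changes the digest."""
    digest = hashlib.sha256()
    for smiles in substructures:
        digest.update(smiles.encode("utf-8"))
        digest.update(b"\n")  # separator keeps ["ab", "c"] distinct from ["a", "bc"]
    return digest.hexdigest()


original = ["CO", "c1ccccc1", "CC(=O)O"]
regenerated = ["c1ccccc1", "CO", "CC(=O)O"]  # same fragments, different order

print(order_fingerprint(original) == order_fingerprint(list(original)))  # True
print(order_fingerprint(original) == order_fingerprint(regenerated))     # False
```

Storing the fingerprint next to a checkpoint would let you compare it against the list loaded at evaluation time.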

Run the Code

We provide two versions of our model, which learn the substructure representations using an embedding table and GNNs, respectively. To train or evaluate our model, first change your working directory:

cd src

Embedding Table Version

To train the model, use the following command:

python main.py --device ${device} --embedding --lr ${learning rate} --dp ${dropout rate} --dim ${dim} --target_ddi ${expected ddi} --coef ${coefficient of annealing weight} --epochs ${epochs}
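If you launch runs from a script, the placeholders in the training command can be filled in programmatically. The sketch below only assembles the command string from the flags shown in this README. The helper `train_command` is hypothetical, and the values passed to it are arbitrary placeholders, not recommended settings; the accepted format of `--device` is defined in main.py and is not stated here.

```python
import shlex


def train_command(device, lr, dp, dim, target_ddi, coef, epochs, embedding=True):
    """Build the training command line from the flags listed in this README."""
    args = ["python", "main.py", "--device", str(device)]
    if embedding:
        # Present for the embedding table version, omitted for the GNNs version.
        args.append("--embedding")
    args += [
        "--lr", str(lr),
        "--dp", str(dp),
        "--dim", str(dim),
        "--target_ddi", str(target_ddi),
        "--coef", str(coef),
        "--epochs", str(epochs),
    ]
    return shlex.join(args)


# Placeholder values chosen only to show the shape of the command.
cmd = train_command(device=0, lr=0.001, dp=0.5, dim=64,
                    target_ddi=0.06, coef=2.5, epochs=50)
print(cmd)
```

Passing `embedding=False` yields the same command without the `--embedding` flag.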

To evaluate a well-trained model, use the following command:

python main.py --Test --embedding --resume_path ${model_path}

We provide our well-trained model in the best_models/ folder. To evaluate it, use the following command:

python main.py --Test --embedding --resume_path ../best_models/embedding_table/MoleRec.model

GNNs Version

This version learns the substructure representation using GNNs, which is more powerful but has more parameters. You can use the following command to train the model:

python main.py --device ${device} --lr ${learning rate} --dp ${dropout rate} --dim ${dim} --target_ddi ${expected ddi} --coef ${coefficient of annealing weight} --epochs ${epochs}

To evaluate a well-trained model, use the following command:

python main.py --Test --resume_path ${model_path}

We also provide well-trained model weights for this version, which can be evaluated with:

python main.py --Test --resume_path ../best_models/GNN/MoleRec.model
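The two evaluation commands differ only in the `--embedding` flag and the checkpoint path, so the flag presumably has to match the version the checkpoint was trained with. A minimal sketch that keeps the two paired is shown below. The helper `eval_command` and the `PROVIDED_MODELS` mapping are hypothetical; the flags and paths are the ones given in this README.

```python
import shlex

# Checkpoint paths as given in this README, relative to src/.
PROVIDED_MODELS = {
    "embedding": "../best_models/embedding_table/MoleRec.model",
    "gnn": "../best_models/GNN/MoleRec.model",
}


def eval_command(version, resume_path=None):
    """Build the evaluation command for one of the two model versions."""
    if version not in PROVIDED_MODELS:
        raise ValueError(f"unknown version: {version!r}")
    args = ["python", "main.py", "--Test"]
    if version == "embedding":
        args.append("--embedding")
    # Fall back to the provided checkpoint when no path is given.
    args += ["--resume_path", resume_path or PROVIDED_MODELS[version]]
    return shlex.join(args)


for version in PROVIDED_MODELS:
    print(eval_command(version))
```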

Citation

If you find our work useful in your research, please consider citing:

@inproceedings{yang2023molerec,
  title={MoleRec: Combinatorial Drug Recommendation with Substructure-Aware Molecular Representation Learning},
  author={Yang, Nianzu and Zeng, Kaipeng and Wu, Qitian and Yan, Junchi},
  booktitle={Proceedings of the ACM Web Conference 2023},
  pages={4075--4085},
  year={2023}
}

If you have any questions, feel free to contact us at [email protected] or [email protected].

Acknowledgement

We sincerely thank the GAMENet and SafeDrug repositories for their well-implemented pipelines, upon which we build our codebase.