
check existing kaggle models #4

Open
kaichop opened this issue Jun 11, 2024 · 5 comments
kaichop commented Jun 11, 2024

Some people have shared their code using XGBoost and random forest for the prediction. We can borrow that code both for data processing and for prediction, so we do not need to start from scratch. Compile the information in this issue.

This pinned example https://www.kaggle.com/code/andrewdblevins/leash-tutorial-ecfps-and-random-forest is one we can reproduce to learn how to use parquet to process the data, and how to use 42 as the random seed so that training/test splits stay consistent across different models in the future.
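A minimal sketch of the seed-42 pattern the pinned example uses: fix one seed for both the train/test split and the random forest, so any model evaluated later sees the identical split. The feature and label columns below are synthetic stand-ins (the real data would come from `pd.read_parquet` on the competition files; column names here are illustrative, not the actual schema).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

SEED = 42  # fixed seed shared by the split and the model for reproducibility

# Tiny synthetic stand-in for the competition data; in the real notebook this
# would be: df = pd.read_parquet("train.parquet")
rng = np.random.default_rng(SEED)
df = pd.DataFrame({f"feat_{i}": rng.random(200) for i in range(8)})
df["binds"] = rng.integers(0, 2, 200)

X = df.drop(columns="binds")
y = df["binds"]

# Same random_state every time -> same train/test rows across future models.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED
)

clf = RandomForestClassifier(n_estimators=50, random_state=SEED, n_jobs=-1)
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
```

Because `random_state=SEED` pins the split, rerunning the cell (or a different model on the same split) compares against exactly the same held-out rows.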

wangwpi self-assigned this Jun 11, 2024

wangwpi commented Jun 11, 2024

I'm going to check the pinned example (random forest) and the BERT fine-tuning model on Kaggle.


wangwpi commented Jun 11, 2024

The pinned example (random forest) has been reproduced; its public score is 0.263. The Jupyter notebook has been uploaded as models/Leash_Tutorial_test.ipynb in this repository.


kaichop commented Jun 11, 2024 via email


wangwpi commented Jun 12, 2024 via email


wangwpi commented Jun 14, 2024

I have uploaded my notebook for BERT fine-tuning (using 60,000 samples), and a current neural network model using all of the split data (230M training, 56M validation). The Morgan fingerprints for all split data are generated in chunks (500K per chunk) and saved as numpy array files.
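A sketch of the chunked fingerprint export described above, assuming a per-molecule fingerprint function and one .npy file per 500K-molecule chunk so the full 230M-row set never has to sit in memory at once. The `fingerprint` helper here is a hypothetical placeholder; the real notebook would compute Morgan fingerprints (e.g. via RDKit) instead of the dummy bit vector used below.

```python
import numpy as np

CHUNK_SIZE = 500_000  # the issue mentions 500K molecules per chunk

def fingerprint(smiles: str) -> np.ndarray:
    """Placeholder for a real Morgan fingerprint (RDKit would be used in
    practice); returns a deterministic-width dummy 2048-bit vector."""
    rng = np.random.default_rng(abs(hash(smiles)) % (2**32))
    return rng.integers(0, 2, size=2048, dtype=np.uint8)

def save_fingerprints_in_chunks(smiles_list, prefix="fp_chunk"):
    """Fingerprint molecules chunk by chunk, saving each chunk as its own
    .npy file and returning the list of file paths written."""
    paths = []
    for start in range(0, len(smiles_list), CHUNK_SIZE):
        chunk = smiles_list[start:start + CHUNK_SIZE]
        arr = np.stack([fingerprint(s) for s in chunk])  # (n, 2048) uint8
        path = f"{prefix}_{start // CHUNK_SIZE:04d}.npy"
        np.save(path, arr)
        paths.append(path)
    return paths
```

Downstream training can then stream the chunks with `np.load` one file at a time instead of loading all fingerprints at once.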
