
check existing kaggle models #4

Open
kaichop opened this issue Jun 11, 2024 · 5 comments
kaichop commented Jun 11, 2024

Some people have shared their code using XGBoost and random forest for the prediction. We can borrow that code both for data processing and for prediction, so we do not need to start from scratch. Compile the information in this issue.

This pinned example https://www.kaggle.com/code/andrewdblevins/leash-tutorial-ecfps-and-random-forest is one we can reproduce to learn how to use parquet to process the data, and how to use 42 as the random seed so that training/test splits stay consistent across different models in the future.
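A minimal sketch of the seed-42 pattern the pinned example uses: fix one seed for both the train/test split and the random forest, so any model evaluated later sees the identical split. The feature and label columns below are synthetic stand-ins (the real data would come from `pd.read_parquet` on the competition files; column names here are illustrative, not the actual schema).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

SEED = 42  # fixed seed shared by the split and the model for reproducibility

# Tiny synthetic stand-in for the competition data; in the real notebook this
# would be: df = pd.read_parquet("train.parquet")
rng = np.random.default_rng(SEED)
df = pd.DataFrame({f"feat_{i}": rng.random(200) for i in range(8)})
df["binds"] = rng.integers(0, 2, 200)

X = df.drop(columns="binds")
y = df["binds"]

# Same random_state every time -> same train/test rows across future models.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED
)

clf = RandomForestClassifier(n_estimators=50, random_state=SEED, n_jobs=-1)
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
```

Because `random_state=SEED` pins the split, rerunning the cell (or a different model on the same split) compares against exactly the same held-out rows.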

wangwpi self-assigned this Jun 11, 2024

wangwpi commented Jun 11, 2024

I'm going to check the pinned example (random forest) and the BERT fine-tuning model on Kaggle.


wangwpi commented Jun 11, 2024

The pinned example (random forest) has been reproduced; its public score is 0.263. The Jupyter notebook has been uploaded as models/Leash_Tutorial_test.ipynb in this repository.


kaichop commented Jun 11, 2024 via email


wangwpi commented Jun 12, 2024 via email


wangwpi commented Jun 14, 2024

I have uploaded my notebook for BERT fine-tuning (using 60,000 samples), and a current neural network model using all of the split data (230M training, 56M validation). The Morgan fingerprints for all split data are generated in chunks (500K per chunk) and saved as numpy array files.
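A sketch of the chunked fingerprint export described above, assuming a per-molecule fingerprint function and one .npy file per 500K-molecule chunk so the full 230M-row set never has to sit in memory at once. The `fingerprint` helper here is a hypothetical placeholder; the real notebook would compute Morgan fingerprints (e.g. via RDKit) instead of the dummy bit vector used below.

```python
import numpy as np

CHUNK_SIZE = 500_000  # the issue mentions 500K molecules per chunk

def fingerprint(smiles: str) -> np.ndarray:
    """Placeholder for a real Morgan fingerprint (RDKit would be used in
    practice); returns a deterministic-width dummy 2048-bit vector."""
    rng = np.random.default_rng(abs(hash(smiles)) % (2**32))
    return rng.integers(0, 2, size=2048, dtype=np.uint8)

def save_fingerprints_in_chunks(smiles_list, prefix="fp_chunk"):
    """Fingerprint molecules chunk by chunk, saving each chunk as its own
    .npy file and returning the list of file paths written."""
    paths = []
    for start in range(0, len(smiles_list), CHUNK_SIZE):
        chunk = smiles_list[start:start + CHUNK_SIZE]
        arr = np.stack([fingerprint(s) for s in chunk])  # (n, 2048) uint8
        path = f"{prefix}_{start // CHUNK_SIZE:04d}.npy"
        np.save(path, arr)
        paths.append(path)
    return paths
```

Downstream training can then stream the chunks with `np.load` one file at a time instead of loading all fingerprints at once.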
