Tanzania is the largest country in East Africa, with a population of approximately 60 million people. Of those 60 million, only 47% have access to basic water services, while the rest of the population has no choice but to drink dirty water from unsafe sources. As a result, 4000 children die each year from preventable diseases caused by unsafe water. Safe water is scarce, and women and children often have to spend two to seven hours collecting clean water (WaterAid, 2016). This is quite the predicament: water is a basic need and a right for all human beings. The purpose of this work is to answer the following questions:
- Can machine learning become a valuable addition to the Tanzanian government in battling water scarcity?
- What is the best way to predict the functional state of Tanzanian water pumps?
- Which data preparation algorithms improve the predictive capabilities of a machine learning algorithm on this dataset?
This project includes five notebooks:
- tanzania_dataset_analysis.ipynb: Extended analysis of the dataset
- tanzania_train_dataset_preprocessing.ipynb: Preprocessing of the training set
- tanzania_train_dataset_preprocessing.ipynb: Preprocessing of the test set based on the training set preprocessing
- tanzania_classifier_training.ipynb: Training and hyperparameter tuning of several machine learning classifiers
- tanzania_advanced_training.ipynb: Advanced ML training algorithms for dealing with imbalanced datasets
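Preprocessing the test set "based on the training set preprocessing" generally means that any statistic used to clean the data is learned from the training rows only and then reused unchanged on the test rows. The snippet below is a minimal sketch of that idea using median imputation. It is not taken from the notebooks: the helper functions, the toy rows, and the choice of median imputation are assumptions made for illustration.

```python
from statistics import median


def fit_median_imputer(train_rows, columns):
    """Learn one median per column from the training rows only."""
    medians = {}
    for col in columns:
        observed = [row[col] for row in train_rows if row[col] is not None]
        medians[col] = median(observed)
    return medians


def apply_median_imputer(rows, medians):
    """Fill missing values with medians that were learned on the training set."""
    return [
        {
            col: (medians[col] if val is None and col in medians else val)
            for col, val in row.items()
        }
        for row in rows
    ]


# Hypothetical toy rows; column names and values are illustrative only.
train = [
    {"gps_height": 1390, "population": 109},
    {"gps_height": None, "population": 280},
    {"gps_height": 686, "population": None},
]
test = [{"gps_height": None, "population": 50}]

# Fit on the training set, then apply the same medians to both sets.
medians = fit_median_imputer(train, ["gps_height", "population"])
train_clean = apply_median_imputer(train, medians)
test_clean = apply_median_imputer(test, medians)
```

Reusing the training medians on the test set keeps information from the test rows out of the preprocessing step, so the evaluation stays comparable to how the model would see new data.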
https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/