This project aimed to predict the type of dwelling unit a Palestinian family lives in, based on various features in a dataset collected from the Palestinian Expenditure and Consumption Survey in 2011 and 2014. The dataset was preprocessed to standardize missing values, unify currency, and remove irrelevant features. A total of 20 features were carefully selected through the use of Extra Trees Classifier and Decision Tree algorithm to determine feature importance. A Decision Tree model was built using these selected features and the accuracy was evaluated using both training and testing data. The end goal of this project was to accurately predict the type of dwelling unit a Palestinian family lives in and help to understand the factors that influence the type of housing they live in.
This project includes 3 code files for preprocessing the data, selecting features, and building a decision tree.
PreProcessing.R: This file contains the code for preprocessing the data, which includes:
- NULL Standardization.
- Features Merging.
- Currency Unification.
- Features Removal.
- Categorizing Numeric Data.
DecisionTree.R: This file contains the code for building the decision tree using the C50 package in R.
ExtraTreesClassifier.py: This file contains the code for applying the Extra Trees Classifier algorithm to determine the most important features, which are then used to build the decision tree.
For a comprehensive understanding of the project, refer to the Report.pdf file which includes a thorough explanation of the methodology, outcomes, and conclusions. The FeaturesDetails.xlsx file also provides a detailed description of all 840 features used in the project. Additionally, a sample from the dataset used in this project, the SefSec_2014_HH_weightNew.sav file, is also included in the repository.
- Clone or download the repository to your local machine.
- Ensure that you have all the prerequisites installed.
- Install R needed packages by running this code
install.packages("haven", "tidyverse", "Hmisc", "C50")
. - Install python needed packages by running the command
pip install sklearn pandas
. - Execute the code in PreProcessing.R line by line.
- Run the ExtraTreesClassifier.py code, and copy the best 30 features from the console output.
- In DecisionTree.R, update the
features
to the ones you copied. - Execute the code in DecisionTree.R line by line.