This project was a collaborative effort with two other students for our Machine Learning course.
In this project, we analyzed a version of the thyroid0387 dataset, a subset of the larger "Thyroid Disease dataset", which contains data from patients with thyroid disease, including demographic information, thyroid hormone levels, and diagnostic information.
With the dataset understood, our task was to investigate the models best suited to our objectives: to determine, with confidence, whether the diagnosis, age, and sex of a subject can be predicted from the remaining attributes, and to identify the most significant features in the best models obtained.
The project was implemented in Python using Jupyter Notebook.
All the functionalities were successfully implemented, and the project received a very high score.
In order to analyze the dataset in question, we first needed to preprocess its data (a minimal sketch follows the list below):
- Categorizing the classes present in the target;
- Deleting the record identification column;
- Replacing "?" values;
- Replacing column types;
- Parsing columns to int;
- Eliminating outliers associated with age;
- Eliminating the hypopituitary feature;
- Treating missing values in the TSH, T3, TT4, T4U, FTI, and TBG columns;
- One-hot encoding the referral source column;
- Eliminating NaNs;
- Eliminating duplicate rows.
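As a reference, here is a minimal pandas sketch of most of the steps above. The file name, the column names (record_id, referral_source), the age threshold, and the median imputation strategy are assumptions for illustration only; the actual names and choices in the project may differ, and the categorization of the target classes is omitted because it depends on the chosen diagnosis grouping.

```python
import numpy as np
import pandas as pd

# Load the thyroid0387 subset (file and column names assumed for illustration).
df = pd.read_csv("thyroid0387.csv")

# Delete the record identification column and the hypopituitary feature.
df = df.drop(columns=["record_id", "hypopituitary"], errors="ignore")

# Replace "?" placeholders with NaN so they are treated as missing values.
df = df.replace("?", np.nan)

# Replace boolean-like "t"/"f" columns with integers and parse numeric columns.
bool_cols = df.columns[df.isin(["t", "f"]).any()]
df[bool_cols] = df[bool_cols].replace({"t": 1, "f": 0}).astype("Int64")
hormone_cols = ["TSH", "T3", "TT4", "T4U", "FTI", "TBG"]
df[["age"] + hormone_cols] = df[["age"] + hormone_cols].apply(pd.to_numeric, errors="coerce")

# Eliminate outliers associated with age (threshold assumed for illustration).
df = df[df["age"].between(0, 100)]

# Treat missing values in the hormone columns (median imputation shown as one possibility).
df[hormone_cols] = df[hormone_cols].fillna(df[hormone_cols].median())

# One-hot encode the referral source column.
df = pd.get_dummies(df, columns=["referral_source"])

# Eliminate remaining NaNs and duplicate rows.
df = df.dropna().drop_duplicates()
```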
In order to carry out an extensive analysis of the different models for objectives O1 and O2, we performed a Forward Feature Selection up to the 8 best features for each model (a sketch is included after the list of models below). This gives a general view of each model with the features that benefit it most, and it is also justified because it reduces the number of features in the model, making it simpler. The models analyzed were:
- K-Nearest Neighbors (KNeighborsClassifier)
- Support Vector Classifier (SVC)
- Gaussian Naive Bayes (GaussianNB)
- Neural Network (MLPClassifier)
- Adaptive Boosting Ensemble (AdaBoostClassifier)
- Random Forest Ensemble (RandomForestClassifier)
- Decision Trees (DecisionTreeClassifier)
- eXtreme Gradient Boosting Ensemble (XGBClassifier)
- Random Forest (RandomForestRegressor)
- K-Nearest Neighbors (KNeighborsRegressor)
- Decision Trees (DecisionTreeRegressor)
- Linear Regression (LinearRegression)
- Linear Support Vector Regressor (LinearSVR)
- Ridge with Cross-Validation (RidgeCV)
- Elastic Net with Cross-Validation (ElasticNetCV)
- Lasso with Cross-Validation (LassoCV)
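As an illustration of the feature-selection step, the sketch below runs a forward selection up to 8 features for one of the classifiers above, using scikit-learn's SequentialFeatureSelector. The target column name ("diagnosis"), the train/test split, and the choice of Random Forest as the example model are assumptions; the project's actual loop over every model may differ.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import train_test_split

# X, y: preprocessed features and the categorized target ("diagnosis" column name assumed).
X = df.drop(columns=["diagnosis"])
y = df["diagnosis"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Forward selection: greedily add features until the 8 most beneficial ones are kept.
selector = SequentialFeatureSelector(
    RandomForestClassifier(random_state=42),
    n_features_to_select=8,
    direction="forward",
    cv=5,
)
selector.fit(X_train, y_train)
selected = X_train.columns[selector.get_support()]
print("Selected features:", list(selected))

# Evaluate the model trained on the selected subset.
model = RandomForestClassifier(random_state=42).fit(X_train[selected], y_train)
print("Test accuracy:", model.score(X_test[selected], y_test))
```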
Tuning was carried out for the 3 models chosen in the previous section, using a Grid Search to select the appropriate hyperparameters for each model (a minimal sketch follows).
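For reference, here is a minimal sketch of the tuning step with scikit-learn's GridSearchCV, shown for a Random Forest on the previously selected features; the hyperparameter grid is purely illustrative and not the grid used in the project.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative hyperparameter grid; the values actually tuned may differ.
param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
}

# Exhaustive search over the grid with 5-fold cross-validation.
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="f1_macro",
    n_jobs=-1,
)
grid.fit(X_train[selected], y_train)

print("Best hyperparameters:", grid.best_params_)
print("Best cross-validated score:", grid.best_score_)
```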