In this project, we build a multivariate logistic regression model to predict telecom customer churn. The dataset contains customer attributes, including demographics, services availed, and expenses. The goal is to predict whether a customer will churn, where 'Churn' is a binary variable: 1 denotes that the customer has churned, and 0 denotes that the customer has not.
S.No. | Variable Name | Meaning |
---|---|---|
1 | CustomerID | The unique ID of each customer |
2 | Gender | The gender of a person |
3 | SeniorCitizen | Whether a customer can be classified as a senior citizen |
4 | Partner | Whether a customer is married or in a live-in relationship |
5 | Dependents | Whether a customer has dependents (children or retired parents) |
6 | Tenure | The time for which a customer has been using the service |
7 | PhoneService | Whether a customer has a landline phone service along with the internet service |
8 | MultipleLines | Whether a customer has multiple lines of internet connectivity |
9 | InternetService | The type of internet services chosen by the customer |
10 | OnlineSecurity | Specifies if a customer has online security |
11 | OnlineBackup | Specifies if a customer has online backup |
12 | DeviceProtection | Specifies if a customer has opted for device protection |
13 | TechSupport | Whether a customer has opted for tech support or not |
14 | StreamingTV | Whether a customer has an option of TV streaming |
15 | StreamingMovies | Whether a customer has an option of Movie streaming |
16 | Contract | The type of contract a customer has chosen |
17 | PaperlessBilling | Whether a customer has opted for paperless billing |
18 | PaymentMethod | Specifies the method by which bills are paid |
19 | MonthlyCharges | Specifies the money paid by a customer each month |
20 | TotalCharges | The total money paid by the customer to the company |
21 | Churn | This is the target variable which specifies if a customer has churned or not |
The target variable, 'Churn', indicates whether a particular customer has churned. It is a binary variable:
- 1: the customer has churned
- 0: the customer has not churned
The objective is to develop a multivariate logistic regression model that predicts customer churn from historical data. Trained on past records, the model learns the relationships between customer attributes and the likelihood of churn.

Data Exploration:
- Explore and understand the dataset.
- Check for missing values, outliers, and data distribution.

Data Preprocessing:
- Handle missing values and outliers.
- Encode categorical variables.
- Scale or normalize numerical features.

Model Building:
- Split the dataset into training and testing sets.
- Build a multivariate logistic regression model using the training data.

Model Evaluation:
- Evaluate the model's performance on the testing set.
- Analyze key metrics such as accuracy, precision, recall, and F1 score.

Model Interpretation:
- Interpret the coefficients of the logistic regression model to understand the impact of each feature on the likelihood of churn.
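
For reference, the interpretation step rests on the standard form of the logistic regression model. The symbols below ($x_1, \dots, x_p$ for the encoded features, $\beta_j$ for the fitted coefficients) are notation introduced here for illustration and are not taken from the notebook.

$$
P(\text{Churn}=1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}},
\qquad
\log\frac{P(\text{Churn}=1 \mid x)}{1 - P(\text{Churn}=1 \mid x)} = \beta_0 + \sum_{j=1}^{p} \beta_j x_j
$$

Under this model, a one-unit increase in $x_j$ multiplies the odds of churn by $e^{\beta_j}$ when the other features are held fixed, so a positive coefficient raises the likelihood of churn and a negative one lowers it.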

Data Import and Merging:
- Imported necessary libraries, including Pandas and NumPy.
- Loaded multiple datasets related to telecom customer information.
- Merged the datasets using the 'customerID' as a common key.
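
The merge step can be sketched as follows. This is a minimal illustration that uses small in-memory frames standing in for the CSV files; the column subsets, the toy values, and the choice of an inner join are assumptions, not necessarily what the notebook does.

```python
import pandas as pd

# Tiny stand-ins for churn_data.csv, customer_data.csv and internet_data.csv
# (toy values; the real files have many more rows and columns).
churn = pd.DataFrame({"customerID": ["A1", "B2", "C3"],
                      "tenure": [1, 34, 2],
                      "Churn": ["No", "No", "Yes"]})
customer = pd.DataFrame({"customerID": ["A1", "B2", "C3"],
                         "gender": ["Female", "Male", "Male"]})
internet = pd.DataFrame({"customerID": ["A1", "B2", "C3"],
                         "InternetService": ["DSL", "DSL", "Fiber optic"]})

# Merge step by step on the shared key; an inner join is assumed here.
merged = pd.merge(churn, customer, how="inner", on="customerID")
merged = pd.merge(merged, internet, how="inner", on="customerID")

print(merged.shape)
```

With the real files, each frame would instead come from `pd.read_csv` on the corresponding CSV.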

Data Inspection and Preparation:
- Explored the head, dimensions, and statistical aspects of the merged dataset.
- Converted binary variables ('Yes/No') to numeric (0/1).
- Created dummy variables for categorical features using one-hot encoding.
- Handled missing values by removing observations with missing 'TotalCharges'.
- Checked the numeric variables ('tenure', 'MonthlyCharges', 'SeniorCitizen', 'TotalCharges') for outliers.
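
A minimal sketch of the preparation steps above, on a toy frame. The values are invented, and it is an assumption that missing 'TotalCharges' entries appear as blank strings; the notebook may detect them differently.

```python
import pandas as pd

df = pd.DataFrame({
    "PhoneService": ["Yes", "No", "Yes", "Yes"],
    "Churn": ["No", "Yes", "No", "Yes"],
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "TotalCharges": ["29.85", " ", "108.15", "1889.5"],  # blank string = missing (assumed)
})

# Map Yes/No columns to 1/0.
binary_cols = ["PhoneService", "Churn"]
df[binary_cols] = df[binary_cols].apply(lambda col: col.map({"Yes": 1, "No": 0}))

# One-hot encode a multi-level categorical column, dropping the first level
# so the dummy columns are not perfectly collinear.
dummies = pd.get_dummies(df["Contract"], prefix="Contract", drop_first=True, dtype=int)
df = pd.concat([df.drop(columns="Contract"), dummies], axis=1)

# Coerce TotalCharges to numeric; unparseable entries become NaN and are dropped.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df = df.dropna(subset=["TotalCharges"])

print(df)
```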

Train-Test Split:
- Split the dataset into training and testing sets using the `train_test_split` method.
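
A short sketch of the split with scikit-learn's `train_test_split`. The 70/30 ratio, the `random_state` value, and the synthetic data are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic features and target standing in for the prepared dataset.
rng = np.random.default_rng(0)
X = pd.DataFrame({"tenure": rng.integers(1, 72, size=100),
                  "MonthlyCharges": rng.uniform(20, 120, size=100)})
y = pd.Series(rng.integers(0, 2, size=100), name="Churn")

# Hold out 30% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=100)

print(X_train.shape, X_test.shape)
```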

Feature Scaling:
- Used StandardScaler to scale numerical features ('tenure', 'MonthlyCharges', 'TotalCharges').
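
The scaling step might look like the sketch below. The toy values are invented; fitting the scaler on the training set only and reusing it on the test set is a common practice assumed here so that no test-set statistics leak into training.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

num_cols = ["tenure", "MonthlyCharges", "TotalCharges"]

X_train = pd.DataFrame({
    "tenure": [1.0, 12.0, 24.0, 60.0],
    "MonthlyCharges": [29.85, 56.95, 70.70, 99.65],
    "TotalCharges": [29.85, 683.40, 1696.80, 5979.00],
    "PhoneService": [0, 1, 1, 1],  # binary column, left unscaled
})
X_test = pd.DataFrame({
    "tenure": [5.0, 48.0],
    "MonthlyCharges": [45.00, 89.10],
    "TotalCharges": [225.00, 4276.80],
    "PhoneService": [1, 0],
})

scaler = StandardScaler()
# Learn mean and standard deviation from the training data only.
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
# Apply the same transformation to the test data.
X_test[num_cols] = scaler.transform(X_test[num_cols])

print(X_train[num_cols].mean().round(6))
```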

Correlation Analysis:
- Explored the correlation between different features.
- Dropped highly correlated dummy variables to avoid multicollinearity.
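
One way to find and drop highly correlated dummy columns is sketched below. The column names, the toy values, and the 0.9 threshold are all assumptions for illustration; the notebook may have selected the columns to drop by inspecting a heatmap instead.

```python
import numpy as np
import pandas as pd

# Toy dummy columns: the first two are identical, the third is unrelated.
df = pd.DataFrame({
    "OnlineSecurity_No internet service": [0, 1, 0, 0, 1, 0],
    "TechSupport_No internet service":    [0, 1, 0, 0, 1, 0],
    "PaperlessBilling":                   [1, 1, 0, 1, 0, 0],
})

corr = df.corr().abs()

# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of every pair whose |correlation| exceeds the threshold.
threshold = 0.9  # assumed cutoff
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = df.drop(columns=to_drop)

print(to_drop)
```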

Logistic Regression Model Building:
- Used StatsModels' Generalized Linear Model (GLM) for logistic regression.
- Iteratively performed feature selection using Recursive Feature Elimination (RFE).
- Checked for Variance Inflation Factors (VIF) to identify and remove multicollinearity.
- Checked confusion matrices, accuracy, sensitivity, specificity, precision, and recall.

ROC Curve and AUC:
- Plotted the Receiver Operating Characteristic (ROC) curve.
- Calculated the Area Under the Curve (AUC) for model evaluation.
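
A minimal example of computing the ROC points and the AUC with scikit-learn. The four labels and probabilities are toy values, not model output from this project.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1])             # actual churn labels (toy)
y_prob = np.array([0.1, 0.4, 0.35, 0.8])    # predicted churn probabilities (toy)

# False positive rate and true positive rate at each candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# Probability that a random churner is scored above a random non-churner.
auc = roc_auc_score(y_true, y_prob)
print(f"AUC = {auc:.2f}")  # AUC = 0.75
```

Plotting `fpr` against `tpr` (for example with Matplotlib) gives the ROC curve itself.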

Optimal Cutoff Point:
- Determined the optimal cutoff probability using accuracy, sensitivity, and specificity.
- Converted the predicted probabilities to class labels using the chosen cutoff.
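
The cutoff search can be sketched as a table of metrics over candidate cutoffs. The toy labels and probabilities are invented, and picking the cutoff where sensitivity and specificity are closest is one possible selection rule, assumed here for illustration.

```python
import numpy as np
import pandas as pd

y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_prob = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.32, 0.85])

rows = []
for cutoff in [i / 10 for i in range(1, 10)]:
    y_pred = (y_prob > cutoff).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    tn = int(((y_pred == 0) & (y_true == 0)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    rows.append({
        "cutoff": cutoff,
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),   # share of churners caught
        "specificity": tn / (tn + fp),   # share of non-churners kept
    })
cutoff_df = pd.DataFrame(rows)

# Assumed rule: choose the cutoff where sensitivity and specificity are closest.
gap = (cutoff_df["sensitivity"] - cutoff_df["specificity"]).abs()
best_cutoff = cutoff_df.loc[gap.idxmin(), "cutoff"]

# Final class labels at the chosen cutoff.
final_pred = (y_prob > best_cutoff).astype(int)
print(cutoff_df)
print("chosen cutoff:", best_cutoff)
```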

Precision and Recall:
- Calculated precision and recall for model evaluation.
- Explored the precision-recall tradeoff.
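
Precision, recall, and the tradeoff between them can be computed with scikit-learn as in the sketch below. The toy labels, probabilities, and the 0.5 cutoff are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, precision_score, recall_score

y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.6, 0.4, 0.7, 0.1, 0.9])

# Point metrics at a single cutoff.
y_pred = (y_prob > 0.5).astype(int)
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")

# Precision and recall across all thresholds, for the tradeoff curve.
precisions, recalls, thresholds = precision_recall_curve(y_true, y_prob)
```

Raising the cutoff generally increases precision at the expense of recall, which is why the two are examined together.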

Making Predictions on Test Set:
- Applied the trained model on the test set.
- Explored different probability cutoffs and assessed accuracy, sensitivity, and specificity.

Logistic Regression:
- Used for binary classification problems (Churn/Not Churn).

One-Hot Encoding:
- Technique to convert categorical variables into binary (dummy) variables.

Recursive Feature Elimination (RFE):
- Method for selecting features by recursively removing the least important ones.

Receiver Operating Characteristic (ROC) Curve:
- Graphical representation of the tradeoff between sensitivity and specificity.

Area Under the Curve (AUC):
- Measures the area under the ROC curve; values closer to 1 indicate better separation of the two classes.

Variance Inflation Factor (VIF):
- Quantifies multicollinearity among the predictors in the regression model.

Confusion Matrix:
- Table of actual versus predicted classes used for evaluating the performance of a classification model.

Precision and Recall:
- Precision is the share of predicted churners who actually churned; recall is the share of actual churners the model identified.

Feature Scaling (StandardScaler):
- Standardizes numerical features to zero mean and unit variance, which helps logistic regression converge.

Train-Test Split:
- Divides the dataset into training and testing subsets for model evaluation.

Dummy Variables:
- Binary columns created to represent categorical data.

- Open your terminal or command prompt.
- Navigate to the directory where you want to store the project: `cd path/to/your/directory`
- Clone the repository: `git clone https://github.com/yashksaini-coder/Multivariate-Logistic-Regression---Telecom-Churn`
- Move into the cloned folder: `cd Multivariate-Logistic-Regression---Telecom-Churn`
- Create a virtual environment: `python -m venv venv`
- Activate the virtual environment:
  - For Windows: `.\venv\Scripts\activate`
  - For macOS/Linux: `source venv/bin/activate`
- Install the dependencies: `pip install -r requirements.txt`
- Start Jupyter Notebook: `jupyter-notebook`

This will open a new tab in your web browser showing the Jupyter Notebook interface. Navigate to the cloned project directory and open the notebook titled `Logistic-Regression.ipynb`.

Run each cell in the notebook sequentially. Ensure that you have the necessary datasets (`churn_data.csv`, `customer_data.csv`, `internet_data.csv`, and `Dictionary.csv`) in the `Datasets` folder.