This repo analyzes patient length of stay to help healthcare providers improve patient outcomes and reduce costs. By leveraging data and advanced analytical techniques, providers can gain a deeper understanding of patient needs and identify opportunities to improve the delivery of care.
The goal is to predict the length of stay (LOS) of each patient by implementing an ML model fed with features engineered from EDA on a large dataset of hospital log history.
- Snowflake: warehouse management
- SageMaker: to host the notebook instance
- Knowledge of ML algorithms (classification and regression)
- Python (Pandas) and SQL
For Snowflake to host the data, create a database and a table, then stage and load the data from the local machine. Here's the script: create_db_and_table
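As a rough sketch of what create_db_and_table does, here is an assumed version using the Snowflake Python connector; the credentials, file path, and column names are placeholders, not the repo's actual schema:

```python
# Hypothetical sketch: create the database and table, then stage and load
# a local CSV with PUT + COPY INTO. Credentials and columns are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="...",
    warehouse="COMPUTE_WH",
)
cur = conn.cursor()
cur.execute("CREATE DATABASE IF NOT EXISTS HEALTHCARE")
cur.execute("USE SCHEMA HEALTHCARE.PUBLIC")
cur.execute("""
    CREATE TABLE IF NOT EXISTS HEALTH_DATA (
        case_id             INTEGER,
        hospital_code       INTEGER,
        department          VARCHAR,
        ward_type           VARCHAR,
        severity_of_illness VARCHAR,
        admission_date      DATE,
        discharge_date      DATE
    )
""")
# Upload the local file to the table's internal stage, then bulk-load it.
cur.execute("PUT file:///path/to/health_data.csv @%HEALTH_DATA")
cur.execute("""
    COPY INTO HEALTH_DATA FROM @%HEALTH_DATA
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")
conn.close()
```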
Snowflake provides in-house visualization of column distributions when you select all columns from the table (select * from health_data) for a quick overview. However, it's important to understand the behavior of discharge_data with respect to one or more other columns in the table. With logical reasoning, perform EDA on the database to identify patterns, if any. Here's my take on EDA: snowflake_eda
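For illustration, here are two queries in the spirit of snowflake_eda; the column names are assumptions based on a typical hospital-stay dataset, not the repo's actual schema:

```python
# Hypothetical EDA queries against HEALTH_DATA; column names are assumed.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="...",
    database="HEALTHCARE", schema="PUBLIC",
)
cur = conn.cursor()

# How does the average stay vary with illness severity?
for row in cur.execute("""
    SELECT severity_of_illness,
           AVG(DATEDIFF('day', admission_date, discharge_date)) AS avg_stay
    FROM HEALTH_DATA
    GROUP BY severity_of_illness
    ORDER BY avg_stay DESC
""").fetchall():
    print(row)

# Do discharge volumes differ by department?
for row in cur.execute("""
    SELECT department, COUNT(*) AS discharges
    FROM HEALTH_DATA
    WHERE discharge_date IS NOT NULL
    GROUP BY department
    ORDER BY discharges DESC
""").fetchall():
    print(row)
conn.close()
```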
From the insights captured during EDA, create new columns combining two or more correlated features. To train the models on LOS, add a designated LOS column. Here is the script: feature_engineering
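A sketch of the kind of transformation feature_engineering performs; the combined column and the date-based LOS derivation are assumptions for illustration:

```python
# Hypothetical feature engineering in SQL: combine two correlated columns
# into one feature and derive the designated LOS target column.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="...",
    database="HEALTHCARE", schema="PUBLIC",
)
conn.cursor().execute("""
    CREATE OR REPLACE TABLE HEALTH_DATA_FEATURES AS
    SELECT t.*,
           -- example combined feature from two correlated columns
           severity_of_illness || '_' || ward_type AS severity_ward,
           -- designated target column: length of stay in days
           DATEDIFF('day', admission_date, discharge_date) AS los
    FROM HEALTH_DATA t
""")
conn.close()
```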
Databases are good at querying, and pandas is good at data manipulation, so the data is copied into pandas for preprocessing (a sketch follows the list below):
- Create a notebook instance on SageMaker
- Import snowflake-connector-python and snowflake-sqlalchemy to load data from Snowflake into the Python script
- Drop irrelevant rows
- Categorize columns into objects and integers
- Perform one-hot encoding on the object columns
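A minimal sketch of these preprocessing steps, assuming the connection parameters, an identifier column named case_id, and lowercase column names (snowflake-sqlalchemy's default):

```python
# Load the feature-engineered table into pandas and preprocess it.
import pandas as pd
from snowflake.sqlalchemy import URL
from sqlalchemy import create_engine

engine = create_engine(URL(
    account="your_account", user="your_user", password="...",
    database="HEALTHCARE", schema="PUBLIC", warehouse="COMPUTE_WH",
))
df = pd.read_sql("SELECT * FROM health_data_features", engine)

# Drop irrelevant rows (here: rows with missing values), plus the assumed
# identifier and raw date columns (the dates are already encoded in los).
df = df.dropna().drop(columns=["case_id", "admission_date", "discharge_date"])

# Categorize columns by dtype: object columns are one-hot encoded,
# integer/numeric columns pass through unchanged.
object_cols = df.select_dtypes(include="object").columns
df = pd.get_dummies(df, columns=list(object_cols))
```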
To find the features with the highest impact on LOS, perform decision-tree and XGBoost feature selection; the union of the two resulting feature sets is stored and used for building the ML model.
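A sketch of the union-based selection, reusing df from the preprocessing sketch and assuming los is the engineered target column:

```python
# Take the top-k features by importance from a decision tree and from
# XGBoost, then union the two sets; df comes from the sketch above.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor

X, y = df.drop(columns=["los"]), df["los"]

def top_features(model, k=15):
    model.fit(X, y)
    order = np.argsort(model.feature_importances_)[::-1]
    return set(X.columns[order[:k]])

selected = top_features(DecisionTreeRegressor(random_state=0)) \
         | top_features(XGBRegressor(random_state=0))
X_sel = X[sorted(selected)]
```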
- Train Linear Regression, XGBoost Regression, and Random Forest Regression on the selected features, as sketched below. Hyperparameter tuning, voting, bagging, etc. are not performed, as the focus is on demonstrating the workflow.
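A baseline training loop over the three regressors, reusing X_sel and y from the selection sketch and comparing them by RMSE on a held-out split:

```python
# Train three baseline regressors with default hyperparameters.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

X_train, X_test, y_train, y_test = train_test_split(
    X_sel, y, test_size=0.2, random_state=0)

for model in (LinearRegression(),
              XGBRegressor(random_state=0),
              RandomForestRegressor(random_state=0)):
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f"{type(model).__name__}: test RMSE = {rmse:.2f}")
```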
Now that the model is ready, schedule a notebook to run the scoring script on the preprocessed, daily-updated data from the Snowflake warehouse. Here is the script: scoring_and_scheduling. The script also sends a daily email notification whenever the model predicts LOS, and the predictions are stored in a logging database on Snowflake for model retraining.
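A sketch of the daily scoring step in the spirit of scoring_and_scheduling; the table names, SMTP host, and email addresses are placeholders, and model/selected are the fitted regressor and feature set from the sketches above:

```python
# Score today's rows, log predictions back to Snowflake, and email a summary.
import smtplib
from email.message import EmailMessage
import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="...",
    database="HEALTHCARE", schema="PUBLIC",
)
new_df = conn.cursor().execute("""
    SELECT * FROM HEALTH_DATA_FEATURES
    WHERE admission_date = CURRENT_DATE()
""").fetch_pandas_all()
new_df.columns = new_df.columns.str.lower()  # match training-time names

# Apply the same one-hot encoding as training and align to the selected
# feature set (missing dummy columns are filled with zeros).
object_cols = new_df.select_dtypes(include="object").columns
encoded = pd.get_dummies(new_df, columns=list(object_cols))
encoded = encoded.reindex(columns=sorted(selected), fill_value=0)
new_df["predicted_los"] = model.predict(encoded)

# Append predictions to a logging table for later model retraining.
write_pandas(conn, new_df, "LOS_PREDICTION_LOG", auto_create_table=True)

# Daily email notification once the predictions are written.
msg = EmailMessage()
msg["Subject"] = f"LOS predicted for {len(new_df)} patients"
msg["From"], msg["To"] = "pipeline@example.com", "care-team@example.com"
msg.set_content(new_df["predicted_los"].describe().to_string())
with smtplib.SMTP("smtp.example.com") as smtp:
    smtp.send_message(msg)
conn.close()
```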
The project gives an overall idea of performing feature engineering with SQL and feature selection with pandas while creating an end-to-end data workflow, where the data is managed by Snowflake and the compute resources by a SageMaker notebook instance.