Code: CHEP 898
Term: 2025 Winter
Delivery: In person
Location: HLTH 3312
Start Date: January 8th, 2025
Time: Wednesdays 9:00-11:50 am
This course introduces students to the principles of data science as applied to epidemiological research. Emphasis is on data wrangling, version control with Git and GitHub, high-performance computing, and machine learning techniques. It also compares traditional epidemiologic analysis approaches with contemporary machine learning methods.
The official syllabus for this course is available for download here
I acknowledge our shared connection to the land and recognize that Indigenous and Métis peoples on Treaty 6 Territory and all Indigenous peoples have been and continue to be stewards for social justice, equity, and land-based education. In the spirit of reconciliation may we all strive to learn and support the work of Indigenous communities as allies.
This course will follow the general USask Guidelines about AI for Educators and Students (https://leadership.usask.ca/initiatives/ai/index.php). The University has developed high-level guidance based on the European Network for Academic Integrity (ENAI) recommendations. The principles are descriptions of USask intentions for, and beliefs about, the use of AI. They fall into four categories: Ethical and Responsible Use, Literacy, Tool Use, and Change and Innovation.
In general, my opinion is that you should explore these tools, what they can do, and how you can integrate them into your work. These tools are great for editing, formatting, generating ideas, and writing very basic code. USask faculty and students have access to Microsoft Copilot (https://teaching.usask.ca/learning-technology/tools/microsoft-copilot.php). When you use these tools, it is critical that you stay aware of bias and intervene to correct the text. Here are my general rules for AI in this course.
- You can use AI tools for any or all parts of the work.
- If you do, you must cite your work (as above).
  - 2.1. Acknowledge AI tools: “All persons, sources, and tools that influence the ideas or generate the content should be properly acknowledged” (p. 3). Acknowledgement may be done in different ways, according to context and discipline, and should include the input to the tool.
  - 2.2. Do not list AI tools as authors: Authors must take responsibility and be accountable for content, and an AI tool cannot do so.
  - 2.3. Recognize limits and biases of AI tools: Inaccuracies, errors, and bias are reproduced in AI tools in part because of the human-produced materials used for training.
- If you do, you must include a 500-word reflective essay about the experience as part of your self-evaluation.
- Be very careful with references. Many of these tools make up references that do not exist.
- I will not use tools like GPTZero to detect whether you have used AI tools or not. We are making an agreement to be honest with each other here. This is a small class. We have that luxury.
Dr. Daniel Fuller [email protected]
- Understand the basics of data wrangling and data management in epidemiology.
- Gain proficiency in using Git and GitHub for version control.
- Learn to leverage high-performance computing resources for epidemiologic data analysis.
- Explore various machine learning techniques and their applications in epidemiology.
- Compare and contrast traditional epidemiological analysis methods with machine learning approaches.
There is no single textbook for this course. We will use selected parts of several open access resources.
R for Data Science (2e). 2024. Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. https://r4ds.hadley.nz/
An Introduction to Statistical Learning with Applications in R (2e). 2024. Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. https://www.statlearning.com/
Learn Tidymodels. https://www.tidymodels.org/learn/
Use of a statistical software program (R) is required for this course. You will also be asked to install other software, including PostgreSQL (SQL) and Git.
In this course we will use the CanPath Student Dataset that provides students the unique opportunity to gain hands-on experience working with CanPath data. The CanPath Student Dataset is a synthetic dataset that was manipulated to mimic CanPath’s nationally harmonized data but does not include or reveal actual data of any CanPath participants.
The CanPath Student Dataset is available to instructors at a Canadian university or college for use in an academic course, at no cost. CanPath will provide the Student Dataset and a supporting data dictionary.
- Large sample size (Over 40,000 participants)
- Real-world population-level Canadian data
- Variety of areas of information allowing for a wide range of research topics
- No cost to faculty
- Potential for students to apply for real CanPath data to publish their findings
Week | Date | Topic | Data Work | Assignment Due |
---|---|---|---|---|
1 | January 8 | Intro to Data Science | Intro R + Data Wrangling | |
2 | January 15 | R Wrangling and Visualization | Data Visualization | |
3 | January 22 | Version Control with Git/GitHub | HappyGitwithR | Data Wrangling |
4 | January 29 | Missing Data | Missing Data | Git/Github |
5 | February 5 | Linear Regression | Linear Regression | Missing Data |
6 | February 12 | Logistic Regression | Logistic Regression | |
7 | February 19 | Reading Week | | |
8 | February 26 | Scientific Computing | Scientific Computing/Big Data | Independent Analysis 1 |
9 | March 5 | Causal Inference | Causal Quartet | Scientific Computing |
10 | March 12 | Support Vector Machines | Random Forest | |
11 | March 19 | Random Forest | Matching | Random Forest |
12 | March 26 | Matching Methods | SVM | Matching |
13 | April 2 | Artificial Neural Networks | ANN | Independent Analysis 2 |
- Subject to change depending on the pace of the class
Attendance, participation, and reading ahead are critical to this course. There will be a lot of time allocated for discussion and for working on assignments, but reading ahead is a critical part of the learning process.
You can find the detailed descriptions for all assignments below or in the assignments folder here
Assignment | Grade % |
---|---|
Data Wrangling and Visualization | 10% |
GitHub | 10% |
Missing Data | 15% |
Independent Analysis - Part 1 | 10% |
Random Forest | 15% |
Scientific Computing/Big Data | 10% |
Matching | 15% |
Independent Analysis - Part 2 | 15% |
Total | 100% |
Value: 10% of final grade
Description: In this assignment you will work through a data wrangling exercise that involves data cleaning, descriptive statistics, understanding missing data, and joining datasets together.
Value: 10% of final grade
Description: In this assignment you will create a GitHub account, install Git on your local computer, create a GitHub repository, and commit and push your work to that repository.
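The commit-and-push part of this workflow can be sketched roughly as follows. This is a minimal illustration, not the assignment's required steps: the project name, the user name and email, and the local bare repository standing in for a GitHub repository are all assumptions made so the example runs without a GitHub account.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a throwaway directory so nothing else on your machine is touched.
WORKDIR="$(mktemp -d)"

# Assumption: a local bare repository stands in for the GitHub repository.
# With a real GitHub repository you would use its URL instead of this path.
REMOTE="$WORKDIR/remote.git"
git init --quiet --bare "$REMOTE"

# Create the local repository (hypothetical project name).
REPO="$WORKDIR/chep898-demo"
git init --quiet "$REPO"
cd "$REPO"

# Identify yourself to Git (placeholder values, stored in this repo only).
git config user.name "Student Name"
git config user.email "student@example.com"

# Create a file, stage it, and commit it.
echo "# CHEP 898 demo" > README.md
git add README.md
git commit --quiet -m "Add README"

# Name the branch main, link the remote, and push.
git branch -M main
git remote add origin "$REMOTE"
git push --quiet -u origin main

# Confirm that the commit reached the remote.
git -C "$REMOTE" log --oneline main
```

With a real GitHub repository, only the `git remote add origin` line changes: the path is replaced by the repository URL that GitHub displays after you create the repository.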
Value: 15% of final grade
Description: In this assignment you will apply and compare different methods for imputing missing data on a large health administrative dataset.
Value: 10% of final grade
Description: This is part 1 of the independent analysis. You will need to find a dataset, develop an analysis plan that includes the major components of the course (e.g., GitHub, Scientific Computing), and conduct descriptive statistics and data wrangling on your chosen dataset.
Value: 15% of final grade
Description: In this assignment you will complete a Random Forest analysis using the CanPath Student Dataset. You will need to run the analysis, conduct detailed hyperparameter tuning, and conduct model comparisons.
Value: 10% of final grade
Description: In this assignment you will use the USask Plato high-performance computing cluster to run a large-scale machine learning analysis on a large (~1 GB) dataset.
Value: 15% of final grade
Description: In this assignment you will complete a machine-learning-based matching analysis using the CanPath Student Dataset.
Value: 15% of final grade
Description: This is part 2 (the final part) of the independent analysis. You will need to conduct a complete analysis, including data wrangling and missing data handling, and apply at least two different machine learning methods to your data.
Type: Written report
Description: Complete the student self-evaluation form.
All assignments should be submitted to the appropriate place in Canvas or GitHub. All assignments are due at 5pm (CST) on the due date. Please don’t stay up until midnight to get the work done. Remember, there are no late penalties, so take an extra day if you need it and get some sleep.
There is no penalty for late assignments. However, because many assignments have two parts, it is critical to submit the first assignment of each section around the due date. Assignments that have not been submitted by the end of the course will receive a grade of zero.