---
title: "Data aggregation, cleaning, and EDA"
output:
  html_document:
    df_print: paged
  html_notebook: default
  pdf_document: default
date: "April 14, 2020"
---

```{r setup, include=FALSE}
library(fs)        # file utilities
library(lubridate) # date utilities
library(DT)        # interactive data tables
library(tidyverse)
```

## Datasets used:

(This selection can be adjusted.)

Arizona:

- PM 2.5 data for 2020
- Carbon monoxide (CO) data for 2020
- COVID-19 cases by county

California:

- PM 2.5 data for 2020
- Carbon monoxide (CO) data for 2020
- COVID-19 cases by county

### Initial import & cleaning

Note that many locations do not have both PM 2.5 *and* CO data; a quick per-site check follows the import chunk below.

```{r datasets}
data_directory <- "/Users/elizabethhaderer/Documents/github/duke_datafest_covid19/datasets" # change appropriately

# COVID-19 data from the NYT: https://github.com/nytimes/covid-19-data
covid_by_county_url <- "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"

covid_by_county_dfs <- read_csv(url(covid_by_county_url)) %>%
  filter(state %in% c("California", "Arizona")) %>%
  mutate_if(is.character, toupper) %>%
  rename_all(toupper) %>%
  group_by(STATE) %>% # split by state (alphabetical: [[1]] = AZ, [[2]] = CA) to combine with the EPA data more easily
  group_split()

# Arizona EPA data: join the CO and PM 2.5 files on the shared site/date keys
az_files <- fs::dir_ls(data_directory, regexp = "ad_viz_plotval_data_AZ")
az_df <- az_files %>%
  map(read_csv) %>%
  map(~ .x %>% mutate_if(is.character, toupper)) %>%
  reduce(full_join,
         by = c("Date", "STATE_CODE", "STATE", "COUNTY_CODE", "COUNTY",
                "Site ID", "Site Name", "SITE_LATITUDE", "SITE_LONGITUDE",
                "CBSA_CODE", "CBSA_NAME"),
         suffix = c("_co", "_pm25")) %>%
  rename_all(toupper) %>%
  mutate_at("DATE", function(x) as.Date(x, format = "%m/%d/%Y")) %>%
  left_join(covid_by_county_dfs[[1]], by = c("DATE", "STATE", "COUNTY"))

# California EPA data: same steps as above
ca_files <- fs::dir_ls(data_directory, regexp = "ad_viz_plotval_data_CA")
ca_df <- ca_files %>%
  map(read_csv) %>%
  map(~ .x %>% mutate_if(is.character, toupper)) %>%
  reduce(full_join,
         by = c("Date", "STATE_CODE", "STATE", "COUNTY_CODE", "COUNTY",
                "Site ID", "Site Name", "SITE_LATITUDE", "SITE_LONGITUDE",
                "CBSA_CODE", "CBSA_NAME"),
         suffix = c("_co", "_pm25")) %>%
  rename_all(toupper) %>%
  mutate_at("DATE", function(x) as.Date(x, format = "%m/%d/%Y")) %>%
  left_join(covid_by_county_dfs[[2]], by = c("DATE", "STATE", "COUNTY"))
```
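
As a quick check on the note above (many locations do not report both pollutants), a sketch like the following tallies, per site, whether any CO and any PM 2.5 readings are present. It uses the upper-cased EPA column names produced by the chunk above; Arizona is shown, and the same works for `ca_df`.

```{r}
# Per-site availability of CO and PM 2.5 readings after the full join
az_df %>%
  group_by(`SITE NAME`) %>%
  summarise(
    has_co   = any(!is.na(`DAILY MAX 8-HOUR CO CONCENTRATION`)),
    has_pm25 = any(!is.na(`DAILY MEAN PM2.5 CONCENTRATION`))
  ) %>%
  count(has_co, has_pm25)
```
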
### Look at the data sets

```{r}
# A simplified list of columns to look at. The meaning of many of the other columns is unclear, so this is a first guess.
relevant_initial_cols <- c("DATE", "STATE", "COUNTY", "SITE NAME",
                           "DAILY MAX 8-HOUR CO CONCENTRATION",
                           "DAILY MEAN PM2.5 CONCENTRATION",
                           "CASES", "DEATHS")

# Note: according to covid19.healthdata.org, California's stay-at-home order began on March 19, 2020.
ca_stay_at_home_start <- as.Date("2020-03-19")
```
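
Since many of the EPA columns are unfamiliar, it can help to list everything that is available before settling on the subset above (an optional check):

```{r}
# Optional: inspect the full set of column names in each combined data frame
names(az_df)
names(ca_df)
```
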
```{r}
az_df %>%
  select(all_of(relevant_initial_cols)) %>%
  filter(DATE > ca_stay_at_home_start) %>%
  datatable(filter = "top",
            caption = "AZ data on CO, PM 2.5, & COVID cases/deaths")
```
```{r}
ca_df %>%
  select(all_of(relevant_initial_cols)) %>%
  filter(DATE > ca_stay_at_home_start) %>%
  datatable(filter = "top",
            caption = "CA data on CO, PM 2.5, & COVID cases/deaths")
```

## Comparing pre- and post-shutdown PM 2.5

California shutdown date: 2020-03-19

- Pre-shutdown subset: 2020-02-19 to 2020-03-19
- Post-shutdown subset: 2020-03-20 to 2020-04-20

Assumptions: PM 2.5 measurements are independent, and each day's mean is a representative sample for its region, so the daily means are treated as a random sample.

Arizona
```{r az}
# Subset the pre- and post-shutdown windows and compute the mean PM 2.5 for each
az_pre <- subset(az_df, DATE > "2020-02-18" & DATE < "2020-03-20")
az_pre_mean <- mean(az_pre$`DAILY MEAN PM2.5 CONCENTRATION`, na.rm = TRUE)
az_post <- subset(az_df, DATE > "2020-03-19" & DATE < "2020-04-21")
az_post_mean <- mean(az_post$`DAILY MEAN PM2.5 CONCENTRATION`, na.rm = TRUE)

# Check whether the variances differ, then run Welch's t-test (does not assume equal variances)
var.test(az_pre$`DAILY MEAN PM2.5 CONCENTRATION`, az_post$`DAILY MEAN PM2.5 CONCENTRATION`)
t.test(az_pre$`DAILY MEAN PM2.5 CONCENTRATION`, az_post$`DAILY MEAN PM2.5 CONCENTRATION`, var.equal = FALSE)
```
California
```{r ca}
# Subset the pre- and post-shutdown windows and compute the mean PM 2.5 for each
ca_pre <- subset(ca_df, DATE > "2020-02-18" & DATE < "2020-03-20")
ca_pre_mean <- mean(ca_pre$`DAILY MEAN PM2.5 CONCENTRATION`, na.rm = TRUE)
ca_post <- subset(ca_df, DATE > "2020-03-19" & DATE < "2020-04-21")
ca_post_mean <- mean(ca_post$`DAILY MEAN PM2.5 CONCENTRATION`, na.rm = TRUE)

# Check whether the variances differ, then run Welch's t-test (does not assume equal variances)
var.test(ca_pre$`DAILY MEAN PM2.5 CONCENTRATION`, ca_post$`DAILY MEAN PM2.5 CONCENTRATION`)
t.test(ca_pre$`DAILY MEAN PM2.5 CONCENTRATION`, ca_post$`DAILY MEAN PM2.5 CONCENTRATION`, var.equal = FALSE)
```
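
The Arizona and California comparisons repeat the same steps, so they could be wrapped in a small helper. This is only a sketch; `compare_pre_post` is a hypothetical name, not part of the original analysis, and the windows approximate the ones used above.

```{r}
# Hypothetical helper: subset pre/post windows around a shutdown date and run Welch's t-test
compare_pre_post <- function(df,
                             shutdown_date = as.Date("2020-03-19"),
                             window_days = 30) {
  pm25 <- df %>%
    select(DATE, pm25 = `DAILY MEAN PM2.5 CONCENTRATION`) %>%
    filter(!is.na(pm25))
  pre  <- pm25 %>% filter(DATE > shutdown_date - window_days, DATE <= shutdown_date)
  post <- pm25 %>% filter(DATE > shutdown_date, DATE <= shutdown_date + window_days)
  t.test(pre$pm25, post$pm25, var.equal = FALSE)
}

compare_pre_post(az_df)
compare_pre_post(ca_df)
```
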
### Animation of PM 2.5 values in CA and AZ
```{r}
### Setup for animation ###
library(maps)
library(ggmap)
library(mapdata)
library(ggthemes)
library(gganimate)
library(gifski)
library(viridis)

# Add more steps to the color scale
custom_colors <- viridis(20)

# az_and_ca (used below) was not defined earlier; build the state outlines from the maps package
az_and_ca <- map_data("state") %>%
  filter(region %in% c("arizona", "california"))

combined_ca_az <- az_df %>%
  full_join(ca_df) %>%
  select(DATE, SITE_LONGITUDE, SITE_LATITUDE, `DAILY MEAN PM2.5 CONCENTRATION`, STATE) %>%
  filter(`DAILY MEAN PM2.5 CONCENTRATION` < 50) # drop outliers for the visualization only; they skew the color scale too much

# Define the plot
combined_map_plot <- ggplot() +
  geom_polygon(data = az_and_ca, aes(x = long, y = lat, group = group),
               fill = "lightgrey", color = "white") +
  coord_fixed(1.3) +
  geom_point(data = combined_ca_az,
             aes(x = SITE_LONGITUDE, y = SITE_LATITUDE,
                 size = 40,
                 fill = `DAILY MEAN PM2.5 CONCENTRATION`),
             colour = "black",
             alpha = 0.8,
             pch = 21) +
  scale_fill_gradientn(colours = custom_colors,
                       name = "Daily mean PM 2.5 concentration",
                       guide = guide_colorbar(
                         direction = "horizontal",
                         barheight = unit(5, units = "mm"),
                         barwidth = unit(50, units = "mm"),
                         draw.ulim = FALSE,
                         title.position = "top",
                         # center the title and labels
                         title.hjust = 0.5,
                         label.hjust = 0.5
                       )) +
  labs(title = "PM 2.5 levels: {closest_state}") +
  theme_map() +
  guides(size = FALSE) +
  theme(legend.position = "bottom",
        plot.title = element_text(size = 27),
        legend.title = element_text(size = 16),
        legend.text = element_text(size = 16)) +
  transition_states(DATE, transition_length = 1, state_length = 1)

# Animate the plot by date
animate(combined_map_plot,
        nframes = 300,
        renderer = gifski_renderer())
```
```{r}
# Save the animation to a GIF
anim_save("ca_and_az_animation.gif",
          animation = last_animation())
```