---
title: "Data Jamboree- ggplot2 with Presenter Notes"
output: html_document
---
# Data Jamboree INTRODUCTION
**Getting Started Slide will start on screen**
Alison:
- Introduce ggplot
- Goals for the day (learning four types of plots)
- Lead partnering up
- Installing packages and loading packages (put pdf on screen during this)
- Introduces NHANES dataset (available and accessible to the public)
# Begin Live Coding:
Alison:
- Remind people how to download R, RStudio, and packages
## Setup
### Install packages (once per machine)
```{r install_packages}
#install.packages("readr")
#install.packages("ggplot2")
#install.packages("ggbeeswarm")
#install.packages("dplyr")
#install.packages("MASS")
#install.packages("hexbin")
```
### Load packages (once per R session)
```{r load_packages}
library(readr)
library(ggplot2)
library(ggbeeswarm)
library(dplyr)
library(MASS)
library(hexbin)
```
### Import data (readr)
We will be using a little bit of R to get started. We want to **assign the data file to an R object using (<-)**. This imports our data and gives us the ability to call the data easily.
```{r import_data}
heart <- read_csv("http://faculty.washington.edu/kenrice/heartgraphs/nhaneslarge.csv", na=".") # na = "." tells readr to treat "." as a missing value (NA)
```
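To see what `na = "."` does without downloading anything, here is a minimal sketch that reads a tiny CSV from a string. The `toy_csv` text and its column names are made up for illustration and are not part of the NHANES file.

```{r na_sketch}
library(readr)

# Hypothetical three-row CSV; the "." in row 2 stands for a missing value
toy_csv <- "id,folate\n1,250\n2,.\n3,410\n"

# A string containing newlines is read as literal data rather than a file path
toy <- read_csv(toy_csv, na = ".")

# The "." became NA, so the column is still read as numbers
is.na(toy$folate)
mean(toy$folate, na.rm = TRUE)
```

Without `na = "."`, the lone `.` would likely force the whole column to be read as text, which is why the import call sets it.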
To take a quick look at our data, **run the command `head()` and check with your partner to see if you got the same output**. The number of columns shown will change based on the width of your screen, but you should get the same dimension label at the top.
```{r head}
head(heart)
```
**PASS TO JULIANNE**
# Univariate Plots: Folate intake by gender
## Basic ggplot Anatomy
Let's start using ggplot2 to make some simple graphs. The code we use for ggplot is made of **layers**. It creates graphs by **layering information on top of an empty graph.** Let's try the basic ggplot command to see how this works.
```{r basic_ggplot}
ggplot(heart, aes(x=DR1TFOLA))
#This says we want ggplot to use the data.frame heart and to plot DR1TFOLA on the x-axis.
```
We see that the command for ggplot does actually create a plot, but without any layers, it doesn't know how you want to plot your data.
## Histogram
Say that we want to make a histogram plot. We start with our ggplot command and **add a geom layer** by using the **+** sign.
*histogram definition: a diagram consisting of rectangles whose area is proportional to the frequency of a variable and whose width is equal to the class interval*
```{r histogram}
ggplot(heart, aes(x=DR1TFOLA)) +
geom_histogram()
```
So same as before we said "ggplot, use the data.frame heart, make a plot with DR1TFOLA on the x-axis and now plot the frequency of DR1TFOLA by using a histogram type plot on top". What a good listener ggplot is!
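The layering idea can be checked directly, since a ggplot object keeps its layers in a list. This is a small sketch that uses the built-in `mtcars` data as a stand-in for `heart` (an assumption made so it runs without the download); `base_plot` and `hist_plot` are names invented here.

```{r layers_sketch}
library(ggplot2)

# Built-in mtcars data stands in for the heart data in this sketch
base_plot <- ggplot(mtcars, aes(x = mpg))
length(base_plot$layers)   # no geom added yet, so there are no layers

hist_plot <- base_plot + geom_histogram(bins = 10)
length(hist_plot$layers)   # one layer after adding the geom
```

Saving the empty plot to an object and adding to it later is also a handy way to reuse the same data and axes across several graphs.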
### Labels
Some of you may have been wondering what "DR1TFOLA" is. We find out by looking at the dataset that it means **folate intake**. Let's **update the x-axis label** with this information so we know what we are looking at.
```{r change_folate_label}
ggplot(heart, aes(x=DR1TFOLA)) +
geom_histogram() +
labs(x = "Folate intake") #x-axis label
```
### Changing Colors
One of ggplot's great features is that you can customize your plot to look exactly the way you picture it in your head. We can start with changing some colors.
First, we will change the outline of the bars by using **colour** in the **geom_histogram** layer.
```{r histogram_outline}
ggplot(heart, aes(x=DR1TFOLA)) +
geom_histogram(colour = "white") +
labs(x = "Folate intake")
```
We can also change the fill of the bars by using **fill** in the **geom_histogram** layer.
```{r histogram_fill}
ggplot(heart, aes(x=DR1TFOLA)) +
geom_histogram(colour = "white", fill = "peachpuff") + #yes, that is a name of a color
labs(x = "Folate intake")
```
### Documentation and the Help Command
Color settings change a little for each type of graph. Make sure to check the documentation so that you are changing the right kind of color. For example, here **colour** sets the outline and **fill** sets the inside color. **When in doubt, look up the documentation by typing "?command".**
```{r help}
?geom_histogram
```
### Histogram Bins and ggplot Defaults
Now, R has been bugging us about the number of bins in our histogram by telling us
```{r error}
# `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```
This just tells us that ggplot chose a default for us. **Most ggplot commands have defaults. They are meant as sensible starting points, but as the message itself suggests, they are not always the best choice for your data.** ggplot wants to make this whole plotting thing really easy for you. You can always look up the documentation to see what the defaults are and change them to your liking. In fact, I recommend that.
For now, we will **change the number of bins to 50**. Feel free to play around with this. You can change this to any number.
```{r histogram_bin}
ggplot(heart, aes(x=DR1TFOLA)) +
geom_histogram(colour = "white", fill = "peachpuff", bins = 50)
```
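To see what the bin setting changes, the sketch below counts the bars ggplot2 computes for two different `bins` values. It uses the built-in `mtcars` data rather than `heart` (an assumption so it runs offline), and `few_bins` and `many_bins` are illustrative names.

```{r bins_sketch}
library(ggplot2)

few_bins  <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 5)
many_bins <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 20)

# layer_data() returns the values ggplot2 computed for a layer: one row per bar
nrow(layer_data(few_bins))
nrow(layer_data(many_bins))

# binwidth sets the width of each bar in data units instead of the number of bars
ggplot(mtcars, aes(x = mpg)) + geom_histogram(binwidth = 2)
```

Either argument silences the `stat_bin()` message; `binwidth` is often easier to explain because it is in the units of the variable.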
### Facet Wrap
Now that we know how to create a basic histogram, **we can add another dimension to our plot**. We can create separate graphs based on gender by adding something called a facet_wrap. We add in our facet wrap as a layer with the "+" sign once again.
*Facet_wrap description: Most displays are roughly rectangular, so if you have a categorical variable with many levels, it doesn't make sense to try and display them all in one row (or one column). To solve this dilemma, facet_wrap wraps a 1d sequence of panels into 2d, making best use of screen real estate.*
```{r histogram_facet}
ggplot(heart, aes(x = DR1TFOLA)) +
geom_histogram(colour = "white", fill = "peachpuff", bins = 50) +
labs(x = "Folate intake") +
facet_wrap(~gender)
```
## Density Plot
Say that we want the exact same plot, but we want it to show the kernel density instead.
Easy, **change the geom_histogram layer to geom_density**. Make sure to take out the bin specification, as a density plot doesn't have visible bins.
*Kernel Density Estimation definition: kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. Kernel density estimation is a fundamental data smoothing problem where inferences about the population are made, based on a finite data sample.*
```{r density_facet}
ggplot(heart, aes(x = DR1TFOLA)) +
geom_density(colour = "white", fill = "peachpuff") +
labs(x = "Folate intake") +
facet_wrap(~gender)
```
### Adding a third dimension onto one plot
Alright, now that our plots are smoothed out, we may decide that we'd like to see these plots on the same graph. However, we want to keep all of the dimensions that we've been plotting out (folate, density, and gender). We can do this by **differentiating by color on the geom level in its aesthetics**. It's a bit complicated to explain, but much easier to understand when you see it. So let's try this out.
```{r density}
ggplot(heart, aes(x = DR1TFOLA)) +
geom_density(aes(colour = gender)) + #Notice moving the colour into aesthetics
labs(x = "Folate intake")
```
Remember from before that colour = the outline. So our density plots aren't filled yet. But ggplot very nicely gave us a **legend on the side**.
We can also make this exact same graph by adding colour into our main level aesthetics. We'll get into the difference between this in a little bit. For now, let's replicate our previous graph.
**NOTE TO ALISON: should we take this one out and you can explain it later with the more complicated graphs?**
```{r density_rep}
ggplot(heart, aes(x = DR1TFOLA, colour = gender)) +
geom_density() +
labs(x = "Folate intake")
```
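The two versions draw the same picture but store the colour mapping in different places, which the sketch below inspects. It uses `mtcars` with `factor(cyl)` playing the role of `gender` (an assumption so it runs without the heart data); `local_map` and `global_map` are names made up for this example.

```{r mapping_sketch}
library(ggplot2)

# mtcars stands in for heart; factor(cyl) plays the role of gender
local_map  <- ggplot(mtcars, aes(x = mpg)) +
  geom_density(aes(colour = factor(cyl)))
global_map <- ggplot(mtcars, aes(x = mpg, colour = factor(cyl))) +
  geom_density()

names(local_map$mapping)               # plot-level aesthetics: only x
names(local_map$layers[[1]]$mapping)   # colour lives on the density layer
names(global_map$mapping)              # x and colour, inherited by every layer
```

With a single geom the result is identical. The difference only shows once a second layer is added, because that layer inherits the plot-level mapping but not another layer's mapping.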
# Univariate Plots: Systolic blood pressure by gender
## Dotplot - Stripchart
Alright, let's start our next plot type by exploring dot plots. We will create a simple stripchart of `BPXSAR` (systolic blood pressure) by gender.
We'll start with the same basic structure from above, but this time we will use geom_point instead of geom_histogram or geom_density.
```{r dotplot}
ggplot(heart, aes(x = gender, y = BPXSAR)) +
geom_point()
```
### Customizing our Dotplot
As you can see, there are way too many points to understand what this is trying to say. We can **add an alpha in the geom layer to make our points more transparent**. Alpha works on a scale from 0 (clear) to 1 (opaque).
```{r alpha}
ggplot(heart, aes(x = gender, y = BPXSAR)) +
geom_point(alpha = .1)
```
It's still a bit hard to interpret so we can add jitter to space out our points.
```{r jitter}
ggplot(heart, aes(x = gender, y = BPXSAR)) +
geom_jitter(alpha = .1)
```
Alright, so we can see all of our points now, but they start to get a little mixed up with each other. **Jitter automatically adds space to both the height and the width of your plots.**
First, we should think a bit about what jitter does. It spaces things out, which introduces some noise to our plot. **Since our x-axis is a categorical variable, jitter along x won't change how we read the data, but our y-axis is continuous, so jitter there moves points away from their true values and will really mess up our graph.**
Feel free to play around with this to see how your graph changes.
```{r jitter_width}
ggplot(heart, aes(x = gender, y = BPXSAR)) +
geom_jitter(alpha = .1, width = .5, height = 0)
```
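A quick way to confirm that `height = 0` leaves the measurements alone is to compare the plotted positions with the raw data. The sketch below uses a small made-up data frame (`toy`, hypothetical values, not NHANES) so it runs on its own.

```{r jitter_sketch}
library(ggplot2)

# Made-up readings for two groups (hypothetical data, not NHANES)
toy <- data.frame(group = rep(c("a", "b"), each = 5),
                  bp    = c(110, 110, 120, 120, 130, 115, 115, 125, 125, 135))

set.seed(1)  # jitter is random; fix the seed so the result is repeatable
p <- ggplot(toy, aes(x = group, y = bp)) +
  geom_jitter(width = .2, height = 0)

drawn <- layer_data(p)
# height = 0 leaves the y values exactly as measured
all(sort(drawn$y) == sort(toy$bp))
# width = .2 nudges points sideways around group positions 1 and 2
range(as.numeric(drawn$x))
```

Keeping `width` below .5 also stops points from one category drifting into its neighbour's column.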
Now if we change our y-axis label we will have a readable plot! We are going to make the x-axis label blank by putting empty quotation marks, because "male" and "female" are clear enough not to need the title "gender". Just a bit more practice in customizing your plot.
```{r y_axis_bp}
ggplot(heart, aes(x = gender, y = BPXSAR)) +
geom_jitter(alpha = .1, width = .5, height = 0) +
labs(x = "", y = "Systolic BP (mmHg)")
```
### Beeswarm Plot
We're going to give a quick intro to the beeswarm plot. Once again, to avoid overlapping points we'll add some alpha.
*Beeswarm plot definition: a bee swarm plot is a one-dimensional scatter plot like "stripchart", but with closely-packed, non-overlapping points*
```{r beeswarm}
ggplot(heart, aes(x = gender, y = BPXSAR)) +
geom_beeswarm(alpha = .2) +
labs(x = "", y = "Systolic BP (mmHg)")
```
#### Stats layers
ggplot doesn't just make plots; **you can also add statistics on top of your plots**. Say we want to see where the mean is in our plots. We will add a new layer by using the command stat_summary. **Now the language here gets a bit more complicated, but what we are telling R is which statistic to compute (the mean), what type of geom to draw it with, and what color it should be.**
Play around with your colors too!
```{r beeswarm_mean}
ggplot(heart, aes(x = gender, y = BPXSAR)) +
geom_beeswarm(alpha = .2) +
stat_summary(fun = "mean", geom = "point", colour = "orange") + # `fun` replaces the deprecated `fun.y` (ggplot2 >= 3.3.0)
labs(x = "", y = "Systolic BP (mmHg)")
```
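To check that stat_summary really plots the group means, the sketch below compares the positions it computes with means worked out by hand. The `toy` data frame is made up for illustration, and the sketch assumes ggplot2 3.3.0 or later, where the argument is called `fun`.

```{r summary_sketch}
library(ggplot2)

# Hypothetical readings: group a averages 110, group b averages 140
toy <- data.frame(group = rep(c("a", "b"), each = 3),
                  bp    = c(100, 110, 120, 130, 140, 150))

# `fun` is the argument name in ggplot2 3.3.0 and later; older versions call it `fun.y`
p <- ggplot(toy, aes(x = group, y = bp)) +
  stat_summary(fun = mean, geom = "point", colour = "orange")

summary_y <- layer_data(p)$y        # the y positions stat_summary will draw
hand_means <- tapply(toy$bp, toy$group, mean)  # the same means computed by hand
summary_y
hand_means
```

Swapping `mean` for `median` (or any function that returns a single number) changes the statistic without touching the rest of the layer.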
### Layering geom types
Ok, say instead of adding statistics on our plot, we want another plot type in there. We can also add that as a layer by using the "+ geom" syntax. **We'll add a boxplot in our beeswarm plot to give ourselves more information about the shape of data.**
**Also, just a note to explain why we have outlier.shape = NA in our boxplot layer.** A boxplot draws outliers as points. However, since we've already plotted all our points in the beeswarm, we don't need to plot them again in our boxplot. We hide them by setting their shape to NA.
```{r boxplot_beeswarm}
ggplot(heart, aes(x = gender, y = BPXSAR)) +
geom_boxplot(outlier.shape = NA) +
geom_beeswarm(alpha = .2) +
labs(x = "", y = "Systolic BP (mmHg)")
```
This brings us to the topic of **layer order**. ggplot plots its layers in the order that they are typed. For example, since our boxplot was typed before the beeswarm, the boxplot will be plotted first and the beeswarm will be plotted over it. Let's change the order to see how that affects our plot.
```{r beeswarm_boxplot}
ggplot(heart, aes(x = gender, y = BPXSAR)) +
geom_beeswarm(alpha = .2) +
geom_boxplot(outlier.shape = NA) +
labs(x = "", y = "Systolic BP (mmHg)")
```
Now that plot is difficult to read because the boxplot was put over our beeswarm.
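The drawing order can also be read straight off the plot object. In this sketch `geom_jitter` stands in for `geom_beeswarm` and `mtcars` for `heart`, both assumptions made so that it needs only ggplot2; `box_first` and `layer_geoms` are illustrative names.

```{r order_sketch}
library(ggplot2)

# geom_jitter stands in for geom_beeswarm so the sketch needs only ggplot2
box_first <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = .1, height = 0, alpha = .5)

# Layers are stored, and drawn, in the order they were added
layer_geoms <- sapply(box_first$layers, function(layer) class(layer$geom)[1])
layer_geoms
```

The first entry is drawn at the bottom, so whatever should stay visible belongs later in the chain.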
## Extra Challenge
### Violin plot
If that was easy for you, give this a shot. Try and recreate the plots below using what you learned above.
*Hint: We are using "geom_violin".*
```{r violin, echo=FALSE}
ggplot(heart, aes(x = gender, y = BPXSAR)) +
geom_violin(alpha = .2) +
labs(x = "", y = "Systolic BP (mmHg)")
```
Try adding some statistics to your plot. You can start with the mean and the median.
```{r violin_stats, echo=FALSE}
ggplot(heart, aes(x = gender, y = BPXSAR)) +
geom_violin(alpha = .2) +
stat_summary(fun = "mean", geom = "point", colour = "orange") + # `fun` replaces the deprecated `fun.y`
stat_summary(fun = "median", geom = "point", colour = "blue") +
labs(x = "", y = "Systolic BP (mmHg)")
```
Try adding another geom layer.
*Hint: Do you want to include your outlier points?*
```{r violin_boxplot, echo=FALSE}
ggplot(heart, aes(x = gender, y = BPXSAR)) +
geom_violin(alpha = .2) +
geom_boxplot(width = .05) +
labs(x = "", y = "Systolic BP (mmHg)")
```
**PASS TO ALISON**
# Bivariate Plots: Age & Systolic Blood Pressure
Simple scatterplot
```{r}
ggplot(heart, aes(x = RIDAGEYR, y = BPXSAR)) +
geom_point() +
labs(x = "Age (years)", y = "Systolic BP (mmHg)")
```
If you have big n, try hexbin plot
```{r}
ggplot(heart, aes(x = RIDAGEYR, y = BPXSAR)) +
geom_hex() +
labs(x = "Age (years)", y = "Systolic BP (mmHg)")
```
Add linear regression line with SE
```{r}
ggplot(heart, aes(x = RIDAGEYR, y = BPXSAR)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Age (years)", y = "Systolic BP (mmHg)")
```
Default smoother (loess for fewer than 1,000 observations, otherwise a GAM)
```{r}
ggplot(heart, aes(x = RIDAGEYR, y = BPXSAR)) +
geom_point() +
geom_smooth() +
labs(x = "Age (years)", y = "Systolic BP (mmHg)")
```
Add splines
```{r}
library(splines)
library(MASS)
ggplot(heart, aes(x = RIDAGEYR, y = BPXSAR)) +
geom_point() +
stat_smooth(method = "lm", formula = y ~ ns(x, 3)) +
labs(x = "Age (years)", y = "Systolic BP (mmHg)")
```
# Multivariable Plots: Body Mass Index & Systolic BP, by gender and age
Just copy this:
```{r}
library(dplyr)
heart2 <- heart %>%
mutate(age_cat = cut(RIDAGEYR,c(0,30,55,100)))
```
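For anyone wondering what `cut()` is doing in that chunk, here is a small sketch with the same break points applied to a few made-up ages (the `ages` vector is hypothetical, chosen to sit on and around the breaks).

```{r cut_sketch}
# Hypothetical ages chosen to sit on and around the break points
ages <- c(18, 30, 31, 55, 80)
age_cat <- cut(ages, c(0, 30, 55, 100))

levels(age_cat)          # "(0,30]" "(30,55]" "(55,100]"
as.character(age_cat)    # ages 30 and 55 fall in the lower of their two bins
```

Each interval is open on the left and closed on the right, so an age of exactly 0 would fall outside `(0,30]` and become `NA`.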
Recreate theirs first
```{r}
ggplot(heart2, aes(x = BMXBMI, y = BPXSAR)) +
geom_point() +
stat_smooth(aes(colour = gender), method = "lm") +
facet_wrap(~age_cat) +
labs(x = "Body Mass Index"~(kg/m^2), y = "Systolic BP (mmHg)")
```
Try with facet grid, update labels
```{r}
ggplot(heart2, aes(x = BMXBMI, y = BPXSAR)) +
geom_point() +
stat_smooth(aes(colour = gender), method = "lm") +
facet_grid(gender~age_cat) +
labs(x = "Body Mass Index"~(kg/m^2), y = "Systolic BP (mmHg)")
```
Play with colors!
```{r}
ggplot(heart2, aes(x = BMXBMI, y = BPXSAR, colour = gender)) +
geom_point(alpha = .5) +
stat_smooth(method = "lm") +
facet_grid(gender~age_cat) +
theme_minimal() +
labs(x = "Body Mass Index"~(kg/m^2), y = "Systolic BP (mmHg)") +
scale_color_manual(values = c("#B47CC7", "#D65F5F"), guide = "none") # guide = "none" hides the legend (guide = FALSE is deprecated)
```
```{r}
my_colors <- c("#C4AD66", "#77BEDB")
ggplot(heart2, aes(x = BMXBMI, y = BPXSAR, colour = gender)) +
geom_point(alpha = .5) +
stat_smooth(method = "lm") +
facet_grid(gender~age_cat) +
labs(x = "Body Mass Index"~(kg/m^2), y = "Systolic BP (mmHg)") +
scale_color_manual(values = my_colors, guide = "none") # guide = "none" hides the legend (guide = FALSE is deprecated)
```