updated complex variable creation
kstreet13 committed Sep 26, 2024
1 parent aec5528 commit 473134e
Showing 4 changed files with 153 additions and 16 deletions.
@@ -0,0 +1,12 @@
// Pandoc 2.9 adds attributes on both header and div. We remove the former (to
// be compatible with the behavior of Pandoc < 2.8).
document.addEventListener('DOMContentLoaded', function(e) {
var hs = document.querySelectorAll("div.section[class*='level'] > :first-child");
var i, h, a;
for (i = 0; i < hs.length; i++) {
h = hs[i];
if (!/^h[1-6]$/i.test(h.tagName)) continue; // it should be a header h1-h6
a = h.attributes;
while (a.length > 0) h.removeAttribute(a[0].name);
}
});
36 changes: 30 additions & 6 deletions website/static/slides/05-data-wrangling/slides.Rmd
@@ -1,7 +1,7 @@
---
title: "Week 5: Data Wrangling"
subtitle: "PM 566: Introduction to Health Data Science"
author: "George G. Vega Yon and Kelly Streets"
author: "George G. Vega Yon and Kelly Street"
# output:
# slidy_presentation:
# slide_level: 1
@@ -660,15 +660,16 @@ dat |>

Don't forget about loops! `for` loops and `sapply` may be slow on a dataset of this size, but they can be quite handy for creating variables that rely on complicated relationships between variables. Consider this a "brute force" approach. Vectorized methods will *almost always* be faster, but loops can be easier to conceptualize and, in rare cases, may be the only option.

Let's demonstrate this by creating a weird variable: `wind.temp`. This will take on 4 possible values, based on the temperature and wind speed: cool & still, cool & windy, warm & still, or warm & windy. We will split each variable based on their median value.
Consider the problem of creating a weird variable: `wind.temp`. This will take on 4 possible values, based on the temperature and wind speed: cool & still, cool & windy, warm & still, or warm & windy. We will split each variable at its median value. Note that the loop-based code on the next two slides is too slow to actually run on this large dataset.

---

## Complex variable creation (cont 1)

Here's how we would do that with the `sapply` function:
Here's how we would do that with the `sapply` function (and a custom, unnamed function):

```{r new-var-sapply}
```{r new-var-sapply, eval=FALSE}
# create the new variable one entry at a time
wind.temp <- sapply(1:nrow(dat), function(i){
if(is.na(dat$temp[i]) | is.na(dat$wind.sp[i])){
return(NA)
@@ -687,8 +688,8 @@ wind.temp <- sapply(1:nrow(dat), function(i){
}
}
})
head(wind.temp)
```

Check: what would we need to change to add this variable to our dataset?
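
The `sapply` pattern above can be tried at small scale. The following is a minimal sketch, assuming a made-up data frame `toy` (hypothetical values, not the course data) with the same `temp` and `wind.sp` columns. It also computes each median once, outside the function; the version above recomputes both medians for every row, which is likely one reason it is so slow. The helper names `med.temp`, `med.wind`, `temp.lab`, and `wind.lab` are introduced here for illustration only.

```r
# toy data with made-up values (illustration only, not the course dataset)
toy <- data.frame(
  temp    = c(10, 25, NA, 30, 15, 20),
  wind.sp = c(1, 5, 3, NA, 4, 2)
)

# compute each median once, rather than once per row
med.temp <- median(toy$temp, na.rm = TRUE)
med.wind <- median(toy$wind.sp, na.rm = TRUE)

# create the new variable one entry at a time
wind.temp <- sapply(seq_len(nrow(toy)), function(i){
  # a missing input gives a missing (character) result
  if(is.na(toy$temp[i]) || is.na(toy$wind.sp[i])){
    return(NA_character_)
  }
  # label each variable relative to its median, then combine the labels
  temp.lab <- if(toy$temp[i] <= med.temp) 'cool' else 'warm'
  wind.lab <- if(toy$wind.sp[i] <= med.wind) 'still' else 'windy'
  paste(temp.lab, wind.lab, sep = ' & ')
})
wind.temp
```

Returning `NA_character_` rather than a plain `NA` keeps every return value the same type, so `sapply` always simplifies to a character vector.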

---
@@ -700,7 +701,7 @@ Here's the code for doing that with a `for` loop:
```{r new-var-for-loop, eval=FALSE}
# initialize a variable of all missing values
wind.temp <- rep(NA, nrow(dat))
# fill in the values
# fill in the values one at a time
for(i in 1:nrow(dat)){
if(is.na(dat$temp[i]) | is.na(dat$wind.sp[i])){
next # leave wind.temp[i] as NA; return() is only valid inside a function
@@ -724,6 +725,29 @@ for(i in 1:nrow(dat)){

Check: why do we need to include `na.rm=TRUE` when calculating the medians?
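
As a hint for the check above, here is a small sketch using a made-up vector `x` (hypothetical values, not the course data) that shows how `median` behaves when a value is missing.

```r
# made-up values with one missing entry (illustration only)
x <- c(3, 1, NA, 2)

# without na.rm, a single missing value makes the median missing
med.with.na <- median(x)

# with na.rm = TRUE, missing values are dropped before computing
med.no.na <- median(x, na.rm = TRUE)

# a comparison against a missing median is itself missing,
# so it cannot be used as the condition of an if() statement
cmp <- 1 <= med.with.na

print(med.with.na)
print(med.no.na)
print(cmp)
```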

---

## Complex variable creation (cont 3)

Here's a simple vectorized approach that will actually run on a large dataset. This works for our current case, but it's still a brute force approach, because we had to specifically assign every possible value of our new variable. You can imagine that as the number of possible values increases, this code will get increasingly cumbersome.

```{r new-var-subset}
# initialize a variable of all missing values
wind.temp <- rep(NA, nrow(dat))
# assign every possible value by subsetting
wind.temp[dat$temp <= median(dat$temp, na.rm=TRUE) &
dat$wind.sp <= median(dat$wind.sp, na.rm=TRUE)] <- 'cool & still'
wind.temp[dat$temp <= median(dat$temp, na.rm=TRUE) &
dat$wind.sp > median(dat$wind.sp, na.rm=TRUE)] <- 'cool & windy'
wind.temp[dat$temp > median(dat$temp, na.rm=TRUE) &
dat$wind.sp <= median(dat$wind.sp, na.rm=TRUE)] <- 'warm & still'
wind.temp[dat$temp > median(dat$temp, na.rm=TRUE) &
dat$wind.sp > median(dat$wind.sp, na.rm=TRUE)] <- 'warm & windy'
head(wind.temp)
```
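
If the number of categories grows, one alternative is to label each variable separately and then combine the labels, so that the code grows with the number of variables rather than the number of combinations. This is a sketch using base R's `ifelse` and `paste`, on a made-up data frame `toy` (hypothetical values, not the course data); the names `temp.lab` and `wind.lab` are introduced here for illustration only.

```r
# toy data with made-up values (illustration only, not the course dataset)
toy <- data.frame(
  temp    = c(10, 25, NA, 30, 15, 20),
  wind.sp = c(1, 5, 3, NA, 4, 2)
)

# label each variable relative to its own median (NA stays NA)
temp.lab <- ifelse(toy$temp <= median(toy$temp, na.rm = TRUE), 'cool', 'warm')
wind.lab <- ifelse(toy$wind.sp <= median(toy$wind.sp, na.rm = TRUE), 'still', 'windy')

# combine the labels; paste() would turn NA into the string "NA",
# so set the result to missing wherever either label is missing
wind.temp <- ifelse(is.na(temp.lab) | is.na(wind.lab),
                    NA_character_,
                    paste(temp.lab, wind.lab, sep = ' & '))
wind.temp
```

Adding a third category to either variable would only change the line that builds that variable's label; the combining step stays the same.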


---

## Merging data
121 changes: 111 additions & 10 deletions website/static/slides/05-data-wrangling/slides.html
@@ -3,9 +3,8 @@
<head>
<title>Week 5: Data Wrangling</title>
<meta charset="utf-8" />
<meta name="author" content="George G. Vega Yon and Kelly Streets" />
<meta name="date" content="2021-09-23" />
<script src="libs/header-attrs-2.27/header-attrs.js"></script>
<meta name="author" content="George G. Vega Yon and Kelly Street" />
<script src="libs/header-attrs-2.28/header-attrs.js"></script>
<link href="libs/remark-css-0.0.1/default.css" rel="stylesheet" />
<link rel="stylesheet" href="theme.css" type="text/css" />
</head>
@@ -20,10 +19,7 @@
## PM 566: Introduction to Health Data Science
]
.author[
### George G. Vega Yon and Kelly Streets
]
.date[
### 2021-09-23
### George G. Vega Yon and Kelly Street
]

---
@@ -64,7 +60,7 @@

## Disclaimer

There's a lot of extraneous information in these slides! While the `data.table` package and Python both have a lot of useful functionality, we strongly recommend sticking to the base R and `tidyverse` tools presented here. Slides covering material outside this scope will be marked with an asterisk (`*`).
There's a lot of extraneous information in these slides! While the `data.table` package and Python both have a lot of useful functionality, we strongly recommend sticking to the base R and `tidyverse` tools presented here. Slides covering material outside this scope will be marked with an asterisk (`*`); you should be extremely cautious about using code from those slides!

---

@@ -66992,8 +66988,11 @@


``` r
# Data.table
dat &lt;- dat |&gt; select(USAFID, WBAN, year, month, day, hour, min, lat, lon, elev, wind.sp, temp, atm.press)
# select only the relevant variables
dat &lt;- dat |&gt;
select(USAFID, WBAN, year, month, day,
hour, min, lat, lon, elev,
wind.sp, temp, atm.press)
```

---
@@ -96114,6 +96113,108 @@
## 4 2.374970 5.495707 249.2077
```

---

## Complex variable creation

Don't forget about loops! `for` loops and `sapply` may be slow on a dataset of this size, but they can be quite handy for creating variables that rely on complicated relationships between variables. Consider this a "brute force" approach. Vectorized methods will *almost always* be faster, but loops can be easier to conceptualize and, in rare cases, may be the only option.

Consider the problem of creating a weird variable: `wind.temp`. This will take on 4 possible values, based on the temperature and wind speed: cool &amp; still, cool &amp; windy, warm &amp; still, or warm &amp; windy. We will split each variable at its median value. Note that the loop-based code on the next two slides is too slow to actually run on this large dataset.

---

## Complex variable creation (cont 1)

Here's how we would do that with the `sapply` function (and a custom, unnamed function):


``` r
# create the new variable one entry at a time
wind.temp &lt;- sapply(1:nrow(dat), function(i){
if(is.na(dat$temp[i]) | is.na(dat$wind.sp[i])){
return(NA)
}
if(dat$temp[i] &lt;= median(dat$temp, na.rm=TRUE)){
if(dat$wind.sp[i] &lt;= median(dat$wind.sp, na.rm=TRUE)){
return('cool &amp; still')
}else{
return('cool &amp; windy')
}
}else{
if(dat$wind.sp[i] &lt;= median(dat$wind.sp, na.rm=TRUE)){
return('warm &amp; still')
}else{
return('warm &amp; windy')
}
}
})
```

Check: what would we need to change to add this variable to our dataset?

---

## Complex variable creation (cont 2)

Here's the code for doing that with a `for` loop:


``` r
# initialize a variable of all missing values
wind.temp &lt;- rep(NA, nrow(dat))
# fill in the values one at a time
for(i in 1:nrow(dat)){
if(is.na(dat$temp[i]) | is.na(dat$wind.sp[i])){
next # leave wind.temp[i] as NA; return() is only valid inside a function
}else{
if(dat$temp[i] &lt;= median(dat$temp, na.rm=TRUE)){
if(dat$wind.sp[i] &lt;= median(dat$wind.sp, na.rm=TRUE)){
wind.temp[i] &lt;- 'cool &amp; still'
}else{
wind.temp[i] &lt;- 'cool &amp; windy'
}
}else{
if(dat$wind.sp[i] &lt;= median(dat$wind.sp, na.rm=TRUE)){
wind.temp[i] &lt;- 'warm &amp; still'
}else{
wind.temp[i] &lt;- 'warm &amp; windy'
}
}
}
}
```

Check: why do we need to include `na.rm=TRUE` when calculating the medians?

---

## Complex variable creation (cont 3)

Here's a simple vectorized approach that will actually run on a large dataset. This works for our current case, but it's still a brute force approach, because we had to specifically assign every possible value of our new variable. You can imagine that as the number of possible values increases, this code will get increasingly cumbersome.


``` r
# initialize a variable of all missing values
wind.temp &lt;- rep(NA, nrow(dat))
# assign every possible value by subsetting
wind.temp[dat$temp &lt;= median(dat$temp, na.rm=TRUE) &amp;
dat$wind.sp &lt;= median(dat$wind.sp, na.rm=TRUE)] &lt;- 'cool &amp; still'
wind.temp[dat$temp &lt;= median(dat$temp, na.rm=TRUE) &amp;
dat$wind.sp &gt; median(dat$wind.sp, na.rm=TRUE)] &lt;- 'cool &amp; windy'
wind.temp[dat$temp &gt; median(dat$temp, na.rm=TRUE) &amp;
dat$wind.sp &lt;= median(dat$wind.sp, na.rm=TRUE)] &lt;- 'warm &amp; still'
wind.temp[dat$temp &gt; median(dat$temp, na.rm=TRUE) &amp;
dat$wind.sp &gt; median(dat$wind.sp, na.rm=TRUE)] &lt;- 'warm &amp; windy'

head(wind.temp)
```

```
## [1] "warm &amp; windy" "warm &amp; windy" "warm &amp; windy" "warm &amp; windy" "warm &amp; still"
## [6] "warm &amp; still"
```


---

## Merging data