Render toc-less

fhdsl · Aug 22, 2024 · 0d647cf · 0d647cf
1 parent 0ea5529
commit 0d647cf

Showing 10 changed files with 121 additions and 55 deletions.
diff --git a/docs/no_toc/02-data-structures.md b/docs/no_toc/02-data-structures.md
@@ -71,16 +71,14 @@ If you want to access everything but the first three elements of `chrNum`:
 
 
 ``` python
-chrNum[3:len(chrNum)]
+chrNum[3:]
 ```
 
 ```
 ## [2, 2]
 ```
 
-where `len(chrNum)` is the length of the list.
-
-When the start or stop index is *not* specified, it implies that you are subsetting starting the from the beginning of the list or subsetting to the end of the list, respectively:
+Here, the stop index was not specified. When the start or stop index is *not* specified, you are subsetting from the beginning of the list or to the end of the list, respectively:
 
 
 ``` python
@@ -99,7 +97,7 @@ chrNum[3:]
 ## [2, 2]
 ```
 
-More discussion of list slicing can be found [here](https://stackoverflow.com/questions/509211/how-slicing-in-python-works).
+There are other popular uses of the slice operator `:`, such as negative indices to count from the end of a list, or subsetting with a fixed increment. You can find more discussion of list slicing [here](https://wesmckinney.com/book/python-builtin#list_slicing).
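
As a minimal sketch of the slice forms mentioned in this hunk, the example below uses a made-up list named `values` (an assumption for illustration, not the lesson's `chrNum`) to show an omitted stop, an omitted start, a negative index, and a fixed increment:

```python
# Hypothetical list for illustration; this is not the lesson's `chrNum`.
values = [10, 20, 30, 40, 50, 60]

rest = values[3:]               # stop omitted: everything from index 3 to the end
first_three = values[:3]        # start omitted: everything before index 3
last_two = values[-2:]          # negative start: count from the end of the list
every_other = values[::2]       # third number is the increment (step)
reversed_values = values[::-1]  # a step of -1 walks the list backwards

print(rest)             # [40, 50, 60]
print(first_three)      # [10, 20, 30]
print(last_two)         # [50, 60]
print(every_other)      # [10, 30, 50]
print(reversed_values)  # [60, 50, 40, 30, 20, 10]
```

Slicing always returns a new list, so `values` itself is unchanged by any of these lines.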
 
 ## Objects in Python
 
@@ -111,7 +109,7 @@ The list data structure has an organization and functionality that metaphoricall
 
 And if it "makes sense" to us, then it is well-designed.
 
-The list data structure we have been working with is an example of an **Object**. The definition of an object allows us to ask the questions above: what does it contain, and what can it do? It is an organizational tool for a collection of data and functions that we can relate to, like a physical object. Formally, an object contains the following:
+The list data structure we have been working with is an example of an **Object**. The definition of an object allows us to ask the questions above: *what does it contain, and what can it do?* It is an organizational tool for a collection of data and functions that we can relate to, like a physical object. Formally, an object contains the following:
 
 -   **Value** that holds the essential data for the object.
 
@@ -337,7 +335,7 @@ Subset the second to fourth rows, and the first two columns:
 
 ![](images/pandas_subset_1.png)
 
-Now, back to `metadata` dataframe.
+Now, back to `metadata` dataframe:
 
 Subset the first 5 rows, and first two columns:
 

diff --git a/docs/no_toc/03-data-wrangling1.md b/docs/no_toc/03-data-wrangling1.md
@@ -4,7 +4,7 @@
 
 From our first two lessons, we are now equipped with enough fundamental programming skills to apply them to various steps in the data science workflow, which is a natural cycle that occurs in data analysis.
 
-![Data science workflow. Image source: [R for Data Science.](https://r4ds.hadley.nz/whole-game)](https://d33wubrfki0l68.cloudfront.net/571b056757d68e6df81a3e3853f54d3c76ad6efc/32d37/diagrams/data-science.png){alt="Data science workflow. Image source: R for Data Science." width="550"}
+![Data science workflow. Image source: R for Data Science.](https://d33wubrfki0l68.cloudfront.net/571b056757d68e6df81a3e3853f54d3c76ad6efc/32d37/diagrams/data-science.png){width="550"}
 
 For the rest of the course, we focus on *Transform* and *Visualize* with the assumption that our data is in a nice, "Tidy format". First, we need to understand what it means for data to be "Tidy".
 
@@ -22,7 +22,7 @@ If you want to be technical about what variables and observations are, Hadley Wi
 
 > A **variable** contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An **observation** contains all values measured on the same unit (like a person, or a day, or a race) across attributes.
 
-![A tidy dataframe. Image source: [R for Data Science](https://r4ds.hadley.nz/data-tidy).](https://r4ds.hadley.nz/images/tidy-1.png){alt="A tidy dataframe. Image source: R for Data Science." width="800"}
+![A tidy dataframe. Image source: R for Data Science.](https://r4ds.hadley.nz/images/tidy-1.png){width="800"}
 
 ## Our working Tidy Data: DepMap Project
 
@@ -191,7 +191,7 @@ df
 ## 4     treated         7           32
 ```
 
-*"I want to subset for rows such that the status is "treated" and subset for columns status and age_case."*
+*"I want to subset for rows such that the status is "treated" and subset for columns status and age_case."*
 
 
 ``` python
@@ -210,14 +210,14 @@ df.loc[df.status == "treated", ["status", "age_case"]]
 
 Now that your Dataframe has been transformed based on your scientific question, you can start doing some analysis on it! A common data science task is to examine summary statistics of a dataset, which summarize all the values of a variable in a single number, such as the mean, median, or mode.
 
-If we look at the data structre of a Dataframe's column, it is called a Series. It has methods can compute summary statistics for us. Let's take a look at a few popular examples:
+If we look at the data structure of a Dataframe's column, it is actually not a List, but an object called Series. It has methods that can compute summary statistics for us. Let's take a look at a few popular examples:
 
-| Function method                           | What it takes in                       | What it does                                                                  | Returns       |
-|---------------|---------------|----------------------------|---------------|
-| `metadata.Age.mean()`                     | `metadata.Age` as a numeric value      | Computes the mean value of the `Age` column.                                  | Float (NumPy) |
-| `metadata['Age'].median()`                | `metadata['Age']` as a numeric value   | Computes the median value of the `Age` column.                                | Float (NumPy) |
-| `metadata.Age.max()`                      | `metadata.Age` as a numeric value      | Computes the max value of the `Age` column.                                   | Float (NumPy) |
-| `metadata.OncotreeSubtype.value_counts()` | `metadata.OncotreeSubtype` as a String | Creates a frequency table of all unique elements in `OncotreeSubtype` column. | Series        |
+| Function method                           | What it takes in                              | What it does                                                                  | Returns       |
+|----------------|----------------|-------------------------|----------------|
+| `metadata.Age.mean()`                     | `metadata.Age` as a numeric Series            | Computes the mean value of the `Age` column.                                  | Float (NumPy) |
+| `metadata['Age'].median()`                | `metadata['Age']` as a numeric Series         | Computes the median value of the `Age` column.                                | Float (NumPy) |
+| `metadata.Age.max()`                      | `metadata.Age` as a numeric Series            | Computes the max value of the `Age` column.                                   | Float (NumPy) |
+| `metadata.OncotreeSubtype.value_counts()` | `metadata.OncotreeSubtype` as a string Series | Creates a frequency table of all unique elements in `OncotreeSubtype` column. | Series        |
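
The methods in this table can be tried on a small stand-in Dataframe. The sketch below assumes pandas is installed and uses a hypothetical `toy` Dataframe with invented values (it is not the course's `metadata`), to show that a column is a Series and what each method returns:

```python
import pandas as pd

# Hypothetical data for illustration only; not the course's `metadata`.
toy = pd.DataFrame({
    "Age": [30.0, 45.0, 60.0, 45.0],
    "Subtype": ["A", "B", "A", "A"],
})

age = toy["Age"]           # selecting one column gives a Series, not a List
print(type(age).__name__)  # Series

mean_age = age.mean()      # 45.0
median_age = age.median()  # 45.0
max_age = age.max()        # 60.0

# value_counts() returns another Series: unique values as the index,
# their frequencies as the values.
counts = toy["Subtype"].value_counts()
print(counts["A"], counts["B"])  # 3 1
```

Because `value_counts()` itself returns a Series, further Series methods can be applied to its result.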
 
 Let's try it out, with some nice print formatting:
 
@@ -270,18 +270,16 @@ print("Frequency of column", metadata.OncotreeLineage.value_counts())
 ## Name: count, dtype: int64
 ```
 
-(Notice that the output of some of these methods are Float (NumPy). This refers to a Python Object called NumPy that is extremely popular for scientific computing, but we're not focused on that in this course.)
+Notice that the output of some of these methods is Float (NumPy). This refers to NumPy, a Python package that is extremely popular for scientific computing, but we're not focused on that in this course.
 
 ## Simple data visualization
 
-We will dedicate extensive time later this course to talk about data visualization, but the Dataframe's column, Series, has a method called `.plot()` that can help us make plots. The `.plot()` method will default make a line plot, but it is not necessary the plot style we want, so we can give the optional argument `kind` a String value to specify the plot style. We use it for making a histogram.
+We will dedicate extensive time later in this course to data visualization, but the Dataframe's column, a Series, has a method called `.plot()` that can help us make simple plots for one variable. By default, the `.plot()` method makes a line plot, which is not necessarily the plot style we want, so we can give the optional argument `kind` a String value to specify the plot style. We use it for making a histogram or bar plot.
 
-The `.plot()` method also exists for Dataframes, in which you need to specify a plot using multiple columns. We use it for making a bar plot.
-
-| Plot style | Useful for | kind =  | Code                                                         |
-|------------|------------|---------|--------------------------------------------------------------|
-| Histogram  | Numerics   | "hist"  | `metadata.Age.plot(kind = "hist")`                           |
-| Bar plot   | Strings    | "bar"   | `metadata.OncotreeSubtype.value_counts().plot(kind = "bar")` |
+| Plot style | Useful for | kind = | Code                                                         |
+|-----------|-----------|-----------|--------------------------------------|
+| Histogram  | Numerics   | "hist" | `metadata.Age.plot(kind = "hist")`                           |
+| Bar plot   | Strings    | "bar"  | `metadata.OncotreeSubtype.value_counts().plot(kind = "bar")` |
 
 Let's look at a histogram:
 
@@ -307,7 +305,49 @@ plt.show()
 
 <img src="resources/images/03-data-wrangling1_files/figure-html/unnamed-chunk-12-3.png" width="672" />
 
-Notice here that we start with the column `metadata.OncotreeLineage`, and then we first use the method `.value_counts()` to get a *Dataframe* of a frequency table. Then, we take the frequency table Dataframe and use the `.plot()` method. It is quite common in Python to have multiple "chained" function calls, in which the output of `.value_counts()` is used for the input of `.plot()`. It takes a bit of time to get used to this!
+(The `plt.figure()` and `plt.show()` functions are used to render the plots on the website, but you don't need to use them for your exercises yet. We will discuss this in more detail during our week of data visualization.)
+
+#### Chained function calls
+
+Let's look at our bar plot syntax more carefully. We start with the column `metadata.OncotreeLineage`, and then we first use the method `.value_counts()` to get a *Series* of a frequency table. Then, we take the frequency table Series and use the `.plot()` method.
+
+It is quite common in Python to have multiple "chained" function calls, in which the output of `.value_counts()` is used for the input of `.plot()` all in one line of code. It takes a bit of time to get used to this!
+
+Here's another example of a chained function call, which looks quite complex, but let's break it down:
+
+
+``` python
+plt.figure()
+
+metadata.loc[metadata.AgeCategory == "Adult", ].OncotreeLineage.value_counts().plot(kind="bar")
+
+plt.show()
+```
+
+<img src="resources/images/03-data-wrangling1_files/figure-html/unnamed-chunk-13-5.png" width="672" />
+
+1.  We first take the entire `metadata` and do some subsetting, which outputs a Dataframe.
+2.  We access the `OncotreeLineage` column, which outputs a Series.
+3.  We use the method `.value_counts()`, which outputs a Series.
+4.  We make a plot out of it!
+
+We could have, alternatively, done this in several lines of code:
+
+
+``` python
+plt.figure()
+
+metadata_subset = metadata.loc[metadata.AgeCategory == "Adult", ]
+metadata_subset_lineage = metadata_subset.OncotreeLineage
+lineage_freq = metadata_subset_lineage.value_counts()
+lineage_freq.plot(kind = "bar")
+
+plt.show()
+```
+
+<img src="resources/images/03-data-wrangling1_files/figure-html/unnamed-chunk-14-7.png" width="672" />
+
+These are two different *styles* of code, but they do the exact same thing. It's up to you to decide what is easier for you to understand.
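
To check that the two styles agree, the comparison can be run without plotting. This is a minimal sketch, assuming pandas is installed and using a hypothetical `toy` Dataframe with invented rows in place of `metadata`; it stops at the frequency table so no plotting library is needed:

```python
import pandas as pd

# Hypothetical data for illustration only; not the course's `metadata`.
toy = pd.DataFrame({
    "AgeCategory": ["Adult", "Adult", "Pediatric", "Adult"],
    "OncotreeLineage": ["Lung", "Skin", "Lung", "Lung"],
})

# Style 1: one chained expression.
chained = toy.loc[toy.AgeCategory == "Adult"].OncotreeLineage.value_counts()

# Style 2: the same steps, one per line, with named intermediate results.
subset = toy.loc[toy.AgeCategory == "Adult"]  # Dataframe
lineage = subset.OncotreeLineage              # Series
stepwise = lineage.value_counts()             # Series (frequency table)

same = chained.equals(stepwise)
print(same)  # True
```

Naming the intermediate results, as in the second style, also makes it easy to print each one and inspect its type while debugging.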
 
 ## Exercises
 

diff --git a/docs/no_toc/About.md b/docs/no_toc/About.md
@@ -51,7 +51,7 @@ These credits are based on our [course contributors table guidelines](https://ww
 ##  collate  en_US.UTF-8
 ##  ctype    en_US.UTF-8
 ##  tz       Etc/UTC
-##  date     2024-08-21
+##  date     2024-08-22
 ##  pandoc   3.1.1 @ /usr/local/bin/ (via rmarkdown)
 ## 
 ## ─ Packages ───────────────────────────────────────────────────────────────────

diff --git a/docs/no_toc/about-the-authors.html b/docs/no_toc/about-the-authors.html
@@ -368,7 +368,7 @@ <h1>About the Authors<a href="about-the-authors.html#about-the-authors" class="a
 ##  collate  en_US.UTF-8
 ##  ctype    en_US.UTF-8
 ##  tz       Etc/UTC
-##  date     2024-08-21
+##  date     2024-08-22
 ##  pandoc   3.1.1 @ /usr/local/bin/ (via rmarkdown)
 ## 
 ## ─ Packages ───────────────────────────────────────────────────────────────────