diff --git a/docs/no_toc/02-data-structures.md b/docs/no_toc/02-data-structures.md
index 3ded80e..164ca95 100644
--- a/docs/no_toc/02-data-structures.md
+++ b/docs/no_toc/02-data-structures.md
@@ -71,16 +71,14 @@ If you want to access everything but the first three elements of `chrNum`:
 
 
 ``` python
-chrNum[3:len(chrNum)]
+chrNum[3:]
 ```
 
 ```
 ## [2, 2]
 ```
 
-where `len(chrNum)` is the length of the list.
-
-When the start or stop index is *not* specified, it implies that you are subsetting starting the from the beginning of the list or subsetting to the end of the list, respectively:
+Here, the stop index number was not specified. When the start or stop index is *not* specified, it implies that you are subsetting starting from the beginning of the list or subsetting to the end of the list, respectively:
 
 
 ``` python
@@ -99,7 +97,7 @@ chrNum[3:]
 ## [2, 2]
 ```
 
-More discussion of list slicing can be found [here](https://stackoverflow.com/questions/509211/how-slicing-in-python-works).
+There are other popular uses of the slice operator `:`, such as negative indices to count from the end of a list, or subsetting with a fixed increment. You can find more discussion of list slicing [here](https://wesmckinney.com/book/python-builtin#list_slicing).
 
 ## Objects in Python
 
@@ -111,7 +109,7 @@ The list data structure has an organization and functionality that metaphoricall
 
 And if it "makes sense" to us, then it is well-designed.
 
-The list data structure we have been working with is an example of an **Object**. The definition of an object allows us to ask the questions above: what does it contain, and what can it do? It is an organizational tool for a collection of data and functions that we can relate to, like a physical object. Formally, an object contains the following:
+The list data structure we have been working with is an example of an **Object**. The definition of an object allows us to ask the questions above: *what does it contain, and what can it do?* It is an organizational tool for a collection of data and functions that we can relate to, like a physical object. Formally, an object contains the following:
 
 - **Value** that holds the essential data for the object.
 
@@ -337,7 +335,7 @@ Subset the second to fourth rows, and the first two columns:
 
 ![](images/pandas_subset_1.png)
 
-Now, back to `metadata` dataframe.
+Now, back to `metadata` dataframe:
 
 Subset the first 5 rows, and first two columns:
 
diff --git a/docs/no_toc/03-data-wrangling1.md b/docs/no_toc/03-data-wrangling1.md
index 263bbf1..e18c63b 100644
--- a/docs/no_toc/03-data-wrangling1.md
+++ b/docs/no_toc/03-data-wrangling1.md
@@ -4,7 +4,7 @@
 
 From our first two lessons, we are now equipped with enough fundamental programming skills to apply it to various steps in the data science workflow, which is a natural cycle that occurs in data analysis.
 
-![Data science workflow. Image source: [R for Data Science.](https://r4ds.hadley.nz/whole-game)](https://d33wubrfki0l68.cloudfront.net/571b056757d68e6df81a3e3853f54d3c76ad6efc/32d37/diagrams/data-science.png){alt="Data science workflow. Image source: R for Data Science." width="550"}
+![Data science workflow. Image source: R for Data Science.](https://d33wubrfki0l68.cloudfront.net/571b056757d68e6df81a3e3853f54d3c76ad6efc/32d37/diagrams/data-science.png){width="550"}
 
 For the rest of the course, we focus on *Transform* and *Visualize* with the assumption that our data is in a nice, "Tidy format". First, we need to understand what it means for a data to be "Tidy".
 
@@ -22,7 +22,7 @@ If you want to be technical about what variables and observations are, Hadley Wi
 
 > A **variable** contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An **observation** contains all values measured on the same unit (like a person, or a day, or a race) across attributes.
-![A tidy dataframe. Image source: [R for Data Science](https://r4ds.hadley.nz/data-tidy).](https://r4ds.hadley.nz/images/tidy-1.png){alt="A tidy dataframe. Image source: R for Data Science." width="800"}
+![A tidy dataframe. Image source: R for Data Science.](https://r4ds.hadley.nz/images/tidy-1.png){width="800"}
 
 ## Our working Tidy Data: DepMap Project
 
@@ -191,7 +191,7 @@ df
 ## 4 treated 7 32
 ```
 
-*"I want to subset for rows such that the status is "treated" and subset for columns status and age_case."*
+*"I want to subset for rows such that the status is"treated" and subset for columns status and age_case."*
 
 
 ``` python
@@ -210,14 +210,14 @@ df.loc[df.status == "treated", ["status", "age_case"]]
 
 Now that your Dataframe has be transformed based on your scientific question, you can start doing some analysis on it! A common data science task is to examine summary statistics of a dataset, which summarizes the all the values from a variable in a numeric summary, such as mean, median, or mode.
 
-If we look at the data structre of a Dataframe's column, it is called a Series. It has methods can compute summary statistics for us. Let's take a look at a few popular examples:
+If we look at the data structure of a Dataframe's column, it is actually not a List, but an object called Series. It has methods that can compute summary statistics for us. Let's take a look at a few popular examples:
 
-| Function method | What it takes in | What it does | Returns |
-|---------------|---------------|----------------------------|---------------|
-| `metadata.Age.mean()` | `metadata.Age` as a numeric value | Computes the mean value of the `Age` column. | Float (NumPy) |
-| `metadata['Age'].median()` | `metadata['Age']` as a numeric value | Computes the median value of the `Age` column. | Float (NumPy) |
-| `metadata.Age.max()` | `metadata.Age` as a numeric value | Computes the max value of the `Age` column. | Float (NumPy) |
-| `metadata.OncotreeSubtype.value_counts()` | `metadata.OncotreeSubtype` as a String | Creates a frequency table of all unique elements in `OncotreeSubtype` column. | Series |
+| Function method | What it takes in | What it does | Returns |
+|----------------|----------------|-------------------------|----------------|
+| `metadata.Age.mean()` | `metadata.Age` as a numeric Series | Computes the mean value of the `Age` column. | Float (NumPy) |
+| `metadata['Age'].median()` | `metadata['Age']` as a numeric Series | Computes the median value of the `Age` column. | Float (NumPy) |
+| `metadata.Age.max()` | `metadata.Age` as a numeric Series | Computes the max value of the `Age` column. | Float (NumPy) |
+| `metadata.OncotreeSubtype.value_counts()` | `metadata.OncotreeSubtype` as a string Series | Creates a frequency table of all unique elements in `OncotreeSubtype` column. | Series |
 
 Let's try it out, with some nice print formatting:
 
@@ -270,18 +270,16 @@ print("Frequency of column", metadata.OncotreeLineage.value_counts())
 ## Name: count, dtype: int64
 ```
 
-(Notice that the output of some of these methods are Float (NumPy). This refers to a Python Object called NumPy that is extremely popular for scientific computing, but we're not focused on that in this course.)
+Notice that the output of some of these methods is Float (NumPy). This refers to a Python Object called NumPy that is extremely popular for scientific computing, but we're not focused on that in this course.
 
 ## Simple data visualization
 
-We will dedicate extensive time later this course to talk about data visualization, but the Dataframe's column, Series, has a method called `.plot()` that can help us make plots. The `.plot()` method will default make a line plot, but it is not necessary the plot style we want, so we can give the optional argument `kind` a String value to specify the plot style. We use it for making a histogram.
+We will dedicate extensive time later in this course to talk about data visualization, but the Dataframe's column, Series, has a method called `.plot()` that can help us make simple plots for one variable. The `.plot()` method will by default make a line plot, but it is not necessarily the plot style we want, so we can give the optional argument `kind` a String value to specify the plot style. We use it for making a histogram or bar plot.
 
-The `.plot()` method also exists for Dataframes, in which you need to specify a plot using multiple columns. We use it for making a bar plot.
-
-| Plot style | Useful for | kind = | Code |
-|------------|------------|---------|--------------------------------------------------------------|
-| Histogram | Numerics | "hist" | `metadata.Age.plot(kind = "hist")` |
-| Bar plot | Strings | "bar" | `metadata.OncotreeSubtype.value_counts().plot(kind = "bar")` |
+| Plot style | Useful for | kind = | Code |
+|-----------|-----------|-----------|--------------------------------------|
+| Histogram | Numerics | "hist" | `metadata.Age.plot(kind = "hist")` |
+| Bar plot | Strings | "bar" | `metadata.OncotreeSubtype.value_counts().plot(kind = "bar")` |
 
 Let's look at a histogram:
 
@@ -307,7 +305,49 @@ plt.show()
 
 
 
-Notice here that we start with the column `metadata.OncotreeLineage`, and then we first use the method `.value_counts()` to get a *Dataframe* of a frequency table. Then, we take the frequency table Dataframe and use the `.plot()` method. It is quite common in Python to have multiple "chained" function calls, in which the output of `.value_counts()` is used for the input of `.plot()`. It takes a bit of time to get used to this!
+(The `plt.figure()` and `plt.show()` functions are used to render the plots on the website, but you don't need to use them for your exercises - yet. We will discuss this in more detail during our week of data visualization.)
+
+#### Chained function calls
+
+Let's look at our bar plot syntax more carefully. We start with the column `metadata.OncotreeLineage`, and then we first use the method `.value_counts()` to get a *Series* of a frequency table. Then, we take the frequency table Series and use the `.plot()` method.
+
+It is quite common in Python to have multiple "chained" function calls, in which the output of `.value_counts()` is used for the input of `.plot()` all in one line of code. It takes a bit of time to get used to this!
+
+Here's another example of a chained function call, which looks quite complex, but let's break it down:
+
+
+``` python
+plt.figure()
+
+metadata.loc[metadata.AgeCategory == "Adult", ].OncotreeLineage.value_counts().plot(kind="bar")
+
+plt.show()
+```
+
+
+
+1. We first take the entire `metadata` and do some subsetting, which outputs a Dataframe.
+2. We access the `OncotreeLineage` column, which outputs a Series.
+3. We use the method `.value_counts()`, which outputs a Series.
+4. We make a plot out of it!
+
+We could have, alternatively, done this in several lines of code:
+
+
+``` python
+plt.figure()
+
+metadata_subset = metadata.loc[metadata.AgeCategory == "Adult", ]
+metadata_subset_lineage = metadata_subset.OncotreeLineage
+lineage_freq = metadata_subset_lineage.value_counts()
+lineage_freq.plot(kind = "bar")
+
+plt.show()
+```
+
+
+
+These are two different *styles* of code, but they do the exact same thing. It's up to you to decide what is easier for you to understand.
 
 ## Exercises
 
diff --git a/docs/no_toc/About.md b/docs/no_toc/About.md
index 0cbea42..9eac26b 100644
--- a/docs/no_toc/About.md
+++ b/docs/no_toc/About.md
@@ -51,7 +51,7 @@ These credits are based on our [course contributors table guidelines](https://ww
 ## collate en_US.UTF-8
 ## ctype en_US.UTF-8
 ## tz Etc/UTC
-## date 2024-08-21
+## date 2024-08-22
 ## pandoc 3.1.1 @ /usr/local/bin/ (via rmarkdown)
 ##
 ## ─ Packages ───────────────────────────────────────────────────────────────────
diff --git a/docs/no_toc/about-the-authors.html b/docs/no_toc/about-the-authors.html
index fb30c02..2d70064 100644
--- a/docs/no_toc/about-the-authors.html
+++ b/docs/no_toc/about-the-authors.html
@@ -368,7 +368,7 @@
From our first two lessons, we are now equipped with enough fundamental programming skills to apply it to various steps in the data science workflow, which is a natural cycle that occurs in data analysis.
For the rest of the course, we focus on Transform and Visualize with the assumption that our data is in a nice, “Tidy format”. First, we need to understand what it means for a data to be “Tidy”.
Now that your Dataframe has be transformed based on your scientific question, you can start doing some analysis on it! A common data science task is to examine summary statistics of a dataset, which summarizes the all the values from a variable in a numeric summary, such as mean, median, or mode.
-If we look at the data structre of a Dataframe’s column, it is called a Series. It has methods can compute summary statistics for us. Let’s take a look at a few popular examples:
+If we look at the data structure of a Dataframe’s column, it is actually not a List, but an object called Series. It has methods that can compute summary statistics for us. Let’s take a look at a few popular examples:
metadata.Age.mean() |
-metadata.Age as a numeric value |
+metadata.Age as a numeric Series |
Computes the mean value of the Age column. |
Float (NumPy) |
metadata['Age'].median() |
-metadata['Age'] as a numeric value |
+metadata['Age'] as a numeric Series |
Computes the median value of the Age column. |
Float (NumPy) |
metadata.Age.max() |
-metadata.Age as a numeric value |
+metadata.Age as a numeric Series |
Computes the max value of the Age column. |
Float (NumPy) |
metadata.OncotreeSubtype.value_counts() |
-metadata.OncotreeSubtype as a String |
+metadata.OncotreeSubtype as a string Series |
Creates a frequency table of all unique elements in OncotreeSubtype column. |
Series |
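The summary methods in the table above can be tried on a small stand-in Dataframe. This is only a sketch: the `metadata` built here is a hypothetical four-row example with made-up values, not the DepMap `metadata` used in the course, and the variable names (`mean_age`, `median_age`, `max_age`, `subtype_counts`) are chosen purely for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the course's `metadata` Dataframe (values are made up).
metadata = pd.DataFrame({
    "Age": [60.0, 36.0, 72.0, 36.0],
    "OncotreeSubtype": ["Melanoma", "Glioblastoma", "Melanoma", "Melanoma"],
})

# A single column of a Dataframe is a Series, not a List.
print(type(metadata.Age))

# Attribute access and bracket access refer to the same column.
mean_age = metadata.Age.mean()
median_age = metadata["Age"].median()
max_age = metadata.Age.max()

# value_counts() returns another Series: a frequency table of unique values.
subtype_counts = metadata.OncotreeSubtype.value_counts()

print("Mean age:", mean_age)
print("Median age:", median_age)
print("Max age:", max_age)
print(subtype_counts)
```

Because `value_counts()` itself returns a Series, its result can be passed straight on to other Series methods, which is what makes the chained calls in the lesson possible.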