diff --git a/_episodes_rmd/10-data-organisation.Rmd b/_episodes_rmd/10-data-organisation.Rmd index 316ca5ae7..f5575b243 100644 --- a/_episodes_rmd/10-data-organisation.Rmd +++ b/_episodes_rmd/10-data-organisation.Rmd @@ -1,16 +1,16 @@ --- source: Rmd -title: "Data organisation with Spreadsheets" +title: "Data organisation with spreadsheets" teaching: 30 exercises: 30 questions: - "How to organise tabular data?" objectives: -- "Learn about spreadsheets, their strengths and weaknesses" +- "Learn about spreadsheets, their strengths and weaknesses." - "How do we format data in spreadsheets for effective data use?" - "Learn about common spreadsheet errors and how to correct them." -- "Organize your data according to tidy data principles." -- "Learn about text-based spreadsheet formats such as the comma-separated (CSV) or tab-separated formats." +- "Organise your data according to tidy data principles." +- "Learn about text-based spreadsheet formats such as the comma-separated (CSV) or tab-separated (TSV) formats." keypoints: - "Good data organization is the foundation of any research project." --- @@ -29,7 +29,7 @@ source("../bin/chunk-options.R") **Objective** - Describe best practices for organizing data so computers can make - the best use of data sets. + the best use of datasets. **Keypoint** @@ -101,7 +101,7 @@ ensure efficient downstream analysis. > frustrated or sad? {: .challenge} -### Problems with Spreadsheets +### Problems with spreadsheets Spreadsheets are good for data entry, but in reality we tend to use spreadsheet programs for much more than data entry. We use them @@ -126,7 +126,7 @@ command-line based statistics program like R or SAS, it’s practically impossible to apply a calculation to one observation in your dataset but not another unless you’re doing it on purpose. 
-### Using Spreadsheets for Data Entry and Cleaning +### Using spreadsheets for data entry and cleaning In this lesson, we will assume that you are most likely using Excel as your primary spreadsheet program - there are others (gnumeric, Calc @@ -161,7 +161,7 @@ In this lesson we're going to talk about: - Keep track of all of the steps you take to clean your data in a plain text file. -- Organize your data according to tidy data principles. +- Organise your data according to tidy data principles. The most common mistake made is treating spreadsheet programs like lab notebooks, that is, relying on context, notes in the margin, spatial @@ -171,7 +171,7 @@ the same way, and unless we explain to the computer what every single thing means (and that can be hard!), it will not be able to see how our data fits together. -Using the power of computers, we can manage and analyze data in much +Using the power of computers, we can manage and analyse data in much more effective and faster ways, but to use that power, we have to set up our data for the computer to be able to understand it (and computers are very literal). @@ -200,7 +200,7 @@ different from the one you started with. In order to be able to reproduce your analyses or figure out what you did when a reviewer or instructor asks for a different analysis, you should -- create a new file with your cleaned or analyzed data. Don't modify +- create a new file with your cleaned or analysed data. Don't modify the original dataset, or you will never know where you started! - keep track of the steps you took in your clean up or analysis. You @@ -260,9 +260,9 @@ used for variables** and **rows are used for observations**: - rows are observations - cells are individual values -> ## Challenge: We're going to take a messy data and describe how we would clean it up. +> ## Challenge: We're going to take a messy dataset and describe how we would clean it up. > -> 1. Download a messy data by clicking +> 1. 
Download a messy dataset by clicking > [here](https://github.com/UCLouvain-CBIO/WSBIM1207/raw/master/data/messy_covid.xlsx). > > 2. Open up the data in a spreadsheet program. @@ -303,7 +303,7 @@ wrong with this data and how you would fix it. > - How many A, AB, and B types have been tested? > - As above, but disregarding the contaminated samples? > - How many Rhesus + and - have been tested? -> - How many universal donors (0-) have been tested? +> - How many universal donors (O-) have been tested? > - What is the average weight of AB men? > - How many samples have been tested in the different hospitals? {: .challenge} @@ -311,7 +311,7 @@ wrong with this data and how you would fix it. An **excellent reference**, in particular with regard to R scripting is the *Tidy Data* paper @Wickham:2014. -## Common Spreadsheet Errors +## Common spreadsheet errors **Questions** @@ -320,7 +320,7 @@ is the *Tidy Data* paper @Wickham:2014. **Objectives** -- Recognize and resolve common spreadsheet formatting problems. +- Recognise and resolve common spreadsheet formatting problems. **Keypoints** @@ -384,7 +384,7 @@ samples. Other rows are similarly problematic. ### Using multiple tabs {#tabs} -But what about workbook tabs? That seems like an easy way to organize +But what about workbook tabs? That seems like an easy way to organise data, right? Well, yes and no. When you create extra tabs, you fail to allow the computer to see connections in the data that are there (you have to introduce spreadsheet application-specific functions or @@ -398,7 +398,7 @@ This isn't good practice for two reasons: in a new tab, and 2. even if you manage to prevent all inconsistencies from creeping in, - you will add an extra step for yourself before you analyze the data + you will add an extra step for yourself before you analyse the data because you will have to combine these data into a single datatable. 
You will have to explicitly tell the computer how to combine tabs - and if the tabs are inconsistently formatted, you @@ -408,7 +408,7 @@ The next time you’re entering data, and you go to create another tab or table, ask yourself if you could avoid adding this tab by adding another column to your original spreadsheet. We used multiple tabs in our example of a messy data file, but now you've seen how you can -reorganize your data to consolidate across tabs. +reorganise your data to consolidate across tabs. Your data sheet might get very long over the course of the experiment. This makes it harder to enter data if you can’t see your @@ -431,7 +431,7 @@ counted it. A blank cell means that it wasn't measured and the computer will interpret it as an unknown value (also known as a null or missing value). -The spreadsheets or statistical programs will likely mis-interpret +The spreadsheets or statistical programs will likely misinterpret blank cells that you intend to be zeros. By not entering the value of your observation, you are telling your computer to represent that data as unknown or missing (null). This can cause problems with subsequent @@ -450,7 +450,7 @@ missing data as nulls. **Solutions**: There are a few reasons why null values get represented differently -within a dataset. Sometimes confusing null values are automatically +within a dataset. Sometimes confusing null values are automatically recorded from the measuring device. If that's the case, there's not much you can do, but it can be addressed in data cleaning with a tool like @@ -466,9 +466,9 @@ different reasons. Whatever the reason, it's a problem if unknown or missing data is recorded as -999, 999, or 0. -Many statistical programs will not recognize that these are intended +Many statistical programs will not recognise that these are intended to represent missing (null) values. How these values are interpreted -will depend on the software you use to analyze your data. 
It is +will depend on the software you use to analyse your data. It is essential to use a clearly defined and consistent null indicator. Blanks (most applications) and NA (for R) are good @@ -499,7 +499,7 @@ for different software applications in their article: aesthetically pleasing can compromise your computer’s ability to see associations in the data. Merged cells will make your data unreadable by statistics software. Consider restructuring your data in such a way -that you will not need to merge cells to organize your data. +that you will not need to merge cells to organise your data. ### Placing comments or units in cells {#units} @@ -519,9 +519,9 @@ specify the units the cell is in. B+, A-, ... **Solution**: Don't include more than one piece of information in a -cell. This will limit the ways in which you can analyze your data. If +cell. This will limit the ways in which you can analyse your data. If you need both these measurements, design your data sheet to include -this information. For example, include one column the ABO group and +this information. For example, include one column for the ABO group and one for the Rhesus group. ### Using problematic field names {#field_name} @@ -560,15 +560,15 @@ other applications. **Solution**: This is a common strategy. For example, when writing longer text in a cell, people often include line breaks, em-dashes, -etc in their spreadsheet. Also, when copying data in from +etc. in their spreadsheet. Also, when copying data in from applications such as Word, formatting and fancy non-standard characters (such as left- and right-aligned quotation marks) are -included. When exporting this data into a coding/statistical +included. When exporting this data into a coding/statistical environment or into a relational database, dangerous things may occur, such as lines being cut in half and encoding errors being thrown. General best practice is to avoid adding characters such as newlines, -tabs, and vertical tabs. 
In other words, treat a text cell as if it +tabs, and vertical tabs. In other words, treat a text cell as if it were a simple web form that can only contain text and spaces. @@ -674,7 +674,7 @@ text files where the columns are separated by commas, hence 'comma separated values' or CSV. The advantage of a CSV file over an Excel/SPSS/etc. file is that we can open and read a CSV file using just about any software, including plain text editors like TextEdit or -NotePad. Data in a CSV file can also be easily imported into other +NotePad. Data in a CSV file can also be easily imported into other formats and environments, such as SQLite and R. We're not tied to a certain version of a certain expensive program when we work with CSV files, so it's a good format to work with for maximum portability and @@ -703,10 +703,10 @@ different worksheets in the `xls` documents. **But** -- some of these only work on Windows +- some of these only work on Windows. - this equates to replacing a (simple but manual) export to `csv` with - additional complexity/dependencies in the data analysis R code -- data formatting best practice still apply + additional complexity/dependencies in the data analysis R code. +- data formatting best practices still apply. - Is there really a good reason why `csv` (or similar) is not adequate? @@ -798,14 +798,14 @@ build relevant scripts. knitr::include_graphics("../fig/analysis.png") ``` -A typical data analysis worflow is illustrated in the figure above, -where data is repeatedly tranformed, visualised, modelled. This +A typical data analysis workflow is illustrated in the figure above, +where data is repeatedly transformed, visualised, and modelled. This iteration is repeated multiple times until the data is understood. In many real-life cases, however, most time is spent cleaning up and preparing the data, rather than actually analysing and understanding it.
An agile data analysis workflow, with several fast iterations of the -transform/visualise/model cycle is only feasible is the data is +transform/visualise/model cycle is only feasible if the data is formatted in a predictable way and one can reason about the data without having to look at it and/or fix it. diff --git a/_episodes_rmd/20-r-rstudio.Rmd b/_episodes_rmd/20-r-rstudio.Rmd index 92e370cbe..5ed765f72 100644 --- a/_episodes_rmd/20-r-rstudio.Rmd +++ b/_episodes_rmd/20-r-rstudio.Rmd @@ -7,7 +7,7 @@ questions: - "What are R and RStudio?" objectives: - "Describe the purpose of the RStudio Script, Console, Environment, and Plots panes." -- "Organize files and directories for a set of analyses as an R project, and understand the purpose of the working directory." +- "Organise files and directories for a set of analyses as an R project, and understand the purpose of the working directory." - "Use the built-in RStudio help interface to search for more information on R functions." - "Demonstrate how to provide sufficient information for troubleshooting with the R user community." keypoints: @@ -31,12 +31,12 @@ therefore both need to be installed on your computer. [^plainr]: As opposed to using R directly from the command line console. There exist other software that interface and integrate - with R, but RStudio is particularly well suited for beginners and + with R, but RStudio is particularly well suited for beginners while providing numerous very advanced features. The [RStudio IDE Cheat Sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/rstudio-ide.pdf) -provides much more information that will be covered here, but can be +provides much more information than will be covered here, but can be useful to learn keyboard shortcuts and discover new features. ## Why learn R? @@ -79,7 +79,7 @@ requirements. 
With 10000+ packages[^whatarepkgs] that can be installed to extend its capabilities, R provides a framework that allows you to combine statistical approaches from many scientific disciplines to best suit -the analytical framework you need to analyze your data. For instance, +the analytical framework you need to analyse your data. For instance, R has packages for image analysis, GIS, time series, population genetics, and a lot more. @@ -116,8 +116,8 @@ your data. Thousands of people use R daily. Many of them are willing to help you through mailing lists and websites such as [Stack Overflow](https://stackoverflow.com/), or on the [RStudio -community](https://community.rstudio.com/). These broad user community -extends to specialised areas such as bioinformatics. +community](https://community.rstudio.com/). These broad user communities +extend to specialised areas such as bioinformatics. ### Not only is R free, but it is also open-source and cross-platform @@ -138,7 +138,7 @@ The RStudio IDE is also available with a commercial license and priority email support from RStudio, Inc. We will use the RStudio IDE to write code, navigate the files on our -computer, inspect the variables we are going to create, and visualize +computer, inspect the variables we are going to create, and visualise the plots we will generate. RStudio can also be used for other things (e.g., version control, developing packages, writing Shiny apps) that we will not cover during the workshop. @@ -156,7 +156,7 @@ The RStudio window is divided into 4 "Panes": - your **Files/Plots/Packages/Help/Viewer** (bottom-right), and - the R **Console** (bottom-left). -The placement of these panes and their content can be customized (see +The placement of these panes and their content can be customised (see menu, `Tools -> Global Options -> Pane Layout`). One of the advantages of using RStudio is that all the information you @@ -202,7 +202,7 @@ for 'Save workspace to .RData' on exit. 
knitr::include_graphics("../fig/rstudio-preferences.png") ``` -To avoid [character encoding issue between Windows and other operating +To avoid [character encoding issues between Windows and other operating systems](https://yihui.name/en/2018/11/biggest-regret-knitr/), we are going to set UTF-8 by default: @@ -214,7 +214,7 @@ knitr::include_graphics("../fig/utf8.png") ### Organizing your working directory Using a consistent folder structure across your projects will help keep things -organized, and will also make it easy to find/file things in the future. This +organised, and will also make it easy to find/file things in the future. This can be especially helpful when you have multiple projects. In general, you may create directories (folders) for **scripts**, **data**, and **documents**. @@ -366,7 +366,7 @@ commands, but they will be forgotten when you close the session. Because we want our code and workflow to be reproducible, it is better to type the commands we want in the script editor, and save the script. This way, there is a complete record of what we did, and -anyone (including our future selves!) can easily replicate the +anyone (including our future selves!) can easily replicate the results on their computer. Note, however, that merely typing the commands in the script does not automatically *run* them - they still need to be sent to the console for execution. @@ -388,7 +388,7 @@ them directly in the console. RStudio provides the `Ctrl` + `1` and console panes. If R is ready to accept commands, the R console shows a `>` prompt. If -it receives a command (by typing, copy-pasting or sent from the script +it receives a command (by typing, copy-pasting or sending from the script editor using `Ctrl` + `Enter`), R will try to execute it, and when ready, will show the results and come back with a new `>` prompt to wait for new commands. @@ -514,7 +514,7 @@ If possible, try to reduce what doesn't work to a simple *reproducible example*. 
If you can reproduce the problem using a very small data frame instead of your 50000 rows and 10000 columns one, provide the small one with the description of your problem. When appropriate, try -to generalize what you are doing so even people who are not in your +to generalise what you are doing so even people who are not in your field can understand the question. For instance instead of using a subset of your real dataset, create a small (3 columns, 5 rows) generic one. For more information on how to write a reproducible @@ -600,7 +600,7 @@ sessionInfo() - [How to ask for R help](http://blog.revolutionanalytics.com/2014/01/how-to-ask-for-r-help.html) - useful guidelines + useful guidelines. - [This blog post by Jon Skeet](http://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-question/) @@ -649,7 +649,7 @@ namely `BiocManager`, that can be installed from CRAN with install.packages("BiocManager") ``` -Individuals package such as `SummarizedExperiment` (we will use it +Individual packages such as `SummarizedExperiment` (we will use it later), `DESeq2` (for RNA-Seq analysis), and many more can then be installed with `BiocManager::install`. diff --git a/_episodes_rmd/30-dplyr.Rmd b/_episodes_rmd/30-dplyr.Rmd index 997adb180..d3abf1a7e 100644 --- a/_episodes_rmd/30-dplyr.Rmd +++ b/_episodes_rmd/30-dplyr.Rmd @@ -1,6 +1,6 @@ --- source: Rmd -title: "Manipulating and analyzing data with dplyr" +title: "Manipulating and analysing data with dplyr" teaching: 75 exercises: 75 questions: @@ -25,7 +25,7 @@ download.file(url = "https://raw.githubusercontent.com/Bioconductor/bioconductor -## Data Manipulation using **`dplyr`** and **`tidyr`** +## Data manipulation using **`dplyr`** and **`tidyr`** Bracket subsetting is handy, but it can be cumbersome and difficult to read, especially for complicated operations. @@ -40,10 +40,10 @@ R session when you need it. - The package **`dplyr`** provides powerful tools for data manipulation tasks. 
It is built to work directly with data frames, with many manipulation tasks -optimized. +optimised. -- As we will see latter on, sometimes we want a data frame to be reshaped to be able -to do some specific analyses or for visualization. The package **`tidyr`** addresses +- As we will see later on, sometimes we want a data frame to be reshaped to be able +to do some specific analyses or for visualisation. The package **`tidyr`** addresses this common problem of reshaping data and provides tools for manipulating data in a tidy way. @@ -59,7 +59,7 @@ several useful packages for data analysis which work well together, such as **`tidyr`**, **`dplyr`**, **`ggplot2`**, **`tibble`**, etc. These packages help us to work and interact with the data. They allow us to do many things with your data, such as subsetting, transforming, -visualizing, etc. +visualising, etc. To install and load the **`tidyverse`** package type: @@ -105,7 +105,7 @@ We are now going to learn some of the most common **`dplyr`** functions: - `select()`: subset columns - `filter()`: subset rows on conditions - `mutate()`: create new columns by using information from other columns -- `group_by()` and `summarize()`: create summary statistics on grouped data +- `group_by()` and `summarise()`: create summary statistics on grouped data - `arrange()`: sort results - `count()`: count discrete values @@ -149,7 +149,7 @@ genes <- select(rna, gene, hsapiens_homolog_associated_gene_name) genes ``` -Some mouse gene have no human homologs. These can be retrieved using a -`filter()` and the `is.na()` function, that determines whether +Some mouse genes have no human homologs. These can be retrieved using +`filter()` and the `is.na()` function, which determines whether something is an `NA`. @@ -157,8 +157,8 @@ filter(genes, is.na(hsapiens_homolog_associated_gene_name)) ``` -If we want to keep only mouse gene that have a human homolog, we can -insert a "!" symbol that negates the result, so we're asking for +If we want to keep only mouse genes that have a human homolog, we can +insert a "!"
symbol that negates the result, so we're asking for every row where hsapiens_homolog_associated_gene_name *is not* an `NA`. @@ -248,7 +248,7 @@ rna3 > > Using pipes, subset the `rna` data to keep observations in female mice at time 0, > where the gene has an expression higher than 50000, and retain only the columns -> `gene`, `sample`, `time`, `expression` and `age` +> `gene`, `sample`, `time`, `expression` and `age`. > > > ## Solution > > @@ -296,7 +296,7 @@ rna %>% > criteria: contains only the `gene`, `chromosome_name`, > `phenotype_description`, `sample`, and `expression` columns. The expression > values should be log-transformed. This data frame must -> only contain genes located on sex chromosome, associated with a +> only contain genes located on sex chromosomes, associated with a > phenotype_description, and with a log expression higher than 5. > > **Hint**: think about how the commands should be ordered to produce @@ -349,9 +349,9 @@ Once the data has been grouped, subsequent operations will be applied on each group independently. -### The `summarize()` function +### The `summarise()` function -`group_by()` is often used together with `summarize()`, which +`group_by()` is often used together with `summarise()`, which collapses each group into a single-row summary of that group. `group_by()` takes as arguments the column names that contain the @@ -361,7 +361,7 @@ statistics. 
So to compute the mean `expression` by gene: ```{r} rna %>% group_by(gene) %>% - summarize(mean_expression = mean(expression)) + summarise(mean_expression = mean(expression)) ``` We could also want to calculate the mean expression levels of all genes in each sample: @@ -369,7 +369,7 @@ We could also want to calculate the mean expression levels of all genes in each ```{r} rna %>% group_by(sample) %>% - summarize(mean_expression = mean(expression)) + summarise(mean_expression = mean(expression)) ``` -But we can can also group by multiple columns: +But we can also group by multiple columns: @@ -377,17 +377,17 @@ ```{r} rna %>% group_by(gene, infection, time) %>% - summarize(mean_expression = mean(expression)) + summarise(mean_expression = mean(expression)) ``` -Once the data is grouped, you can also summarize multiple variables at the same +Once the data is grouped, you can also summarise multiple variables at the same time (and not necessarily on the same variable). For instance, we could add a column indicating the median `expression` by gene and by condition: ```{r, purl = FALSE} rna %>% group_by(gene, infection, time) %>% - summarize(mean_expression = mean(expression), + summarise(mean_expression = mean(expression), median_expression = median(expression)) ``` @@ -401,7 +401,7 @@ rna %>% > > rna %>% > > filter(gene == "Dok3") %>% > > group_by(time) %>% -> > summarize(mean = mean(expression)) +> > summarise(mean = mean(expression)) > > ``` > > > {: .solution} @@ -412,14 +412,14 @@ rna %>% When working with data, we often want to know the number of observations found for each factor or combination of factors. For this task, **`dplyr`** provides `count()`.
For example, if we wanted to count the number of rows of data for -each infected and non infected samples, we would do: +the infected and non-infected samples, we would do: ```{r, purl = FALSE} rna %>% count(infection) ``` The `count()` function is shorthand for something we've already seen: grouping by a variable, and summarising it by counting the number of observations in that group. In other words, `rna %>% count(infection)` is equivalent to: ```{r, purl = FALSE} rna %>% group_by(infection) %>% summarize(n = n()) ``` -Previous example shows the use of `count()` to count the number of rows/observations +The previous example shows the use of `count()` to count the number of rows/observations for *one* factor (i.e., `infection`). -If we wanted to count *combination of factors*, such as `infection` and `time`, +If we wanted to count a *combination of factors*, such as `infection` and `time`, we would specify the first and the second factor as the arguments of `count()`: ```{r purl = FALSE} rna %>% count(infection, time) ``` which is equivalent to this: ```{r purl = FALSE} rna %>% group_by(infection, time) %>% - summarize(n = n()) + summarise(n = n()) ``` It is sometimes useful to sort the result to facilitate the comparisons. @@ -478,9 +478,9 @@ rna %>% > ## Challenge > > 1. How many genes were analysed in each sample? -> 2. Use `group_by()` and `summarize()` to evaluate the sequencing depth (the sum of all counts) in each sample. Which sample has the highest sequencing depth? -> 3. Pick one sample and evaluate the number of genes by biotype -> 4. Identify genes associated with "abnormal DNA methylation" phenotype description, and calculate their mean expression (in log) at time 0, time 4 and time 8. +> 2.
Use `group_by()` and `summarise()` to evaluate the sequencing depth (the sum of all counts) in each sample. Which sample has the highest sequencing depth? +> 3. Pick one sample and evaluate the number of genes by biotype. +> 4. Identify genes associated with the "abnormal DNA methylation" phenotype description, and calculate their mean expression (in log) at time 0, time 4 and time 8. > > > ## Solution > > @@ -491,7 +491,7 @@ rna %>% > > ## 2. > > rna %>% > > group_by(sample) %>% -> > summarize(seq_depth = sum(expression)) %>% +> > summarise(seq_depth = sum(expression)) %>% > > arrange(desc(seq_depth)) > > ## 3. > > rna %>% @@ -503,7 +503,7 @@ rna %>% > > rna %>% > > filter(phenotype_description == "abnormal DNA methylation") %>% > > group_by(gene, time) %>% -> > summarize(mean_expression = mean(log(expression))) %>% +> > summarise(mean_expression = mean(log(expression))) %>% > > arrange() > > ``` > > @@ -517,7 +517,7 @@ In the `rna` tibble, the rows contain expression values (the unit) that are associated with a combination of 2 other variables: `gene` and `sample`. All the other columns correspond to variables describing either -the sample (organism, age, sex,...) or the gene (gene_biotype, ENTREZ_ID, product...). +the sample (organism, age, sex, ...) or the gene (gene_biotype, ENTREZ_ID, product, ...). The variables that don’t change with genes or with samples will have the same value in all the rows. @@ -536,7 +536,7 @@ a `wide-format` is preferred, as a more compact way of representing the data. This is typically the case with gene expression values that scientists are used to look as matrices, were rows represent genes and columns represent samples. -In this format, it would become therefore straightforward +In this format, it would therefore become straightforward to explore the relationship between the gene expression levels within, and between, the samples. @@ -566,8 +566,8 @@ details). 
### Pivoting the data into a wider format -Let's first select the 3 first columns of `rna` and use `pivot_wider()` -to transform data in a wide-format. +Let's select the first 3 columns of `rna` and use `pivot_wider()` +to transform the data into a wide-format. ```{r, purl = FALSE} rna_exp <- rna %>% @@ -617,7 +617,7 @@ rna_with_missing_values ``` By default, the `pivot_wider()` function will add `NA` for missing -values. This can be parametrised with the `values_fill` argument of +values. This can be parameterised with the `values_fill` argument of the `pivot_wider()` function. ```{r, purl = FALSE} @@ -654,7 +654,7 @@ associated with the column names. knitr::include_graphics("../fig/pivot_longer.png") ``` -To recreate `rna_long` from `rna_long` we would create a key +To recreate `rna_long` from `rna_wide` we would create a key called `sample` and value called `expression` and use all columns except `gene` for the key variable. Here we drop `gene` column with a minus sign. @@ -744,7 +744,7 @@ so every replicate has the same composition. > ```{r, echo = FALSE, message = FALSE} > knitr::include_graphics("../fig/Exercise_pivot_W.png") > ``` -> You will need to summarize before reshaping! +> You will need to summarise before reshaping! > > > ## Solution > > @@ -755,7 +755,7 @@ so every replicate has the same composition. > > rna %>% > > filter(chromosome_name == "Y" | chromosome_name == "X") %>% > > group_by(sex, chromosome_name) %>% -> > summarize(mean = mean(expression)) +> > summarise(mean = mean(expression)) > > ``` > > > > And pivot the table to wide format @@ -764,7 +764,7 @@ so every replicate has the same composition. > > rna_1 <- rna %>% > > filter(chromosome_name == "Y" | chromosome_name == "X") %>% > > group_by(sex, chromosome_name) %>% -> > summarize(mean = mean(expression)) %>% +> > summarise(mean = mean(expression)) %>% > > pivot_wider(names_from = sex, > > values_from = mean) > > @@ -778,7 +778,7 @@ so every replicate has the same composition. 
> > rna_1 %>% > > pivot_longer(names_to = "gender", > > values_to = "mean", -> > - chromosome_name) +> > -chromosome_name) > > > > ``` > > @@ -788,7 +788,7 @@ so every replicate has the same composition. > ## Question > -> Use the `rna` dataset to create an expression matrix were each row +> Use the `rna` dataset to create an expression matrix where each row > represents the mean expression levels of genes and columns represent > the different timepoints. > @@ -798,7 +798,7 @@ so every replicate has the same composition. > > ```{r} > > rna %>% > > group_by(gene, time) %>% -> > summarize(mean_exp = mean(expression)) +> > summarise(mean_exp = mean(expression)) > > ``` > > > > before using the pivot_wider() function @@ -806,7 +806,7 @@ so every replicate has the same composition. > > ```{r} > > rna_time <- rna %>% > > group_by(gene, time) %>% -> > summarize(mean_exp = mean(expression)) %>% +> > summarise(mean_exp = mean(expression)) %>% > > pivot_wider(names_from = time, > > values_from = mean_exp) > > rna_time @@ -819,7 +819,7 @@ so every replicate has the same composition. > > ```{r} > > rna %>% > > group_by(gene, time) %>% -> > summarize(mean_exp = mean(expression)) %>% +> > summarise(mean_exp = mean(expression)) %>% > > pivot_wider(names_from = time, > > values_from = mean_exp) %>% > > select(gene, 4) @@ -830,7 +830,7 @@ so every replicate has the same composition. > > ```{r} > > rna %>% > > group_by(gene, time) %>% -> > summarize(mean_exp = mean(expression)) %>% +> > summarise(mean_exp = mean(expression)) %>% > > pivot_wider(names_from = time, > > values_from = mean_exp) %>% > > select(gene, `4`) @@ -842,7 +842,7 @@ so every replicate has the same composition. 
> > ```{r} > > rna %>% > > group_by(gene, time) %>% -> > summarize(mean_exp = mean(expression)) %>% +> > summarise(mean_exp = mean(expression)) %>% > > pivot_wider(names_from = time, > > values_from = mean_exp) %>% > > rename("time0" = `0`, "time4" = `4`, "time8" = `8`) %>% @@ -860,7 +860,7 @@ so every replicate has the same composition. > Use the previous data frame containing mean expression levels per timepoint and create > a new column containing fold-changes between timepoint 8 and timepoint 0, and fold-changes > between timepoint 8 and timepoint 4. -> Convert this table in a long-format table gathering the foldchanges calculated. +> Convert this table into a long-format table gathering the fold-changes calculated. > > > ## Solution > > @@ -869,7 +869,7 @@ so every replicate has the same composition. > > rna_time > > ``` > > -> > Calculate FoldChanges: +> > Calculate fold-changes: > > > > ```{r} > > rna_time %>% @@ -932,10 +932,10 @@ annot1 ``` We now want to join these two tables into a single one containing all -variables using the `full_join` function from the `dplyr` package. The +variables using the `full_join()` function from the `dplyr` package. The function will automatically find the common variable to match columns from the first and second table. In this case, `gene` is the common -variable. Such variables are called keys. Keys are used to match +variable. Such variables are called keys. Keys are used to match observations across different tables. @@ -970,7 +970,7 @@ in the joined one. > Download the annot3 table by clicking > [here](https://raw.githubusercontent.com/aloriot/bioc-intro/main/_episodes_rmd/data/annot3.csv) > and put the table in your data/ repository. -> Using the `full_join` function, join tables `rna_mini` +> Using the `full_join()` function, join tables `rna_mini` > and `annot3`. What has happened for genes *Klk6*, *mt-Tf*, *mt-Rnr1*, *mt-Tv*, > *mt-Rnr2*, and *mt-Tl1* ? > @@ -983,7 +983,7 @@ in the joined one. 
> > -> > Genes *Klk6* is only present in `rna_mini`, while genes *mt-Tf*, *mt-Rnr1*, *mt-Tv*, -> > *mt-Rnr2*, and *mt-Tl1* are only present in `annot3` table.Their respective values for the +> > The gene *Klk6* is only present in `rna_mini`, while the genes *mt-Tf*, *mt-Rnr1*, *mt-Tv*, +> > *mt-Rnr2*, and *mt-Tl1* are only present in the `annot3` table. Their respective values for the > > variables of the table have been encoded as missing. > > > {: .solution} {: .challenge} @@ -992,7 +992,7 @@ in the joined one. ## Exporting data Now that you have learned how to use `dplyr` to extract information from -or summarize your raw data, you may want to export these new data sets to share +or summarise your raw data, you may want to export these new data sets to share them with your collaborators or for archival. Similar to the `read_csv()` function used for reading CSV files into R, there is