carpentries-incubator · lgatto · Oct 7, 2022 · Oct 4, 2022 · Oct 4, 2022 · Oct 4, 2022
diff --git a/_episodes_rmd/10-data-organisation.Rmd b/_episodes_rmd/10-data-organisation.Rmd
@@ -1,16 +1,16 @@
 ---
 source: Rmd
-title: "Data organisation with Spreadsheets"
+title: "Data organisation with spreadsheets"
 teaching: 30
 exercises: 30
 questions:
 - "How to organise tabular data?"
 objectives:
-- "Learn about spreadsheets, their strengths and weaknesses"
+- "Learn about spreadsheets, their strengths and weaknesses."
 - "How do we format data in spreadsheets for effective data use?"
 - "Learn about common spreadsheet errors and how to correct them."
-- "Organize your data according to tidy data principles."
-- "Learn about text-based spreadsheet formats such as the comma-separated (CSV) or tab-separated formats."
+- "Organise your data according to tidy data principles."
+- "Learn about text-based spreadsheet formats such as the comma-separated (CSV) or tab-separated (TSV) formats."
 keypoints:
-- "Good data organization is the foundation of any research project."
+- "Good data organisation is the foundation of any research project."
 ---
@@ -29,7 +29,7 @@ source("../bin/chunk-options.R")
 **Objective**
 
-- Describe best practices for organizing data so computers can make
+- Describe best practices for organising data so computers can make
-  the best use of data sets.
+  the best use of datasets.
 
 **Keypoint**
 
@@ -101,7 +101,7 @@ ensure efficient downstream analysis.
 >   frustrated or sad?
 {: .challenge}
 
-### Problems with Spreadsheets
+### Problems with spreadsheets
 
 Spreadsheets are good for data entry, but in reality we tend to
 use spreadsheet programs for much more than data entry. We use them
@@ -126,7 +126,7 @@ command-line based statistics program like R or SAS, it’s practically
 impossible to apply a calculation to one observation in your
 dataset but not another unless you’re doing it on purpose.
 
-### Using Spreadsheets for Data Entry and Cleaning
+### Using spreadsheets for data entry and cleaning
 
 In this lesson, we will assume that you are most likely using Excel as
 your primary spreadsheet program - there are others (gnumeric, Calc
@@ -161,7 +161,7 @@ In this lesson we're going to talk about:
 - Keep track of all of the steps you take to clean your data in a
   plain text file.
 
-- Organize your data according to tidy data principles.
+- Organise your data according to tidy data principles.
 
 The most common mistake made is treating spreadsheet programs like lab
 notebooks, that is, relying on context, notes in the margin, spatial
@@ -171,7 +171,7 @@ the same way, and unless we explain to the computer what every single
 thing means (and that can be hard!), it will not be able to see how
 our data fits together.
 
-Using the power of computers, we can manage and analyze data in much
+Using the power of computers, we can manage and analyse data in much
 more effective and faster ways, but to use that power, we have to set
 up our data for the computer to be able to understand it (and
 computers are very literal).
@@ -200,7 +200,7 @@ different from the one you started with. In order to be able to
 reproduce your analyses or figure out what you did when a reviewer or
 instructor asks for a different analysis, you should
 
-- create a new file with your cleaned or analyzed data. Don't modify
+- create a new file with your cleaned or analysed data. Don't modify
   the original dataset, or you will never know where you started!
 
 - keep track of the steps you took in your clean up or analysis. You
@@ -260,9 +260,9 @@ used for variables** and **rows are used for observations**:
 - rows are observations
 - cells are individual values
 
-> ## Challenge: We're going to take a messy data and describe how we would clean it up.
+> ## Challenge: We're going to take a messy dataset and describe how we would clean it up.
 >
-> 1. Download a messy data by clicking
+> 1. Download a messy dataset by clicking
 >    [here](https://github.com/UCLouvain-CBIO/WSBIM1207/raw/master/data/messy_covid.xlsx).
 >
 > 2. Open up the data in a spreadsheet program.
@@ -303,15 +303,15 @@ wrong with this data and how you would fix it.
 > - How many A, AB, and B types have been tested?
 > - As above, but disregarding the contaminated samples?
 > - How many Rhesus + and - have been tested?
-> - How many universal donors (0-) have been tested?
+> - How many universal donors (O-) have been tested?
 > - What is the average weight of AB men?
 > - How many samples have been tested in the different hospitals?
 {: .challenge}
 
 An **excellent reference**, in particular with regard to R scripting
 is the *Tidy Data* paper @Wickham:2014.
 
-## Common Spreadsheet Errors
+## Common spreadsheet errors
 
 **Questions**
 
@@ -320,7 +320,7 @@ is the *Tidy Data* paper @Wickham:2014.
 
 **Objectives**
 
-- Recognize and resolve common spreadsheet formatting problems.
+- Recognise and resolve common spreadsheet formatting problems.
 
 **Keypoints**
 
@@ -384,7 +384,7 @@ samples. Other rows are similarly problematic.
 
 ### Using multiple tabs {#tabs}
 
-But what about workbook tabs? That seems like an easy way to organize
+But what about workbook tabs? That seems like an easy way to organise
 data, right? Well, yes and no. When you create extra tabs, you fail to
 allow the computer to see connections in the data that are there (you
 have to introduce spreadsheet application-specific functions or
@@ -398,7 +398,7 @@ This isn't good practice for two reasons:
    in a new tab, and
 
 2. even if you manage to prevent all inconsistencies from creeping in,
-   you will add an extra step for yourself before you analyze the data
+   you will add an extra step for yourself before you analyse the data
    because you will have to combine these data into a single
    datatable. You will have to explicitly tell the computer how to
    combine tabs - and if the tabs are inconsistently formatted, you
@@ -408,7 +408,7 @@ The next time you’re entering data, and you go to create another tab
 or table, ask yourself if you could avoid adding this tab by adding
 another column to your original spreadsheet. We used multiple tabs in
 our example of a messy data file, but now you've seen how you can
-reorganize your data to consolidate across tabs.
+reorganise your data to consolidate across tabs.
 
 Your data sheet might get very long over the course of the
 experiment. This makes it harder to enter data if you can’t see your
@@ -431,7 +431,7 @@ counted it. A blank cell means that it wasn't measured and the
 computer will interpret it as an unknown value (also known as a null
 or missing value).
 
-The spreadsheets or statistical programs will likely mis-interpret
+Spreadsheets or statistical programs will likely misinterpret
 blank cells that you intend to be zeros. By not entering the value of
 your observation, you are telling your computer to represent that data
 as unknown or missing (null). This can cause problems with subsequent
@@ -450,7 +450,7 @@ missing data as nulls.
 **Solutions**:
 
 There are a few reasons why null values get represented differently
-within a dataset.  Sometimes confusing null values are automatically
+within a dataset. Sometimes confusing null values are automatically
 recorded from the measuring device. If that's the case, there's not
 much you can do, but it can be addressed in data cleaning with a tool
 like
@@ -466,9 +466,9 @@ different reasons.
 Whatever the reason, it's a problem if unknown or missing data is
 recorded as -999, 999, or 0.
 
-Many statistical programs will not recognize that these are intended
+Many statistical programs will not recognise that these are intended
 to represent missing (null) values. How these values are interpreted
-will depend on the software you use to analyze your data. It is
+will depend on the software you use to analyse your data. It is
 essential to use a clearly defined and consistent null indicator.
 
 Blanks (most applications) and NA (for R) are good
@@ -499,7 +499,7 @@ for different software applications in their article:
 aesthetically pleasing can compromise your computer’s ability to see
 associations in the data. Merged cells will make your data unreadable
 by statistics software. Consider restructuring your data in such a way
-that you will not need to merge cells to organize your data.
+that you will not need to merge cells to organise your data.
 
 
 ### Placing comments or units in cells {#units}
@@ -519,9 +519,9 @@ specify the units the cell is in.
 B+, A-, ...
 
 **Solution**: Don't include more than one piece of information in a
-cell. This will limit the ways in which you can analyze your data.  If
+cell. Doing so will limit the ways in which you can analyse your data. If
 you need both these measurements, design your data sheet to include
-this information. For example, include one column the ABO group and
+this information. For example, include one column for the ABO group and
 one for the Rhesus group.
 
 ### Using problematic field names {#field_name}
@@ -560,15 +560,15 @@ other applications.
 
 **Solution**: This is a common strategy. For example, when writing
 longer text in a cell, people often include line breaks, em-dashes,
-etc in their spreadsheet.  Also, when copying data in from
+etc. in their spreadsheet. Also, when copying data in from
 applications such as Word, formatting and fancy non-standard
 characters (such as left- and right-aligned quotation marks) are
-included.  When exporting this data into a coding/statistical
+included. When exporting this data into a coding/statistical
 environment or into a relational database, dangerous things may occur,
 such as lines being cut in half and encoding errors being thrown.
 
 General best practice is to avoid adding characters such as newlines,
-tabs, and vertical tabs.  In other words, treat a text cell as if it
+tabs, and vertical tabs. In other words, treat a text cell as if it
 were a simple web form that can only contain text and spaces.
 
 
@@ -674,7 +674,7 @@ text files where the columns are separated by commas, hence 'comma
 separated values' or CSV. The advantage of a CSV file over an
 Excel/SPSS/etc. file is that we can open and read a CSV file using
 just about any software, including plain text editors like TextEdit or
-NotePad.  Data in a CSV file can also be easily imported into other
+Notepad. Data in a CSV file can also be easily imported into other
 formats and environments, such as SQLite and R. We're not tied to a
 certain version of a certain expensive program when we work with CSV
 files, so it's a good format to work with for maximum portability and
@@ -703,10 +703,10 @@ different worksheets in the `xls` documents.
 
 **But**
 
-- some of these only work on Windows
+- some of these only work on Windows.
 - this equates to replacing a (simple but manual) export to `csv` with
-  additional complexity/dependencies in the data analysis R code
-- data formatting best practice still apply
+  additional complexity/dependencies in the data analysis R code.
+- data formatting best practices still apply.
 - Is there really a good reason why `csv` (or similar) is not
   adequate?
 
@@ -798,14 +798,14 @@ build relevant scripts.
 knitr::include_graphics("../fig/analysis.png")
 ```
 
-A typical data analysis worflow is illustrated in the figure above,
-where data is repeatedly tranformed, visualised, modelled. This
+A typical data analysis workflow is illustrated in the figure above,
+where data is repeatedly transformed, visualised, and modelled. This
 iteration is repeated multiple times until the data is understood. In
 many real-life cases, however, most time is spent cleaning up and
 preparing the data, rather than actually analysing and understanding
 it.
 
 An agile data analysis workflow, with several fast iterations of the
-transform/visualise/model cycle is only feasible is the data is
+transform/visualise/model cycle, is only feasible if the data is
 formatted in a predictable way and one can reason about the data
 without having to look at it and/or fix it.
diff --git a/_episodes_rmd/20-r-rstudio.Rmd b/_episodes_rmd/20-r-rstudio.Rmd
@@ -7,7 +7,7 @@ questions:
 - "What are R and RStudio?"
 objectives:
 - "Describe the purpose of the RStudio Script, Console, Environment, and Plots panes."
-- "Organize files and directories for a set of analyses as an R project, and understand the purpose of the working directory."
+- "Organise files and directories for a set of analyses as an R project, and understand the purpose of the working directory."
 - "Use the built-in RStudio help interface to search for more information on R functions."
 - "Demonstrate how to provide sufficient information for troubleshooting with the R user community."
 keypoints:
@@ -31,12 +31,12 @@ therefore both need to be installed on your computer.
 
 [^plainr]: As opposed to using R directly from the command line
     console. There exist other software that interface and integrate
-    with R, but RStudio is particularly well suited for beginners and
+    with R, but RStudio is particularly well suited for beginners
     while providing numerous very advanced features.
 
 The [RStudio IDE Cheat
 Sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/rstudio-ide.pdf)
-provides much more information that will be covered here, but can be
+provides much more information than will be covered here, but can be
 useful to learn keyboard shortcuts and discover new features.
 
 ## Why learn R?
@@ -79,7 +79,7 @@ requirements.
 With 10000+ packages[^whatarepkgs] that can be installed to extend its
 capabilities, R provides a framework that allows you to combine
 statistical approaches from many scientific disciplines to best suit
-the analytical framework you need to analyze your data. For instance,
+the analytical framework you need to analyse your data. For instance,
 R has packages for image analysis, GIS, time series, population
 genetics, and a lot more.
 
@@ -116,8 +116,8 @@ your data.
 Thousands of people use R daily. Many of them are willing to help you
 through mailing lists and websites such as [Stack
 Overflow](https://stackoverflow.com/), or on the [RStudio
-community](https://community.rstudio.com/). These broad user community
-extends to specialised areas such as bioinformatics.
+community](https://community.rstudio.com/). These broad user communities
+extend to specialised areas such as bioinformatics.
 
 
 ### Not only is R free, but it is also open-source and cross-platform
@@ -138,7 +138,7 @@ The RStudio IDE is also available with a commercial license and
 priority email support from RStudio, Inc.
 
 We will use the RStudio IDE to write code, navigate the files on our
-computer, inspect the variables we are going to create, and visualize
+computer, inspect the variables we are going to create, and visualise
 the plots we will generate. RStudio can also be used for other things
 (e.g., version control, developing packages, writing Shiny apps) that
 we will not cover during the workshop.
@@ -156,7 +156,7 @@ The RStudio window is divided into 4 "Panes":
 - your **Files/Plots/Packages/Help/Viewer** (bottom-right), and
 - the R **Console** (bottom-left).
 
-The placement of these panes and their content can be customized (see
+The placement of these panes and their content can be customised (see
 menu, `Tools -> Global Options -> Pane Layout`).
 
 One of the advantages of using RStudio is that all the information you
@@ -202,7 +202,7 @@ for 'Save workspace to .RData' on exit.
 knitr::include_graphics("../fig/rstudio-preferences.png")
 ```
 
-To avoid [character encoding issue between Windows and other operating
+To avoid [character encoding issues between Windows and other operating
 systems](https://yihui.name/en/2018/11/biggest-regret-knitr/), we are
 going to set UTF-8 by default:
 
@@ -214,7 +214,7 @@ knitr::include_graphics("../fig/utf8.png")
-### Organizing your working directory
+### Organising your working directory
 
 Using a consistent folder structure across your projects will help keep things
-organized, and will also make it easy to find/file things in the future. This
+organised, and will also make it easy to find/file things in the future. This
 can be especially helpful when you have multiple projects. In general, you may
 create directories (folders) for **scripts**, **data**, and **documents**.
 
@@ -366,7 +366,7 @@ commands, but they will be forgotten when you close the session.
 Because we want our code and workflow to be reproducible, it is better
 to type the commands we want in the script editor, and save the
 script. This way, there is a complete record of what we did, and
-anyone (including our future selves!)  can easily replicate the
+anyone (including our future selves!) can easily replicate the
 results on their computer. Note, however, that merely typing the commands
 in the script does not automatically *run* them - they still need to
 be sent to the console for execution.
@@ -388,7 +388,7 @@ them directly in the console.  RStudio provides the `Ctrl` + `1` and
 console panes.
 
 If R is ready to accept commands, the R console shows a `>` prompt. If
-it receives a command (by typing, copy-pasting or sent from the script
+it receives a command (by typing, copy-pasting or sending from the script
 editor using `Ctrl` + `Enter`), R will try to execute it, and when
 ready, will show the results and come back with a new `>` prompt to
 wait for new commands.
@@ -514,7 +514,7 @@ If possible, try to reduce what doesn't work to a simple *reproducible
 example*. If you can reproduce the problem using a very small data
 frame instead of your 50000 rows and 10000 columns one, provide the
 small one with the description of your problem. When appropriate, try
-to generalize what you are doing so even people who are not in your
+to generalise what you are doing so even people who are not in your
 field can understand the question. For instance instead of using a
 subset of your real dataset, create a small (3 columns, 5 rows)
 generic one. For more information on how to write a reproducible
@@ -600,7 +600,7 @@ sessionInfo()
 
 - [How to ask for R
   help](http://blog.revolutionanalytics.com/2014/01/how-to-ask-for-r-help.html)
-  useful guidelines
+  useful guidelines.
 
 - [This blog post by Jon
   Skeet](http://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-question/)
@@ -649,7 +649,7 @@ namely `BiocManager`, that can be installed from CRAN with
 install.packages("BiocManager")
 ```
 
-Individuals package such as `SummarizedExperiment` (we will use it
+Individual packages such as `SummarizedExperiment` (we will use it
 later), `DESeq2` (for RNA-Seq analysis), and many more can then be
 installed with `BiocManager::install`.
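
The installation step described in the last hunk could be sketched as follows, assuming the packages named there are the ones wanted. `missing_packages()` and `install_bioc_if_missing()` are hypothetical helpers introduced only for illustration; `requireNamespace()`, `install.packages()` and `BiocManager::install()` are the existing functions they rely on.

```r
# Hypothetical helper (not part of BiocManager): return the packages
# in `pkgs` that cannot be loaded from the current R library.
missing_packages <- function(pkgs) {
    installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
    pkgs[!installed]
}

# Hypothetical helper: install BiocManager from CRAN if needed, then
# install whichever of `pkgs` is still missing. Returns the names of
# the packages it tried to install, invisibly.
install_bioc_if_missing <- function(pkgs) {
    to_install <- missing_packages(pkgs)
    if (length(to_install) > 0) {
        if (!requireNamespace("BiocManager", quietly = TRUE))
            install.packages("BiocManager")
        BiocManager::install(to_install)
    }
    invisible(to_install)
}

# Example call (needs network access, so it is left commented out):
# install_bioc_if_missing(c("SummarizedExperiment", "DESeq2"))
```

Checking what is already installed first avoids reinstalling packages each time a script is run.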