Fix typos in Chapters 1-3 #72

Merged 4 commits on Oct 7, 2022
70 changes: 35 additions & 35 deletions _episodes_rmd/10-data-organisation.Rmd
@@ -1,16 +1,16 @@
---
source: Rmd
title: "Data organisation with Spreadsheets"
title: "Data organisation with spreadsheets"
teaching: 30
exercises: 30
questions:
- "How to organise tabular data?"
objectives:
- "Learn about spreadsheets, their strengths and weaknesses"
- "Learn about spreadsheets, their strengths and weaknesses."
- "How do we format data in spreadsheets for effective data use?"
- "Learn about common spreadsheet errors and how to correct them."
- "Organize your data according to tidy data principles."
- "Learn about text-based spreadsheet formats such as the comma-separated (CSV) or tab-separated formats."
- "Organise your data according to tidy data principles."
- "Learn about text-based spreadsheet formats such as the comma-separated (CSV) or tab-separated (TSV) formats."
keypoints:
- "Good data organization is the foundation of any research project."
---
@@ -29,7 +29,7 @@ source("../bin/chunk-options.R")
**Objective**

- Describe best practices for organizing data so computers can make
the best use of data sets.
the best use of datasets.

**Keypoint**

@@ -101,7 +101,7 @@ ensure efficient downstream analysis.
> frustrated or sad?
{: .challenge}

### Problems with Spreadsheets
### Problems with spreadsheets

Spreadsheets are good for data entry, but in reality we tend to
use spreadsheet programs for much more than data entry. We use them
@@ -126,7 +126,7 @@ command-line based statistics program like R or SAS, it’s practically
impossible to apply a calculation to one observation in your
dataset but not another unless you’re doing it on purpose.

### Using Spreadsheets for Data Entry and Cleaning
### Using spreadsheets for data entry and cleaning

In this lesson, we will assume that you are most likely using Excel as
your primary spreadsheet program - there are others (gnumeric, Calc
@@ -161,7 +161,7 @@ In this lesson we're going to talk about:
- Keep track of all of the steps you take to clean your data in a
plain text file.

- Organize your data according to tidy data principles.
- Organise your data according to tidy data principles.

The most common mistake made is treating spreadsheet programs like lab
notebooks, that is, relying on context, notes in the margin, spatial
@@ -171,7 +171,7 @@ the same way, and unless we explain to the computer what every single
thing means (and that can be hard!), it will not be able to see how
our data fits together.

Using the power of computers, we can manage and analyze data in much
Using the power of computers, we can manage and analyse data in much
more effective and faster ways, but to use that power, we have to set
up our data for the computer to be able to understand it (and
computers are very literal).
@@ -200,7 +200,7 @@ different from the one you started with. In order to be able to
reproduce your analyses or figure out what you did when a reviewer or
instructor asks for a different analysis, you should

- create a new file with your cleaned or analyzed data. Don't modify
- create a new file with your cleaned or analysed data. Don't modify
the original dataset, or you will never know where you started!

- keep track of the steps you took in your clean up or analysis. You
@@ -260,9 +260,9 @@ used for variables** and **rows are used for observations**:
- rows are observations
- cells are individual values
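
As a minimal sketch of this layout, the data frame below uses made-up values and hypothetical column names (`patient`, `blood_group`, `sex`, `weight`); it is not taken from the messy dataset used in this lesson. Because each column holds one variable and each row one observation, typical questions reduce to one-line summaries.

```r
# Hypothetical tidy table: one row per tested sample, one column per variable
samples <- data.frame(
  patient     = c("p1", "p2", "p3", "p4"),
  blood_group = c("A", "AB", "O", "AB"),
  sex         = c("F", "M", "M", "M"),
  weight      = c(61, 80, 75, 90)
)

# Count how many samples fall in each blood group
group_counts <- table(samples$blood_group)

# Average weight of AB men: filter rows, then summarise one column
ab_men <- samples$blood_group == "AB" & samples$sex == "M"
mean_weight_ab_men <- mean(samples$weight[ab_men])  # 85 for these made-up values
```

The same questions would be much harder to answer if the groups were spread over several tables or encoded in cell colours.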

> ## Challenge: We're going to take a messy data and describe how we would clean it up.
> ## Challenge: We're going to take a messy dataset and describe how we would clean it up.
>
> 1. Download a messy data by clicking
> 1. Download a messy dataset by clicking
> [here](https://github.com/UCLouvain-CBIO/WSBIM1207/raw/master/data/messy_covid.xlsx).
>
> 2. Open up the data in a spreadsheet program.
@@ -303,15 +303,15 @@ wrong with this data and how you would fix it.
> - How many A, AB, and B types have been tested?
> - As above, but disregarding the contaminated samples?
> - How many Rhesus + and - have been tested?
> - How many universal donors (0-) have been tested?
> - How many universal donors (O-) have been tested?
> - What is the average weight of AB men?
> - How many samples have been tested in the different hospitals?
{: .challenge}

An **excellent reference**, in particular with regard to R scripting
is the *Tidy Data* paper @Wickham:2014.

## Common Spreadsheet Errors
## Common spreadsheet errors

**Questions**

@@ -320,7 +320,7 @@ is the *Tidy Data* paper @Wickham:2014.

**Objectives**

- Recognize and resolve common spreadsheet formatting problems.
- Recognise and resolve common spreadsheet formatting problems.

**Keypoints**

@@ -384,7 +384,7 @@ samples. Other rows are similarly problematic.

### Using multiple tabs {#tabs}

But what about workbook tabs? That seems like an easy way to organize
But what about workbook tabs? That seems like an easy way to organise
data, right? Well, yes and no. When you create extra tabs, you fail to
allow the computer to see connections in the data that are there (you
have to introduce spreadsheet application-specific functions or
@@ -398,7 +398,7 @@ This isn't good practice for two reasons:
in a new tab, and

2. even if you manage to prevent all inconsistencies from creeping in,
you will add an extra step for yourself before you analyze the data
you will add an extra step for yourself before you analyse the data
because you will have to combine these data into a single
datatable. You will have to explicitly tell the computer how to
combine tabs - and if the tabs are inconsistently formatted, you
@@ -408,7 +408,7 @@ The next time you’re entering data, and you go to create another tab
or table, ask yourself if you could avoid adding this tab by adding
another column to your original spreadsheet. We used multiple tabs in
our example of a messy data file, but now you've seen how you can
reorganize your data to consolidate across tabs.
reorganise your data to consolidate across tabs.
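
To illustrate the idea of replacing tabs with a column, here is a small sketch in R. The two data frames stand in for two hypothetical workbook tabs (one per hospital); the names `tab_hospital_a`, `tab_hospital_b` and `hospital` are assumptions made for this example only.

```r
# Two hypothetical tabs with identical columns, one per hospital
tab_hospital_a <- data.frame(sample = c("s1", "s2"), blood_group = c("A", "O"))
tab_hospital_b <- data.frame(sample = c("s3", "s4"), blood_group = c("B", "AB"))

# Record the information that the tab name carried as an explicit column
tab_hospital_a$hospital <- "Hospital A"
tab_hospital_b$hospital <- "Hospital B"

# Stack the rows into a single table; this only works cleanly because
# both tabs use the same column names in the same format
all_samples <- rbind(tab_hospital_a, tab_hospital_b)
```

Entering the data in a single sheet with a `hospital` column from the start would avoid this extra step altogether.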

Your data sheet might get very long over the course of the
experiment. This makes it harder to enter data if you can’t see your
@@ -431,7 +431,7 @@ counted it. A blank cell means that it wasn't measured and the
computer will interpret it as an unknown value (also known as a null
or missing value).

The spreadsheets or statistical programs will likely mis-interpret
The spreadsheets or statistical programs will likely misinterpret
blank cells that you intend to be zeros. By not entering the value of
your observation, you are telling your computer to represent that data
as unknown or missing (null). This can cause problems with subsequent
@@ -450,7 +450,7 @@ missing data as nulls.
**Solutions**:

There are a few reasons why null values get represented differently
within a dataset. Sometimes confusing null values are automatically
within a dataset. Sometimes confusing null values are automatically
recorded from the measuring device. If that's the case, there's not
much you can do, but it can be addressed in data cleaning with a tool
like
@@ -466,9 +466,9 @@ different reasons.
Whatever the reason, it's a problem if unknown or missing data is
recorded as -999, 999, or 0.

Many statistical programs will not recognize that these are intended
Many statistical programs will not recognise that these are intended
to represent missing (null) values. How these values are interpreted
will depend on the software you use to analyze your data. It is
will depend on the software you use to analyse your data. It is
essential to use a clearly defined and consistent null indicator.
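
The following sketch shows how a numeric placeholder such as -999 can distort a summary in R, and one way to declare it as missing at import time with the `na.strings` argument of `read.csv()`. The inline CSV text and the column names `sample` and `weight` are made up for illustration.

```r
# Hypothetical CSV content: -999 and a blank cell both stand for "not measured"
csv_text <- "sample,weight\ns1,72\ns2,-999\ns3,\ns4,80\n"

# Default import: -999 is read as a real measurement
raw <- read.csv(text = csv_text)
mean_raw <- mean(raw$weight, na.rm = TRUE)  # pulled far below any real weight

# Declaring the null indicators explicitly turns them into NA
clean <- read.csv(text = csv_text, na.strings = c("", "NA", "-999"))
mean_clean <- mean(clean$weight, na.rm = TRUE)  # mean of 72 and 80
```

Declaring placeholders at import only helps if the same indicator is used consistently throughout the file.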

Blanks (most applications) and NA (for R) are good
@@ -499,7 +499,7 @@ for different software applications in their article:
aesthetically pleasing can compromise your computer’s ability to see
associations in the data. Merged cells will make your data unreadable
by statistics software. Consider restructuring your data in such a way
that you will not need to merge cells to organize your data.
that you will not need to merge cells to organise your data.


### Placing comments or units in cells {#units}
@@ -519,9 +519,9 @@ specify the units the cell is in.
B+, A-, ...

**Solution**: Don't include more than one piece of information in a
cell. This will limit the ways in which you can analyze your data. If
cell. This will limit the ways in which you can analyse your data. If
you need both these measurements, design your data sheet to include
this information. For example, include one column the ABO group and
this information. For example, include one column for the ABO group and
one for the Rhesus group.
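
If a combined column already exists, it can be split during cleaning. The sketch below assumes a hypothetical `blood_type` column whose values always end in `+` or `-`; the column names `abo` and `rhesus` are chosen for this example.

```r
# Hypothetical data with two pieces of information in one cell
blood <- data.frame(
  sample     = 1:4,
  blood_type = c("A+", "O-", "AB+", "B-")
)

# ABO group: everything except the trailing + or -
blood$abo <- sub("[+-]$", "", blood$blood_type)

# Rhesus group: only the trailing + or -
blood$rhesus <- sub("^.*([+-])$", "\\1", blood$blood_type)
```

With separate columns, counting Rhesus-positive samples or filtering on the ABO group no longer requires any text parsing.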

### Using problematic field names {#field_name}
@@ -560,15 +560,15 @@ other applications.

**Solution**: This is a common strategy. For example, when writing
longer text in a cell, people often include line breaks, em-dashes,
etc in their spreadsheet. Also, when copying data in from
etc. in their spreadsheet. Also, when copying data in from
applications such as Word, formatting and fancy non-standard
characters (such as left- and right-aligned quotation marks) are
included. When exporting this data into a coding/statistical
included. When exporting this data into a coding/statistical
environment or into a relational database, dangerous things may occur,
such as lines being cut in half and encoding errors being thrown.

General best practice is to avoid adding characters such as newlines,
tabs, and vertical tabs. In other words, treat a text cell as if it
tabs, and vertical tabs. In other words, treat a text cell as if it
were a simple web form that can only contain text and spaces.
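
When such characters are already present in a text column, they can be normalised after import. This is a minimal sketch on a made-up character vector; it collapses every run of whitespace (including newlines and tabs) into a single space, which may or may not be what a given dataset needs.

```r
# Hypothetical free-text cells containing line breaks, tabs and padding
notes <- c("first line\r\nsecond line", "tab\tseparated text", "  padded  ")

# Replace each run of whitespace with one space, then trim both ends
cleaned <- trimws(gsub("[[:space:]]+", " ", notes))
```

This does not address encoding problems or typographic quotation marks, which need to be handled separately.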


@@ -674,7 +674,7 @@ text files where the columns are separated by commas, hence 'comma
separated values' or CSV. The advantage of a CSV file over an
Excel/SPSS/etc. file is that we can open and read a CSV file using
just about any software, including plain text editors like TextEdit or
NotePad. Data in a CSV file can also be easily imported into other
NotePad. Data in a CSV file can also be easily imported into other
formats and environments, such as SQLite and R. We're not tied to a
certain version of a certain expensive program when we work with CSV
files, so it's a good format to work with for maximum portability and
@@ -703,10 +703,10 @@ different worksheets in the `xls` documents.

**But**

- some of these only work on Windows
- some of these only work on Windows.
- this equates to replacing a (simple but manual) export to `csv` with
additional complexity/dependencies in the data analysis R code
- data formatting best practice still apply
additional complexity/dependencies in the data analysis R code.
- data formatting best practice still apply.
- Is there really a good reason why `csv` (or similar) is not
adequate?

@@ -798,14 +798,14 @@ build relevant scripts.
knitr::include_graphics("../fig/analysis.png")
```

A typical data analysis worflow is illustrated in the figure above,
where data is repeatedly tranformed, visualised, modelled. This
A typical data analysis workflow is illustrated in the figure above,
where data is repeatedly tranformed, visualised, and modelled. This
iteration is repeated multiple times until the data is understood. In
many real-life cases, however, most time is spent cleaning up and
preparing the data, rather than actually analysing and understanding
it.

An agile data analysis workflow, with several fast iterations of the
transform/visualise/model cycle is only feasible is the data is
transform/visualise/model cycle is only feasible if the data is
formatted in a predictable way and one can reason about the data
without having to look at it and/or fix it.
30 changes: 15 additions & 15 deletions _episodes_rmd/20-r-rstudio.Rmd
@@ -7,7 +7,7 @@ questions:
- "What are R and RStudio?"
objectives:
- "Describe the purpose of the RStudio Script, Console, Environment, and Plots panes."
- "Organize files and directories for a set of analyses as an R project, and understand the purpose of the working directory."
- "Organise files and directories for a set of analyses as an R project, and understand the purpose of the working directory."
- "Use the built-in RStudio help interface to search for more information on R functions."
- "Demonstrate how to provide sufficient information for troubleshooting with the R user community."
keypoints:
@@ -31,12 +31,12 @@ therefore both need to be installed on your computer.

[^plainr]: As opposed to using R directly from the command line
console. There exist other software that interface and integrate
with R, but RStudio is particularly well suited for beginners and
with R, but RStudio is particularly well suited for beginners
while providing numerous very advanced features.

The [RStudio IDE Cheat
Sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/rstudio-ide.pdf)
provides much more information that will be covered here, but can be
provides much more information than will be covered here, but can be
useful to learn keyboard shortcuts and discover new features.

## Why learn R?
@@ -79,7 +79,7 @@ requirements.
With 10000+ packages[^whatarepkgs] that can be installed to extend its
capabilities, R provides a framework that allows you to combine
statistical approaches from many scientific disciplines to best suit
the analytical framework you need to analyze your data. For instance,
the analytical framework you need to analyse your data. For instance,
R has packages for image analysis, GIS, time series, population
genetics, and a lot more.

@@ -116,8 +116,8 @@ your data.
Thousands of people use R daily. Many of them are willing to help you
through mailing lists and websites such as [Stack
Overflow](https://stackoverflow.com/), or on the [RStudio
community](https://community.rstudio.com/). These broad user community
extends to specialised areas such as bioinformatics.
community](https://community.rstudio.com/). These broad user communities
extend to specialised areas such as bioinformatics.


### Not only is R free, but it is also open-source and cross-platform
@@ -138,7 +138,7 @@ The RStudio IDE is also available with a commercial license and
priority email support from RStudio, Inc.

We will use the RStudio IDE to write code, navigate the files on our
computer, inspect the variables we are going to create, and visualize
computer, inspect the variables we are going to create, and visualise
the plots we will generate. RStudio can also be used for other things
(e.g., version control, developing packages, writing Shiny apps) that
we will not cover during the workshop.
@@ -156,7 +156,7 @@ The RStudio window is divided into 4 "Panes":
- your **Files/Plots/Packages/Help/Viewer** (bottom-right), and
- the R **Console** (bottom-left).

The placement of these panes and their content can be customized (see
The placement of these panes and their content can be customised (see
menu, `Tools -> Global Options -> Pane Layout`).

One of the advantages of using RStudio is that all the information you
@@ -202,7 +202,7 @@ for 'Save workspace to .RData' on exit.
knitr::include_graphics("../fig/rstudio-preferences.png")
```

To avoid [character encoding issue between Windows and other operating
To avoid [character encoding issues between Windows and other operating
systems](https://yihui.name/en/2018/11/biggest-regret-knitr/), we are
going to set UTF-8 by default:

@@ -214,7 +214,7 @@ knitr::include_graphics("../fig/utf8.png")
### Organizing your working directory

Using a consistent folder structure across your projects will help keep things
organized, and will also make it easy to find/file things in the future. This
organised, and will also make it easy to find/file things in the future. This
can be especially helpful when you have multiple projects. In general, you may
create directories (folders) for **scripts**, **data**, and **documents**.
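
Such a structure can also be created from R itself. The sketch below builds it inside a temporary directory so that it is safe to run anywhere; the project name `my_analysis` is a placeholder, and in practice you would use your own project folder instead of `tempdir()`.

```r
# Hypothetical project location; replace with your own project folder
project_dir <- file.path(tempdir(), "my_analysis")

# Create one sub-directory per kind of file
for (d in c("scripts", "data", "documents")) {
  dir.create(file.path(project_dir, d), recursive = TRUE, showWarnings = FALSE)
}

# Check what was created
created <- list.files(project_dir)
```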

@@ -366,7 +366,7 @@ commands, but they will be forgotten when you close the session.
Because we want our code and workflow to be reproducible, it is better
to type the commands we want in the script editor, and save the
script. This way, there is a complete record of what we did, and
anyone (including our future selves!) can easily replicate the
anyone (including our future selves!) can easily replicate the
results on their computer. Note, however, that merely typing the commands
in the script does not automatically *run* them - they still need to
be sent to the console for execution.
@@ -388,7 +388,7 @@ them directly in the console. RStudio provides the `Ctrl` + `1` and
console panes.

If R is ready to accept commands, the R console shows a `>` prompt. If
it receives a command (by typing, copy-pasting or sent from the script
it receives a command (by typing, copy-pasting or sending from the script
editor using `Ctrl` + `Enter`), R will try to execute it, and when
ready, will show the results and come back with a new `>` prompt to
wait for new commands.
@@ -514,7 +514,7 @@ If possible, try to reduce what doesn't work to a simple *reproducible
example*. If you can reproduce the problem using a very small data
frame instead of your 50000 rows and 10000 columns one, provide the
small one with the description of your problem. When appropriate, try
to generalize what you are doing so even people who are not in your
to generalise what you are doing so even people who are not in your
field can understand the question. For instance instead of using a
subset of your real dataset, create a small (3 columns, 5 rows)
generic one. For more information on how to write a reproducible
@@ -600,7 +600,7 @@ sessionInfo()

- [How to ask for R
help](http://blog.revolutionanalytics.com/2014/01/how-to-ask-for-r-help.html)
useful guidelines
useful guidelines.

- [This blog post by Jon
Skeet](http://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-question/)
@@ -649,7 +649,7 @@ namely `BiocManager`, that can be installed from CRAN with
install.packages("BiocManager")
```

Individuals package such as `SummarizedExperiment` (we will use it
Individual packages such as `SummarizedExperiment` (we will use it
later), `DESeq2` (for RNA-Seq analysis), and many more can then be
installed with `BiocManager::install`.
