forked from compgenomr/book
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path09-chip-seq-analysis.Rmd
2734 lines (2107 loc) · 113 KB
/
09-chip-seq-analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# ChIP-seq analysis {#chipseq}
```{r setup_chip_seq, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
message = FALSE,
error = FALSE,
cache = TRUE,
warning = FALSE,
out.width = "50%",
fig.width = 5,
fig.align = 'center')
```
*Chapter Author*: **Vedran Franke**
Protein-DNA interactions are responsible for a large part of the gene expression regulation. Proteins such as transcription factors as well as histones are directly related to how much and in which contexts the genes are expressed. Some of these concepts are already introduced in Chapter \@ref(intro) if readers need a more in-depth introduction. In this chapter, we will introduce how to process and analyze ChIP-seq data in order to identify genome-wide protein binding sites and to discover underlying sequence context via transcription factor binding-site motifs.
## Regulatory protein-DNA interactions
One of the most fascinating biological phenomena is the fact that a myriad of different cell types, in a multicellular organism, are encoded by one single genome. How exactly this is achieved is still a
major unanswered question in biology.
Cell types differ based on a multitude of features:
their size, shape, mobility, surface receptors, metabolic content.
However, the main predominant feature, which influences all of the above, is which
genes are expressed in each cell type.
Therefore, if we can understand what controls which
genes will be expressed, and where they will be expressed,
we can start forming a picture of how a single genomic template,
can give rise to a complex organism.
As explained in Chapter \@ref(intro), gene expression is controlled by a special class of genes called
transcription factors - genes which control other genes.
Transcription factor genes encode proteins which
can bind to the DNA, and control whether a certain part of DNA will be
transcribed (expressed), or stay silent (repressed).
They program the expression patterns in each cell.
Transcription factors contain DNA binding domains, which are specifically folded protein sequences
which recognize specific DNA motifs (a short nucleotide sequence).
Such sequence binding imparts transcription factors with specificity,
transcription factors do not bind everywhere on the DNA, rather they are localized to
short stretches which contain the corresponding DNA motif.
DNA in the nucleus is wrapped around a protein complex called the histone complex.
Histones form a chain of beads along the DNA. By changing their position, histones can make
certain parts of the DNA more or less accessible to transcription
factors. Histone complexes can be chemically modified with different post-translational modifications (see Chapter \@ref(intro)). Such modifications change histone
mobility, and their interactions with different proteins, thereby creating
an additional regulatory layer on top of the DNA sequence.
In order to understand the target genes of a certain transcription factor,
and how they control the gene expression, we need to know where on the DNA the
transcription factor is located.
## Measuring protein-DNA interactions with ChIP-seq
ChIP-seq stands for chromatin immunoprecipitation followed by sequencing, and is an experimental method for finding locations on DNA which are bound by proteins. It has been extensively used to study
in-vivo binding preferences of transcription factors, and genomic distribution of modified histones.
In the remainder of this chapter, you will learn how to assess quality control
of ChIP-seq data sets, perform peak calling to find bound regions, and
assess the quality of the peak calling.
Once you have obtained peaks, you will learn how to perform sequence analysis
to construct motif models, and compare signals between experiments.
Biological experiments often contain a multitude of consecutive steps. Each
step can profoundly influence the quality of the data, and the subsequent analysis.
The computational biologist has to have an in-depth knowledge of the experimental
design, and the underlying experimental steps, in order to choose the proper tools
and the type of analysis, which will give proper and correct results [@kharchenko_2008; @kidder_2011; @landt_2012; @chen_2012; @felsani_2015].
In this chapter we will go through the main experimental steps in the
ChIP-seq analysis and address the most common experimental pitfalls.
The main principle of the method is to use a specific antibody to enrich
DNA fragments which are bound by the protein of interest.
The DNA fragments are then sequenced, mapped onto the corresponding
reference genome, and computationally analyzed to distinguish regions which
were really bound by the protein, from the background regions.
The experimental methodology is depicted in
Figure \@ref(fig:ChIP-seq-Protocol-plot), and consists of the following steps:
1. Cross linking of cells with formaldehyde to bind the proteins to the DNA.
This process covalently links the proteins to the DNA.
2. Fragmentation of DNA using sonication or enzymatic digestion, shearing
of DNA into small fragments (ranging from 50 - 500 bp).
3. Immunoprecipitation using a specific antibody. An immunoprecipitation step
which enriches fragments bound by the protein.
4. Cross-link reversal. Frees the DNA fragments for further processing.
5. Size selection of DNA fragments. Only fragments of certain length are used in the
library preparation and sequencing.
6. Fragment amplification using PCR. The amount of DNA is a limiting step for the
protocol. Therefore the fragments need to be amplified using PCR.
7. DNA fragment sequencing
```{r ChIP-seq-Protocol-plot, echo=FALSE, include=TRUE, fig.cap="Main experimental steps in the ChIP-seq protocol."}
knitr::include_graphics('./Figures/Chip-Seq_Protocol_Extended.png')
```
After sequencing, the role of the computational biologist is to assess the
quality of the experiment, find the location of the protein of interest, and
finally, to integrate with existing data sets.
Each step of the experimental protocol can affect the quality
of the data set, and the subsequent analysis steps. It is, therefore, of crucial
importance to perform quality control for every sequenced experiment.
## Factors that affect ChIP-seq experiment and analysis quality
### Antibody specificity
Antibody specificity is a term which refers to how strongly an antibody
binds to its preferred target, with respect to everything else present in the cell.
It is the paramount measure influencing the successful execution of a ChIP
experiment.
Antibodies can bind multiple proteins with the same affinity.
This is called antibody cross-reactivity. If an antibody cross-reacts with
multiple proteins, the results of a ChIP experiment will be ambiguous.
Instead of finding where our protein binds to the DNA, we will get a
superposition of binding of multiple proteins.
Such data are impossible to analyze correctly, and will produce false conclusions.
There are many experimental procedures for validating antibody specificities, and
an antibody should pass multiple tests in order to be considered valid.
The exact recommendations are listed by the ENCODE consortium [@landt_2012].
Every time we are analyzing a new ChIP-seq experiment, we have to take our time
to convince ourselves that all of the appropriate experimental controls were performed
to validate the antibody specificity [@Wardle_2015].
### Sequencing depth
Variation in sequencing depth is the first systematic technical bias we
encounter in ChIP-seq experiments. Namely, different samples will contain different number of
sequenced reads. Different sequencing depth influences our ability to detect
enriched regions, and complicates comparisons between samples [@jung_2014].
The statistical procedure of removing the influence of sequencing depth on the
quantification is called depth scaling; we calculate a scaling factor which
is used to multiply the signal strength before the comparison.
There are multiple methods for normalization, and each method comes with its assumptions.
**Scale normalization** is done by dividing the read counts (in certain genomic locations)
by the total amount of sequenced reads. This method presumes that the ChIP
efficiency worked equally well in all studied conditions. Because the ChIP efficiency
differs in different antibodies, it is often unsuitable for comparisons of ChIP-seq
experiments done on different proteins.
**Robust normalization** tries to locate genomic regions which do not change between
different biological conditions (regions where the protein is constantly bound),
and then uses the sum of the reads in those regions as the scaling factor. This
method presumes that we can reliably identify regions which do not change [@shao_2012].
**Background normalization** presumes that the genome can be split into two categories:
background regions and true signal regions. It then uses the number of reads in the
background regions to define the scaling factor [@liang_2012].
**External normalization** uses external reference for normalization; we
add known amounts of chromatin from a distant species, or artificial spike-ins which are then
used as a scaling reference. This is used when we think there are global changes
in the biding profiles between two biological conditions -- very large changes in the
signal profile [@bonhoure_2014].
The choice of normalization method depends on the type of analysis [@angelini_2015];
if we want to quantitatively compare the abundance of different histone marks in
different cell types, we will need the different normalization procedure than if
we want to compare TF binding in the same setting.
### PCR duplication
The amounts of the DNA obtained after the ChIP experiment are quite often lower
than the minimal amount which can be sequenced.
Polymerase chain reaction (PCR) is a procedure used for amplification of DNA fragments.
It is used to increase the amount of DNA in our sample prior to sequencing.
PCR is a stochastic procedure, meaning that the results of each PCR
reaction cannot be predicted. Due to its stochastic nature, PCR can
be a significant source of variability in the ChIP-seq experiments
[@aird_2011; @benjamini_2012; @teng_2016]. A quality control is necessary to
check whether all of our samples have the same sequence properties, i.e. the
same enrichment of dinucleotides, such as CpG.
If the samples differ in their sequence properties, that means we have
to account for them during the analysis [@Teng_2017].
### Biological replicates
Biological replicates are independently executed ChIP-seq experiments from different
samples, corresponding to the same biological conditions. They are indispensable
for estimating ChIP quality, and give us an estimate of the variability in the
experiment which we can expect due to unknown biological variables.
Without biological replicates, it is statistically impossible to compare
ChIP-seq samples from different biological conditions, because we do not know
whether the observed changes are a result of the inherent biological variability
(the source of which we do not understand), or they result from the change
in the biological condition (different tissue or transcription factor used
in the experiment).
If we encounter an experimental setup which does not include biological
replicates, we should be extremely skeptical about all conclusions derived
from such analysis.
### Control experiments\index{Peak calling}
There are three types of control experiments which can be performed to control
for known and unknown experimental biases:
1. **Input control**: Sequencing of genomic DNA without the immunoprecipitation step.
2. **IgG control**: Using a polyclonal mixture of non-specific IgG antibodies instead of a specific antibody.
3. **Knockout control**: Performing the ChIP experiment in a biological system which
does not contains our protein of interest (i.e. in a cell line where the transcription
factor was knocked out) [@krebs_2014].
Each type of control experiment controls for a certain set of experimental biases.
**Input control** is the most frequent type of control performed.
It shows the differential susceptibility of genomic regions to the ChIP-procedure.
Due to the hierarchical structure of chromatin, different genomic regions have different propensities for cross-linking,
sonication, and immunoprecipitation. This causes an uneven probability of
observing DNA fragments originating from different genomic regions.
Because different cell types (cell lines, and cancer cell lines),
have different chromatin structure, ChIP samples will show a cell-type-specific
bias in observed enrichment profiles.
An important note to consider is that the input control is basically a
reduced whole genome sequencing experiment, while the ChIP enriches for only a subset
of genomic regions. If both ChIP and Input samples are sequenced to the same
depth (same number of reads), the background distribution in the input sample will
be under sampled. It is recommended to sequence the input sample deeper than the ChIP sample [@chen2012systematic].
**IgG control** uses a soup of nonspecific antibodies to control for background
binding. In principle, the antibodies should be isolated from the same batch of
serum which was used to create the specific antibody (used for ChIP). It should,
in theory, give a background profile of non-specific binding.
The proper control, is however, seldom available. Additionally, because the antibodies
are unspecific, the amount of precipitated DNA will be low, and the samples
will require additional rounds of PCR amplification.
**KO control** is a ChIP experiment performed in the biological system where
the native protein is not present. Such an experiment profiles the non-specific
binding of the antibody to other proteins, and directly to the DNA.
The primary, and only, concern is that the perturbation caused by the knock-out (or knock-down),
changes the cell so much, that the ChIP profile is not comparable to the original cell.
This is the most accurate type of control experiment, however, it is frequently technically challenging
to perform if the cells are not viable after the knock-out, or
if the knock-out is impossible to perform.
### Using tagged proteins
If an antibody of sufficient quality is not available, it is possible
to resort to constructs where the protein of interest gets engineered
with a ChIP-able tag. The proper control for such experiments is to perform
the ChIP in the cell line containing the engineered protein, and without the
protein.
It must be noted that the tagging procedure can change the binding preferences
of the protein, and therefore the experimental conclusions.
## Pre-processing ChIP data
The focus of ChIP preprocessing is to check the quality of the sequencing
experiment, remove sequencing artifacts, and find the genomic location of
sequenced fragments using read mapping.
The quality control consists of read quality control and adapter trimming.
These methods are described in depth in Chapter \@ref(processingReads).
### Mapping of ChIP-seq data
Mapping is a procedure of trying to locate the exact genomic location which
created each genomic fragment, each sequenced read.
Several tools are available for mapping ChIP-seq data sets:
Bowtie, Bowtie2, BWA [@langmead_2009; @langmead_2012; @li_2009], and all of them have comparable sensitivity and
specificity [@ruffalo_2011].
Read length is the variable with the biggest effect on the mapping procedure.
The longer the sequenced reads, the more uniquely can the read be assigned
to a position on the genome. Reads which are assigned ambiguously to multiple
locations in the genome are called multi-mapping reads. Such fragments
are most often produced by repetitive genomic regions, such as retrotransposons,
pseudogenes or paralogous genes [@li_2014].
It is important to, a priori, decide whether such duplicated regions are of
interest for the current experimental setup (i.e. whether we want to study transcription factor binding
in olfactory receptors). If they are, then the multi-mapping
reads should be included in the analysis. If they are not, they should be omitted.
This is done during the mapping step, by limiting the number of locations to which a read can map.
The methodology of working with multi-mapping reads differs according to the
use case, and will not be considered in this chapter. For more information, please
see the references [@chung_2011].
Current Illumina sequencing procedures enable sequencing of DNA
fragments from just one, or both ends.
Sequencing from both ends is called **paired-end** sequencing and greatly enhances
the sample **mappability**, the percentage of genome which can be uniquely mapped.
Additionally, it provides an out-of-the-box estimate of the
average DNA fragment length, a parameter which is important for quality control
and peak calling.
Although it would always be preferable to do paired-end sequencing it
substantially increases the sequencing costs, which can be prohibitive.
Different reads, which map to the same genomic location (same chromosome, position, and strand),
are called **duplicated reads**. Such reads are an indication that the same
DNA fragment was present multiple times during the library preparation. This
can happen due to high enrichment with highly specific antibodies, or such
fragments can be artificially produced during PCR amplification. Because we do not know
the exact origin of the duplicated fragments, they are most often collapsed during
the peak calling procedure, i.e. when multiple reads map to the same chromosome,
position, and strand, only one read is used. If the transcription factor
binds to a small number of regions in the genome, such data reduction might be
too stringent, and we can increase the sensitivity by allowing up to __N__ different
reads, per position (i.e. if more than __N__ reads map to the same location, only
__N__ reads are kept for the downstream analysis).
Some peak calling algorithms have automated statistical methods for determining the number
of reads, per position, which will be used in the analysis [@zhang_2008].
An important consideration to take into account is the genome which was used in
the experiment. Cell lines, cancer samples, and personal genomes usually contain
structural genomic alterations which are not present in the reference genome
(duplications, insertions, and deletions). Such regions can cause false
negatives, and false positives in the ChIP-seq experiment.
If a region was present multiple times in the experimental system, and only a single
time in the reference genome, it will be relatively enriched in the final
sequencing library. Such fragments will pile up on a single location during
the mapping step, and create an artificial peak, which can be falsely characterized
as a binding event.
Such regions are called **blacklisted** regions and should be removed from the
downstream analysis. The [UCSC browser database](http://genome.ucsc.edu) contains tables with
such regions for the most commonly used model organism species.
This chapter presumes that the user is already familiar with
the following technical and conceptual knowledge in computational data processing.
From Chapters \@ref(processingReads) and \@ref(genomicIntervals), you should be familiar with \index{read filtering}
\index{read mapping} the concept of multi-mapping reads,
and the following file formats BED\index{BED file}, GTF\index{GTF file},
WIG\index{WIG file}, bigWig\index{bigWig file}, BAM\index{BAM file}.
You should also be familiar with PCR\index{PCR}, what
are PCR duplicates, positive and negative DNA strands, and technical and biological
replicates.
## ChIP quality control
While the goal of the read quality assessment is to check whether the sequencing
produced a high enough number of high-quality reads
the goal of ChIP quality control is to ascertain whether the chromatin immunoprecipitation
enrichment was successful.
This is a crucial step in the ChIP-seq analysis because it can help us
identify low-quality ChIP samples, and give information about which experimental
steps went wrong.
There are four steps in ChIP quality control:
1. Sample correlation clustering: Clustering of the pair-wise correlations between
genome-wide signal profiles.
2. Data visualization in a genomic browser.
3. Average fragment length determination: Determining whether the ChIP was enriched for fragments of a certain length.
4. Visualization of GC bias. Here we will plot the ChIP enrichment versus the
average GC content in the corresponding genomic bin.
### The data
Here we will familiarize ourselves with the datasets that will be used in the
chapter.
Experimental data was downloaded from the public ENCODE [@ENCODE_Project_Consortium2012-wf]
database of ChIP-seq experiments.
The experiments were performed on a lymphoblastoid cell line, GM12878, and mapped
to the GRCh38 (hg38) version of the human genome, using the standard ENCODE
ChIP-seq pipeline. In this chapter, due to compute time considerations, we have taken a subset of the data which corresponds to the human chromosome 21 (chr21).
The data sets are located in the `compGenomRData`\index{R Packages!\texttt{compGenomRData}} package.
The location of the data sets can be accessed using the `system.file()` command,
in the following way:
```{r data.path}
data_path = system.file('extdata/chip-seq',package='compGenomRData')
```
The available datasets can be listed using the `list.files()` function:
```{r load.data, echo=TRUE, eval=TRUE, include=TRUE}
chip_files = list.files(data_path, full.names=TRUE)
```
```{r data.path.show, echo=FALSE, eval=FALSE, include=TRUE}
head(chip_files)
```
The dataset consists of the following ChIP experiments:
1. **Transcription factors**: CTCF\index{CTCF protein}, SMC3, ZNF143, PolII
(RNA polymerase 2)
2. **Histone modifications**\index{histone modification}: H3K4me3, H3K36me3, H3k27ac, H3k27me3
3. Various input samples
### Sample clustering
Clustering is an ordering procedure which groups samples by similarity;
the more similar samples are grouped closer to one another.
The details of clustering methodologies are described in Chapter \@ref(unsupervisedLearning).
Clustering of ChIP signal profiles is used for two purposes:
The first one is to ascertain whether there is concordance between
biological replicates; biological replicates should show greater similarity
than ChIP of different proteins. The second function is to see whether our experiments conform to known prior knowledge. For example, we would expect to see greater similarity between proteins
which belong to the same protein complex.
To quantify the ChIP signal we will firstly construct 1-kilobase-wide tilling
windows over the genome, and subsequently count the number of reads
in each window, for each experiment. We will then normalize the counts, to
account for a different total number of reads in each experiment, and finally
calculate the correlation between all pairs of samples.
Although this procedure represents a crude way of data quantification, it provides sufficient
information to ascertain the data quality.
Using the `GenomeInfoDb` we will first fetch the chromosome lengths corresponding
to the hg38 version of the human genome, and filter the length for human
chromosome 21.
```{r sample-clustering.seqlen, message=FALSE, warning=FALSE}
# load the chromosome info package
library(GenomeInfoDb)
# fetch the chromosome lengths for the human genome
hg_chrs = getChromInfoFromUCSC('hg38')
# find the length of chromosome 21
hg_chrs = subset(hg_chrs, grepl('chr21$',chrom))
```
The `tileGenome()` function from the `GenomicRanges` package constructs equally sized
windows over the genome of interest.
The function takes two arguments:
1. A vector of chromosome lengths
2. Window size
Firstly, we convert the chromosome lengths _data.frame_ into a _named vector_.
```{r sample-clustering.seqlen_vector}
# downloaded hg_chrs is a data.frame object,
# we need to convert the data.frame into a named vector
seqlengths = with(hg_chrs, setNames(size, chrom))
```
Then we construct the windows.
```{r sample-clustering.tilling_window}
# load the genomic ranges package
library(GenomicRanges)
# tileGenome function returns a list of GRanges of a given width,
# spanning the whole chromosome
tilling_window = tileGenome(seqlengths, tilewidth=1000)
# unlist converts the list to one GRanges object
tilling_window = unlist(tilling_window)
```
```{r sample-clustering.show_tilling_window, include=TRUE, echo=FALSE, eval=TRUE}
tilling_window
```
We will use the `summarizeOverlaps()` function from the `GenomicAlignments` package
to count the number of reads in each genomic window.
The function will do the counting automatically for all our experiments.
The `summarizeOverlaps()` function returns a `SummarizedExperiment` object.
The object contains the counts, genomic ranges which were used for the quantification,
and the sample descriptions.
```{r sample-clustering.GenomicAlignments, warning=FALSE, message=FALSE}
# load GenomicAlignments
library(GenomicAlignments)
# fetch bam files from the data folder
bam_files = list.files(
path = data_path,
full.names = TRUE,
pattern = 'bam$'
)
# use summarizeOverlaps to count the reads
so = summarizeOverlaps(tilling_window, bam_files)
# extract the counts from the SummarizedExperiment
counts = assays(so)[[1]]
```
Different ChIP experiments were sequenced to different depths; each experiment
contains a different number of reads. To remove the effect of the experimental
depth on the quantification, the samples need to be normalized.
The standard normalization procedure, for ChIP data, is to divide the
counts in each tilling window by the total number of sequenced reads, and
multiply it by a constant factor (to avoid extremely small numbers).
This normalization procedure is called the **cpm**\index{cpm} - counts per million.
\[
CPM = counts * (10^{6} / total\>number\>of\>reads)
\]
```{r sample-clustering.norm}
# calculate the cpm from the counts matrix
# the following command works because
# R calculates everything by columns
cpm = t(t(counts)*(1000000/colSums(counts)))
```
We remove all tiles which do not have overlapping reads.
Tiles with 0 counts do not provide any additional discriminatory power, rather,
they introduce artificial similarity between the samples (i.e. samples with
only a handful of bound regions will have a lot of tiles with $0$ counts, while
they do not have to have any overlapping enriched tiles).
```{r sample-clustering.filter_cpm}
# remove all tiles which do not contain reads
cpm = cpm[rowSums(cpm) > 0,]
```
We use the `sub()` function to shorten the column names of the cpm matrix.
```{r sample-clustering.change_colnames}
# change the formatting of the column names
# remove the .chr21.bam suffix
colnames(cpm) = sub('.chr21.bam','', colnames(cpm))
# remove the GM12878_hg38 prefix
colnames(cpm) = sub('GM12878_hg38_','',colnames(cpm))
```
```{r sample-clustering.colnames, include=FALSE, eval=TRUE, echo=TRUE}
colnames(cpm)
```
Finally, we calculate the pairwise Pearson correlation coefficient using the
`cor()` function.
The function takes as input a region-by-sample count matrix, and returns
a sample X sample matrix, where each field contains the correlation coefficient
between two samples.
```{r sample-clustering.correlation}
# calculates the pearson correlation coefficient between the samples
correlation_matrix = cor(cpm, method='pearson')
```
The `Heatmap()` function from the `ComplexHeatmap` [@Gu_2016] package is used to visualize
the correlation coefficient.\index{R Packages!\texttt{ComplexHeatmap}}
The function automatically performs hierarchical clustering - it groups the
samples which have the highest pairwise correlation.
The diagonal represents the correlation of each sample with itself.
```{r sample-clustering-complex-heatmap, fig.cap='Heatmap showing ChIP-seq sample similarity using the Pearson correlation coefficient.'}
# load ComplexHeatmap
library(ComplexHeatmap)
# load the circlize package, and define
# the color palette which will be used in the heatmap
library(circlize)
heatmap_col = circlize::colorRamp2(
breaks = c(-1,0,1),
colors = c('blue','white','red')
)
# plot the heatmap using the Heatmap function
Heatmap(
matrix = correlation_matrix,
col = heatmap_col
)
```
In Figure \@ref(fig:sample-clustering-complex-heatmap) we can see a
perfect example of why quality control is important.
**CTCF** is a zinc finger protein which co-localizes with the Cohesin complex.
**SMC3** is a sub unit of the Cohesin complex, and we would therefore expect to
see that the **SMC3** signal profile has high correlation with the **CTCF** signal profile.
This is true for the second biological replicate of **SMC3**, while the first
replicate (SMC3_r1) clusters with the input samples. This indicates that the
sample likely has low enrichment.
We can see that the ChIP and Input samples form separate clusters. This implies
that the ChIP samples have an enrichment of fragments.
Additionally, we see that the biological replicates of other experiments
cluster together.
### Visualization in the genome browser
One of the first steps in any ChIP-seq analysis should be looking at the
data. By looking at the data we get an intuition about the quality of the
experiment, and start seeing preliminary correlations between the samples, which
we can use to guide our analysis.
This can be achieved either by plotting signal profiles around
regions of interest, or by loading data into a genome browser
(such as IGV\index{IGV Browser}, or UCSC genome browsers\index{UCSC Genome Browser}).
Genome browsers are standalone applications which represent the genome
as a one-dimensional (1D) coordinate system. The browsers enable
simultaneous visualization and comparison of multiple types of annotations and experimental data.
Genome browsers can visualize most of the commonly used genomic data formats:
BAM\index{BAM file}, BED\index{BED file}, wig\index{wig file}, and bigWig\index{bigWig file}.
The easiest way to access our data would be to load the .bam files into the browser. This will show us the sequence and position of every mapped read. If we want to view multiple samples in parallel, loading every mapped read can be restrictive. It takes up a lot of computational resources, and the amount of information
makes the visual comparison hard to do.
We would like to convert our data so that we get a compressed visualization,
which would show us the main properties of our samples, namely, the quality and
the location of the enrichment.
This is achieved by summarizing the read enrichment into a signal profile -
the whole experiment is converted into a numeric vector - a coverage vector.
The vector contains information on how many reads overlap each position
in the genome.
We will proceed as follows: Firstly, we will import a **.bam** file into **R**. Then we will calculate the signal profile (construct the coverage vector), and finally, we export the vector as a **.bigWig** file.
First we select one of the ChIP samples.
```{r genome-browser.file}
# list the bam files in the directory
# the '$' sign tells the pattern recognizer to omit bam.bai files
bam_files = list.files(
path = data_path,
full.names = TRUE,
pattern = 'bam$'
)
# select the first bam file
chip_file = bam_files[1]
```
We will use the `readGAlignments()` function from the `GenomicAlignments`
package to load the reads into **R**, and then the `GRanges()` function
to convert them into a `GRanges` object.
```{r genome-browser.alignments}
# load the genomic alignments package
library(GenomicAlignments)
# read the ChIP reads into R
reads = readGAlignments(chip_file)
# the reads need to be converted to a granges object
reads = granges(reads)
```
Because DNA fragments are being sequenced from their ends (both the 3' and 5' end),
the read enrichment does not correspond to the exact location of the bound protein.
Rather, reads end to form clusters of enrichment upstream and downstream of the true binding location.
To correct for this, we use a small hack. Before we create the signal profiles,
we will extend the reads towards their __3'__ end. The reads are extended to
form fragments of 200 base pairs. This is an empiric measure, which
corresponds to the average fragment size of the Illumina sample preparation kit.
The exact average fragment size will differ from 200 base pairs, but if the
deviation is not large (i.e. more than 200 base pairs),
it will not affect the visual properties of our samples.
Read extension is done using the `resize()` function. The function
takes two arguments:
1. `width`: resulting fragment width
2. `fix`: which position of the fragment should not be changed (if `fix` is set to start,
the reads will be extended towards the __3'__ end. If `fix` is set to end, they will
be extended towards the __5'__ end)
```{r genome-browser.resize}
# extends the reads towards the 3' end
reads = resize(reads, width=200, fix='start')
# keeps only chromosome 21
reads = keepSeqlevels(reads, 'chr21', pruning.mode='coarse')
```
Conversion of reads into coverage vectors is done with the `coverage()`
function.
The function takes only one argument (`width`), which corresponds to chromosome sizes.
For this purpose we can use the, previously created, `seqlengths` variable.
The `coverage()` function converts the reads into a compressed `Rle` object. We have introduced these workflows in Chapter \@ref(genomicIntervals).
```{r genome-browser.coverage}
# convert the reads into a signal profile
cov = coverage(reads, width = seqlengths)
```
```{r genome-browser.coverage.show, include=TRUE, echo=FALSE, eval=TRUE}
head(cov, 5)
```
The name of the output file is created by changing the file suffix from **.bam**
to **.bigWig**.
```{r genome-browser.rename}
# change the file extension from .bam to .bigWig
output_file = sub('.bam','.bigWig', chip_file)
```
Now we can use the `export.bw()` function from the rtracklayer package to
write the bigWig file.
```{r genome-browser.export}
# load the rtracklayer package
library(rtracklayer)
# export the bigWig output file
export.bw(cov, 'output_file')
```
#### Vizualization of track data using Gviz
We can create genome browserlike visualizations using the `Gviz` package,
which was introduced in Chapter \@ref(genomicIntervals).
The `Gviz` is a tool which enables exhaustive customized visualization of
genomics experiments. The basic usage principle is to define tracks, where each track can represent
genomic annotation, or a signal profile; subsequently we define the order
of the tracks and plot them.
Here we will define two tracks, a genome axis, which will show the position
along the human chromosome 21; and a signal track from our CTCF experiment.
```{r genome-browser.gviz}
library(Gviz)
# define the genome axis track
axis = GenomeAxisTrack(
range = GRanges('chr21', IRanges(1, width=seqlengths))
)
# convert the signal into genomic ranges and define the signal track
gcov = as(cov, 'GRanges')
dtrack = DataTrack(gcov, name = "CTCF", type='l')
# define the track ordering
track_list = list(axis,dtrack)
```
Tracks are plotted with the `plotTracks()` function. The `sizes` argument needs to be the same size as the track_list, and defines the
relative size of each track.
Figure \@ref(fig:genome-browser-gviz-show) shows the output of the
`plotTracks()` function.
```{r genome-browser-gviz-show, fig.cap='ChIP-seq signal visualized as a browser track using Gviz.', fig.width=8, fig.height = 3}
# plot the list of browser tracks
# sizes argument defines the relative sizes of tracks
# background title defines the color for the track labels
plotTracks(
trackList = track_list,
sizes = c(.1,1),
background.title = "black"
)
```
### Plus and minus strand cross-correlation
Cross-correlation between plus and minus strands is a method
which quantifies whether the DNA library was enriched for fragments of
a certain length.
Similarity between the plus and minus strands defined as the correlation of
the signal profiles for the reads that map to the **+** and the **-** strands.
The distribution of reads is shown in Figure \@ref(fig:Figure-BrowserScreenshot).
```{r Figure-BrowserScreenshot, echo=FALSE, include=TRUE, fig.cap='Browser screenshot of aligned reads for one ChIP, and control sample. ChIP samples have an asymetric distribution of reads; reads mapping to the + strand are located on the left side of the peak, while the reads mapping to the - strand are found on the right side of the peak.'}
knitr::include_graphics('./Figures/BrowserScreenshot.png')
```
Due to the sequencing properties, reads which correspond to
the __5'__ fragment ends will map to the opposite strand from the reads
coming from the __3'__ ends. Most often (depending on the sequencing protocol)
the reads from the __5'__ fragment ends map to the **+** strand,
while the reads from the __3'__ ends map to the **-** strand.
We calculate the cross-correlation by shifting the signal on the **+** strand,
by a pre-defined amount (i.e. shift by 1 - 400 nucleotides), and calculating,
for each shift, the correlation between the **+**, and the **-** strands.
Subsequently we plot the correlation versus shift, and locate the maximum value.
The maximum value should correspond to the average DNA fragment length which
was present in the library. This value tells us whether the ChIP enriched for
fragments of certain length (i.e. whether the ChIP was successful).
Due to the size of genomic data, it might be computationally prohibitive to
calculate the Pearson correlation between whole genome (or even whole chromosome)
signal profiles.
To get around this problem, we will resort to a trick; we will disregard the dynamic
range of the signal profiles, and only keep the information of which
genomic bases contained the ends of the fragments.
This is done by calculating the coverage vector of the read starting position (separately
for each strand), and converting the coverage vector into a Boolean vector.
The Boolean vector contains the information of which genomic positions
contained the DNA fragment ends.
Similarity between two Boolean vectors can be promptly computed using the Jaccard index.
The Jaccard index is defined as an intersection between two Boolean vectors,
divided by their union as shown in Figure \@ref(fig:FigureJaccardSimilarity).
```{r, FigureJaccardSimilarity, echo=FALSE,fig.align = 'center', fig.cap="Jaccard similarity is defined as the ratio of the intersection and union of two sets.",out.width="30%"}
knitr::include_graphics('./Figures/Jaccard.png')
```
Firstly, we load the reads for one of the CTCF ChIP experiments.
Then we create signal profiles, separately for reads on the **+** and **-**
strands.
Unlike before, we do not extend the reads to the average expected fragment
length (200 base pairs); we keep only the starting position of each read.
```{r correlation.load}
# load the reads
reads = readGAlignments(chip_file)
reads = granges(reads)
# keep only the starting position of each read
reads = resize(reads, width=1, fix='start')
reads = keepSeqlevels(reads, 'chr21', pruning.mode='coarse')
```
Now we can calculate the coverage vector of the read starting position.
The coverage vector is then automatically converted into a Boolean vector by
asking which genomic positions have $coverage > 0$.
```{r correlation.coverage}
# calculate the coverage profile for plus and minus strand
reads = split(reads, strand(reads))
# coverage(x, width = seqlengths)[[1]] > 0
# calculates the coverage and converts
# the coverage vector into a boolean
cov = lapply(reads, function(x){
coverage(x, width = seqlengths)[[1]] > 0
})
cov = lapply(cov, as.vector)
```
We will now shift the coverage vector from the plus strand by $1$ to $400$ base pairs, and for each pair shift we will calculate the Jaccard index between the vectors
on the plus and minus strand.
```{r correlation.jaccard, cache=TRUE}
# defines the shift range
wsize = 1:400
# defines the jaccard similarity
jaccard = function(x,y)sum((x & y)) / sum((x | y))
# shifts the + vector by 1 - 400 nucleotides and
# calculates the correlation coefficient
cc = shiftApply(
SHIFT = wsize,
X = cov[['+']],
Y = cov[['-']],
FUN = jaccard
)
# converts the results into a data frame
cc = data.frame(fragment_size = wsize, cross_correlation = cc)
```
We can finally plot the shift in base pairs versus the correlation coefficient:
```{r correlation-plot, fig.cap='The figure shows the correlation coefficient between the ChIP-seq signal on + and $-$ strands. The peak of the distribution designates the fragment size'}
library(ggplot2)
ggplot(data = cc, aes(fragment_size, cross_correlation)) +
geom_point() +
geom_vline(xintercept = which.max(cc$cross_correlation),
size=2, color='red', linetype=2) +
theme_bw() +
theme(
axis.text = element_text(size=10, face='bold'),
axis.title = element_text(size=14,face="bold"),
plot.title = element_text(hjust = 0.5)) +
xlab('Shift in base pairs') +
ylab('Jaccard similarity')
```
Figure \@ref(fig:correlation-plot) shows the shift in base pairs,
which corresponds to the maximum value of the correlation coefficient
gives us an approximation to the expected average DNA fragment length.
Because this value is not 0, or monotonically decreasing, we can conclude
that there was substantial enrichment of certain fragments in the ChIP samples.
### GC bias quantification
The PCR amplification procedure can cause a significant bias in the ChIP
experiments. The bias can be influenced by the DNA fragment size distribution,
sequence composition, hexamer distribution of PCR primers, and the number of cycles used
for the amplification.
One way to determine whether some of the samples have significantly
different sequence composition is to look at whether regions with
differing GC composition were equally enriched in all experiments.
We will do the following: Firstly we will calculate the GC content of each
of the tilling windows, and then we will compare the GC content with the corresponding
cpm\index{cpm} (count per million reads) value, for each tile.
```{r gc.tile}
# fetches the chromosome lengths and constructs the tiles
library(GenomeInfoDb)
library(GenomicRanges)
hg_chrs = getChromInfoFromUCSC('hg38')
hg_chrs = subset(hg_chrs, grepl('chr21$',chrom))
seqlengths = with(hg_chrs, setNames(size, chrom))
# tileGenome produces a list per chromosome
# unlist combines the elemenents of the list
# into one GRanges object
tilling_window = unlist(tileGenome(
seqlengths = seqlengths,
tilewidth = 1000
))
```
We will extract the sequence information from the `BSgenome.Hsapiens.UCSC.hg38`
package. `BSgenome` are generic Bioconductor containers for genomic sequences.
Sequences are extracted from the `BSgenome` container using the `getSeq()` function.
The `getSeq()` function takes as input the genome object, and the ranges with the
regions of interest; in our case, the tilling windows.
The function returns a `DNAString` object.
```{r gc.getseq}
# loads the human genome sequence
library(BSgenome.Hsapiens.UCSC.hg38)
# extracts the sequence from the human genome
seq = getSeq(BSgenome.Hsapiens.UCSC.hg38, tilling_window)
```
To calculate the GC content, we will use the `oligonucleotideFrequency()` function on the
`DNAString` object. By setting the width parameter to 2 we will
calculate the **dinucleotide** frequency.
Each row in the resulting table will contain the number of all possible
dinucleotides observed in each tilling window.
Because we have tilling windows of the same length, we do not
necessarily need to normalize the counts by the window length.
If all of the windows have different lengths (i.e. when at the ChIP-seq peaks), then normalization is a prerequisite.
```{r gc.oligo}
# calculates the frequency of all possible dimers
# in our sequence set
nuc = oligonucleotideFrequency(seq, width = 2)
# converts the matrix into a data.frame
nuc = as.data.frame(nuc)
# calculates the percentages, and rounds the number
nuc = round(nuc/1000,3)
```
Now we can combine the GC frequency with the cpm values.
We will convert the cpm values to the log10 scale. To avoid
taking the $log(0)$, we add a pseudo count of 1 to cpm.
```{r gc.cpm, cache=TRUE, warning=FALSE}
# counts the number of reads per tilling window
# for each experiment
so = summarizeOverlaps(tilling_window, bam_files)
# converts the raw counts to cpm values
counts = assays(so)[[1]]
cpm = t(t(counts)*(1000000/colSums(counts)))
# because the cpm scale has a large dynamic range
# we transform it using the log function
cpm_log = log10(cpm+1)
```
Combine the cpm values with the GC content,
```{r gc-combine}
gc = cbind(data.frame(cpm_log), GC = nuc['GC'])
```
and plot the results.
```{r gc-plot, fig.cap='GC content abundance in a ChIP-seq experiment.'}
ggplot(
data = gc,
aes(
x = GC,