04-CNVanalysis.Rmd


# Copy number variation estimation from scRNA-seq { #CNV_analysis }

## Introduction
Copy number variation is a major mutations in many tumors. Recently, 
[Minussi et al, 2021](https://www.nature.com/articles/s41586-021-03357-x) 
suggested novel evolutionary patterns through analyzing CNV in breast tumors.

There are three types of CNVs:

- Copy gain
- Copy loss
- Loss of heterozygosity

They some may not affect the total copies, e.g., Loss of heterozygosity, while 
some may not affect the balance of two allele, e.g., copy gain (2 vs 2). 

There are four major methods for CNV analyis in scRNA-seq:

- [inferCNV](https://github.com/broadinstitute/infercnv)
- [CopyKat](https://github.com/navinlabcode/copykat) by [Gao et al 2020](https://www.nature.com/articles/s41587-020-00795-2)
- [CaSpER](https://github.com/akdess/CaSpER) by [Harmanci et al, 2020](https://www.nature.com/articles/s41467-019-13779-x)
- [HoneyBadger](https://jef.works/HoneyBADGER/) by [Fan et al](https://genome.cshlp.org/content/early/2018/06/13/gr.228080.117)

Here, we will illustration how to use the method `inferCNV` for analysing CNV in
single-cell RNA-seq data. We will discuss other methods in the end.

## inferCNV and example

[InferCNV: Inferring copy number alterations from tumor single cell RNA-Seq data](https://github.com/broadinstitute/inferCNV/wiki)

### install inferCNV
**Software Requirements**

- [JAGS](https://mcmc-jags.sourceforge.io/)
- R (>3.6)

In order to run infercnv, JAGS (Just Another Gibbs Sampler) must be installed.

Download JAGS from <https://sourceforge.net/projects/mcmc-jags/files/JAGS/4.x/> and install JAGS in your environment (windows/MAC).

If you use inferCNV on server, install JAGS via `conda install` in your conda environment is recommended.

```
conda install -c conda-forge jags
```

More details refer to [inferCNV wiki page](https://github.com/broadinstitute/inferCNV/wiki)

**Five options for installing inferCNV**

Option A: Install infercnv from BioConductor (preferred)
```
if (!requireNamespace("BiocManager", quietly = TRUE))
     install.packages("BiocManager")
BiocManager::install("infercnv")
```

For more other options, refer to [Five options for installing inferCNV](https://github.com/broadinstitute/inferCNV/wiki/Installing-infercnv)

**Data requirements**

- a raw counts matrix of single-cell RNA-Seq expression
- an annotations file which indicates which cells are tumor vs. normal.
- a gene/chromosome positions file

[File-Definitions](https://github.com/broadinstitute/inferCNV/wiki/File-Definitions)


### getting started

If you have installed infercnv from BioConductor, you can run the example data with:

```
library(infercnv)

infercnv_obj = CreateInfercnvObject(raw_counts_matrix=system.file("extdata", "oligodendroglioma_expression_downsampled.counts.matrix.gz", package = "infercnv"),
                                    annotations_file=system.file("extdata", "oligodendroglioma_annotations_downsampled.txt", package = "infercnv"),
                                    delim="\t",
                                    gene_order_file=system.file("extdata", "gencode_downsampled.EXAMPLE_ONLY_DONT_REUSE.txt", package = "infercnv"),
                                    ref_group_names=c("Microglia/Macrophage","Oligodendrocytes (non-malignant)")) 

infercnv_obj = infercnv::run(infercnv_obj,
                             cutoff=1, # cutoff=1 works well for Smart-seq2, and cutoff=0.1 works well for 10x Genomics
                             out_dir=tempfile(), 
                             cluster_by_groups=TRUE, 
                             denoise=TRUE,
                             HMM=TRUE)
```

If you can run the getting started part with demo data provided by inferCNV, then it is installed successfully.


**Demo Example Figure**

![Demo Example Figure](CNV_analysis_file/figs/demo-infercnv.png){width=85%}

## Application on TNBC1

### data description

TNBC1 is a triple negative breast cancer tumor sample of high tumor purity (72.6%) with 796 single tumor cells and 301 normal cells. The dataset is available on NCBI GEO under the accession number GSM4476486.

: Details of TNBC1 dataset (from published articles, copyKAT).

+-------------------------------+--------------------+-----------------------+------------------------------------+
| TNBC1                         | Number of clones   | Number of tumor clones|  Tumor clone-specific copy gain    |
+===============================+====================+=======================+====================================+
| Triple negative breast cancer |         3          |             2         | - C1: 4p, 7q, 9, 17q               |
|                               |                    |                       | - C2: 3p, 6q, 7p, 11q, X           |
+-------------------------------+--------------------+-----------------------+------------------------------------+


- Expression

Clone A       Clone B       Normal
------        ------        ------
488           307           302

Table:  Subclusters of TNBC1 dataset (from gene expression analysis-Seurat).

Notes: 488,307, 247, 55


![umap](CNV_analysis_file/figs/tnbc1-exp/tnbc1-umap-3clusters.jpeg){width=50%}


- B Allele Frenquency (BAF)

![baf1](CNV_analysis_file/figs/tnbc1-baf/0sub_chr.pdf){width=120%}

![baf2](CNV_analysis_file/figs/tnbc1-baf/1sub_chr.pdf){width=120%}

![baf3](CNV_analysis_file/figs/tnbc1-baf/2sub_chr.pdf){width=120%}


- BAF V.S. Expression

![confusion heatmap](CNV_analysis_file/figs/BAFvsExpression.png){width=50%}


### run inferCNV

[data_download](https://sourceforge.net/projects/sgcellworkshop/files/CNV_analysis/CNV_analysis_data.zip/download)


[demo1_log_file](https://github.com/Rongtingting/SingleCell-Workshop-2021/blob/master/CNV_analysis_file/demo1_inferCNV_TNBC1.log)

[demo2_log_file](https://github.com/Rongtingting/SingleCell-Workshop-2021/blob/master/CNV_analysis_file/demo2_inferCNV_TNBC1.log)

[output_files](https://github.com/Rongtingting/SingleCell-Workshop-2021/blob/master/CNV_analysis_file/demo2_outputfiles.lst)


```{r}
# library(infercnv)
# library(utils)
# library (BiocGenerics)

## DEMO1
# 
# tnbc <- read.delim("C://Users/Rongting/Documents/GitHub_repos/combinedTNBC1.txt")
# anno <- tnbc[2,]
# anno <- t(anno)
# anno <- as.data.frame(anno)
# 
# gex <- tnbc[-c(1:2),]
# gex <- type.convert(gex)
# 
# gene_file <- "C://Users/Rongting/Documents/GitHub_repos/gene_note_noheader_unique.txt"

# 
# infercnv_obj = CreateInfercnvObject(raw_counts_matrix=gex,
#                                     annotations_file=anno,
#                                     delim='\t',
#                                     gene_order_file=gene_file,
#                                     ref_group_names= "N")

# output = "C://Users/Rongting/Documents/GitHub_repos/tnbc1_demo"

# infercnv_obj = infercnv::run(infercnv_obj,
#                              cutoff=0.1,  
#                              out_dir= output , 
#                              cluster_by_groups=T,   
#                              denoise=T,
#                              HMM=T)

## DEMO2

# gene_file <- "C://Users/Rongting/Documents/GitHub_repos/gene_note_noheader_unique.txt"
# 
# anno_file <- 'C://Users/Rongting/Documents/GitHub_repos/tnbc-3cluster-id.txt'
# 
# infercnv_obj2 = CreateInfercnvObject(raw_counts_matrix=gex,
#                                     annotations_file=anno_file,
#                                     delim='\t',
#                                     gene_order_file=gene_file,
#                                     ref_group_names= "Normal")
# 
# output = "C://Users/Rongting/Documents/GitHub_repos/tnbc1_demo2"
# 
# infercnv_obj2 = infercnv::run(infercnv_obj2,
#                              cutoff=0.1,
#                              out_dir= output,
#                              cluster_by_groups=T,
#                              denoise=T,
#                              HMM=T)


```


```
#################
##Notes
#################

## load the package
library(Seurat)
library(infercnv)

## prepare the data (cellranger output)

### load count matrix (example)
matrix_path <- "../cellranger/xxxx/count_xxxxx/outs/filtered_gene_bc_matrices/GRCh38/"


### read count matrix
gex_mtx <- Seurat::Read10X(data.dir = matrix_path)

### run inferCNV with loop
celltype = c('CloneA', 'CloneB', 'Normal')

for (i in celltype){
  infercnv_obj1 = CreateInfercnvObject(raw_counts_matrix=gex_mtx,
                                       annotations_file=anno_file,
                                       delim='\t',
                                       gene_order_file=gene_file,
                                       ref_group_names=c(i))
  
  output <- paste0('/groups/cgsd/rthuang/processed_data/inferCNV/xxxx/','xxxx_', i)
  
  infercnv_obj1 = infercnv::run(infercnv_obj1,
                                cutoff=0.1,  
                                out_dir= output , 
                                cluster_by_groups=T,   
                                denoise=T,
                                HMM=T)
}

```


### inferCNV result

- demo1

![infercnv1](./CNV_analysis_file/figs/tnbc1_demo1/infercnv.png){width=85%}

![infercnv2](./CNV_analysis_file/figs/tnbc1_demo1/infercnv.18_HMM_pred.Bayes_Net.Pnorm_0.5.png){width=85%}


- demo2

![infercnv3](./CNV_analysis_file/figs/tnbc1_demo2/infercnv.png){width=85%}

![infercnv4](./CNV_analysis_file/figs/tnbc1_demo2/infercnv.18_HMM_pred.Bayes_Net.Pnorm_0.5.png){width=85%}


## Last notes
There are four major methods for CNV analyis in scRNA-seq:
[inferCNV](https://github.com/broadinstitute/infercnv),
[CopyKat](https://github.com/navinlabcode/copykat), 
[CaSpER](https://github.com/akdess/CaSpER), and
[HoneyBadger](https://jef.works/HoneyBADGER/) 

However, the two BAF-supporting methods HoneyBadger and Casper works less 
accurately from our experience. The expression-only method InferCNV and CopyKat
generally works fine, though it has no capability to detect loss of 
heterozygosity. In general, we suggest perform a pseudo-bulk BAF on the 
identified clones or cell clusters with copy number variations.

Feel free to get in touch (yuanhua@hku.hk) if you need advice on CNV analysis 
on your own data / project.

**ref**


https://www.r-bloggers.com/2012/04/getting-started-with-jags-rjags-and-bayesian-modelling/

```
bookdown::render_book("index.Rmd", "bookdown::gitbook")
```