Skip to content

Latest commit

 

History

History
78 lines (35 loc) · 11.6 KB

README.md

File metadata and controls

78 lines (35 loc) · 11.6 KB

Introduction

One of the main focuses of biology during the past 50 years has been the characterization of the genetic variation associated with certain phenotypes. It has been possible to identify this genetic variation thanks to the sequencing of the human genome (International Human Genome Sequencing Consortium 2001) and the discovery of the human genome sequence variation(The International SNP Map Working Group 2001). This led to the design of arrays that were able to genotype several variants in a genome-wide fashion. During the last decade, with the prices of whole-genome sequencing getting cheaper and cheaper, high-throughput sequencing has been the technology of choice, knocking off SNP arrays.

With both marker information - derived from variant calling on whole-genome sequencing - and phenotypic measures Genome Wide Association Studies (GWAS) can be performed to test for association between the trait and genetic markers in a population of interest. GWAS have been mostly focused on humans. But the increasing number of bacterial genomes available has caused GWAS to be used to determine genetic variation associated with different traits in bacterial populations. The main challenge of this project was to deal with bacterial GWAS problems: strong population structure and causative variation presence on the pangenome.

Due to the haploid clonal nature of bacteria and the lack of recombination, all variants in bacterial genomes are correlated. What this means is that if we introduce 10 new variants in a bacterial population of which only one is causal of the phenotype of study, we will not be able to identify this variant as causal. Instead, we would find that all 10 variants are related to the phenotype since they are correlated between them due to the lack of recombination and the clonal nature of bacteria. This phenomenon can be seen as a genome-wide linkage disequilibrium. In turn, this generates clonal lineages with discrete genetic backgrounds. Associations between variants that are correlated with these genetic backgrounds and the phenotype are known as "lineage effects". These effects although inevitable, should be taken into account.

Is every variant association going to be cofounded with lineage effects? No. There are cases were an association might be correctly mapped to "locus effects". Both horizontal DNA transfer and homoplasy across the phylogeny can help break the LD and the correlation of variants across the genome. The relative importance of either of these factors will depend on the species of study.

The classic approach for variant calling is based on alignment of the reads against a reference genome. The choice of the reference genome is key on bacterial populations. If we only include the core genome, there are going to be misalignments with the accessory genomes of the different strains. This would jeopardize our power to detect variants of interest. It is preferred to use the pangenome as the reference genome when working with bacterial populations.

Methods

Biological data

The dataset used in this project is publicly available [https://evocellnet.github.io/ecoref/] (Galardini et al. 2017). It consists of 696 Escherichia coli strains phenotyped for 214 environments. The measured phenotype is growth. The authors provide phenotypic information and genotypic information that were used in my analyses. Other relevant information is also available for the dataset such as pangenome, phylogeny, conditions and strains.

Software

After a literature research, two different programs were chosen as candidates to perform the analyses: bugwas (Earle et al. 2016) and pyseer(Lees et al. 2018).

Bugwas is implemented in R and it tests for locus and lineage associations in bacterial GWAS. One thing to consider with bugwas is that it only takes binary, discrete phenotypes. The typical processing pipeline starts with an alignment-free k-mer counting to encapsulate genetic variability. This deals with the problem of not capturing enough variants due to the high variability in the pangenome of the bacterial population. In order to test for lineage effects and locus effects bugwas is based on the fact that Principal Component Analysis (PCA) can correct for population structure on GWAS studies (Reich et al. 2006) if these are included as fixed effects on the Linear Mixed Model. Every Principal Component (PC) will correspond to a lineage on bacterial populations. To detect lineage effects while boosting power Bugwas includes the PC – read 'lineages' here – as random effects on the LMM and recovers indications of lineage-level associations. Then it tests these associations between the lineage and the phenotype using a Wald test (Wald 1943). One thing to consider with bugwas is that it only takes binary, discrete phenotypes.

Pyseer is a Python reimplementation of SEER (Lees et al. 2016). What characterizes pyseer is its comprehensive and reconciling nature, adding previously separated approaches, bugwas and scoary _(_Brynildsrud et al. 2016), into one single package. Pyseer can take as input different formats that encapsulate genetic diversity in the population (SNPs, k-mers or presence/absence of COGs), it also incorporates different association models (Fixed effects Generalized Linear Models and Linear Mixed Models). Furthermore, it includes bugwas method to estimate possible lineage effects.

Although bugwas outputs plots directly, pyseer output consists of raw data in .tsv format. Because of this I had to modify qqman_(D. Turner 2018)_ an existing R package that plots human-driven GWAS results.

Scripts

An important technical difficulty of using bugwas is making sure that the identity and order of samples between the VCF file, phenotype file and population structure match up correctly. Therefore, I had to code my own Python script to correctly format the data and match the samples among files before using these data as input to bugwas. Furthermore, I had to normalize between 0 and 1 the phenotype values and remove the invariant sites from the original VCF.

In order to run bugwas,a population phylogenetic tree has to be provided. As different conditions on this dataset may have different sets of samples, I also generalized the construction of a specific phylogenetic tree for every condition. For this I used the command line tools in Tassel (Bradbury et al. 2007). Inside the main script I call other secondary R scripts of my own that transform the output from Tassel to a newick tree and run bugwas' lin_loc (lineage and locus effects) function with all the required parameters, everything is run on default. This function outputs Manhattan Plots, QQ plots and PCA plots for a given condition. These give us information about significant associations between the variant tested and the condition. I automated all of these analyses for the 214 environments inside a Snakemake pipeline where the user can tweak the directories and the phenotypes to loop through. The bugwas run is more script focused, setting rules, conditions and parameters inside the script itself.

The pyseer implementation is more pipeline focused, defining rules and loops inside the pipeline. Hence, the processing is easier to follow and modify. The user has a yaml config file where most tweakable parameters can be modified in addition to directories and other general best practices. Pyseer, although having more options and being more versatile than bugwas, is more user-friendly, dealing with most of the formatting problems bugwas has automatically. The pipeline that is right now implemented is using a VCF file (--vcf) as input for genetic diversity in the samples, a linear mixed model for association (--lmm) that uses a kinship matrix as an indication of similarities between samples to correct for population structure (--similarity) and is also able to output bugwas lineage effects (--lineage) using the same kinship matrix as before to estimate distances between samples (--distances). As I've mentioned on the previous section, pyseer output is tab delimited, so I had to modify qqman's manhattan function in order to plot the results. What I basically did is modify the logic that loops through chromosomes in human driven GWAS and loop through conditions. Moreover, if there are similar conditions (i.e. increasing molarity of the same compound), the modified function plots these conditions on the same axes with different colors (see Results).

Discussion

Although the results obtained through pyseer were good enough to trust, there are still things that need to be addressed. The built pipeline uses SNPs as input for the genetic diversity in the population but as it's been mentioned on the introduction, an alignment-free k-mer counting is preferable due to the high variability of sequences present in the pangenome.

Moreover, bugwas was implemented in a script fashion rather than in a pipeline like pyseer's case. Even though the current state of the scripts documentation allows for proper understanding of the logic and processing, it would've been better if it was fully implemented in a pipeline.

Bibliography

Bradbury PJ, Zhang Z, Kroon DE, Casstevens TM, Ramdoss Y, Buckler ES. 2007. TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics 23: 2633–2635.

Brynildsrud O, Bohlin J, Scheffer L, Eldholm V. 2016. Rapid scoring of genes in microbial pan-genome-wide association studies with Scoary. Genome Biology 17: 238.

D. Turner S. 2018. qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots. Journal of Open Source Software 3: 731.

Earle SG, Wu C-H, Charlesworth J, Stoesser N, Gordon NC, Walker TM, Spencer CCA, Iqbal Z, Clifton DA, Hopkins KL, Woodford N, Smith EG, Ismail N, Llewelyn MJ, Peto TE, Crook DW, McVean G, Walker AS, Wilson DJ. 2016. Identifying lineage effects when controlling for population structure improves power in bacterial association studies. Nature Microbiology 1: 16041.

Galardini M, Koumoutsi A, Herrera-Dominguez L, Varela JAC, Telzerow A, Wagih O, Wartel M, Clermont O, Denamur E, Typas A, Beltrao P. 2017. Phenotype inference in an Escherichia coli strain panel. eLife, doi 10.7554/eLife.31035.

International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860–921.

Lees JA, Galardini M, Bentley SD, Weiser JN, Corander J. 2018. pyseer: a comprehensive tool for microbial pangenome-wide association studies. Bioinformatics 34: 4310–4312.

Lees JA, Vehkala M, Välimäki N, Harris SR, Chewapreecha C, Croucher NJ, Marttinen P, Davies MR, Steer AC, Tong SYC, Honkela A, Parkhill J, Bentley SD, Corander J. 2016. Sequence element enrichment analysis to determine the genetic basis of bacterial phenotypes. Nature Communications 7: 12797.

Reich D, Patterson NJ, Shadick NA, Weinblatt ME, Price AL, Plenge RM. 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38: 904–909.

The International SNP Map Working Group. 2001. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409: 928–933.

Wald A. 1943. Tests of Statistical Hypotheses Concerning Several Parameters When the Number of Observations is Large. Transactions of the American Mathematical Society 54: 426–482.

Zan Y, Carlborg Ö. 2019. Yeast growth responses to environmental perturbations are associated with rewiring of large epistatic networks. bioRxiv 659730.