Recipes for data analysis of amplicon sequencing data
Available recipes:
- 16S rRNA gene amplicon pipeline (16S_pipeline.sh)
Data processing pipeline for 16S rRNA gene amplicon raw sequencing data
This pipeline depends on USEARCH and CUTADAPT.
The pipeline has been tested with USEARCH v.10.0.240 and CUTADAPT v.1.9.1.
- Clone this repository:
git clone https://github.com/SushiLab/Amplicon_Recipes.git
- Add the 16S_pipeline.sh script to your PATH by modifying your .bashrc file: export PATH="$PATH:<path_to_16S_pipeline.sh>" (the path must be the directory that contains the script, not the script file itself).
- The pipeline requires USEARCH and CUTADAPT to be executable with the exact commands usearch and cutadapt. You can test this by typing both commands.
If this does not work, you can add aliases to your .bashrc file. Use a text editor to add the lines alias usearch=<path_to_USEARCH> and alias cutadapt=<path_to_CUTADAPT> to your .bashrc file and source it: source ~/.bashrc
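The check described above can be scripted. The following is a minimal sketch, not part of the pipeline itself; the helper name missing_tools is hypothetical, and it relies only on the standard `command -v` lookup.

```shell
# Hypothetical helper: print every required tool that the shell cannot find.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Report which of the two required commands, if any, are unavailable.
missing=$(missing_tools usearch cutadapt)
if [ -n "$missing" ]; then
  printf 'Not found:\n%s\n' "$missing"
else
  echo "usearch and cutadapt are both available"
fi
```

Note that aliases defined in .bashrc are normally only active in interactive shells, so this check may report a tool as missing when it is saved and run as a script even though the alias works at the prompt.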
The 16S rRNA gene amplicon pipeline processes demultiplexed paired-end fastq
files and produces OTU/zOTU abundance tables through the following steps:
- Merging of paired-end reads.
- Quality filtering.
- Primer matching (optional but recommended).
- Dereplication.
- OTU clustering with the UPARSE algorithm (97% id).
- zOTU denoising with the UNOISE3 algorithm.
- Taxonomic annotation of OTUs and zOTUs against the SILVA database with an LCA approach (optional).
- Quantification of OTU and zOTU abundances.
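As a rough orientation, the USEARCH steps listed above could correspond to calls like the ones printed by the sketch below. This is a dry run that only prints commands. It is an assumption about how the steps map onto USEARCH v10 commands, not a copy of what 16S_pipeline.sh actually executes. The raw read file names and the otutab_*.txt names are hypothetical, the option values are the pipeline defaults, and the optional cutadapt and taxonomy steps are left out.

```shell
# Dry-run sketch: prints one USEARCH call per step instead of running anything.
# SampleA_R1.fastq / SampleA_R2.fastq and the otutab_*.txt names are hypothetical.
print_plan() {
  cat <<'EOF'
usearch -fastq_mergepairs SampleA_R1.fastq -reverse SampleA_R2.fastq -fastq_pctid 90 -fastq_minovlen 16 -fastqout merged.fq
usearch -fastq_filter merged.fq -fastq_maxee 0.1 -fastq_minlen 100 -fastaout filtered.fa
usearch -fastx_uniques filtered.fa -sizeout -fastaout uniques.fa
usearch -cluster_otus uniques.fa -minsize 1 -otus otus_uparse.fa
usearch -unoise3 uniques.fa -zotus otus_unoise.fa
usearch -otutab merged.fq -otus otus_uparse.fa -otutabout otutab_uparse.txt
usearch -otutab merged.fq -zotus otus_unoise.fa -otutabout otutab_unoise.txt
EOF
}

print_plan
```

The real script may use different flags, intermediate files, or ordering; consult 16S_pipeline.sh for the authoritative commands.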
If -ref is used for a defined community, the pipeline processes demultiplexed paired-end fastq
files and produces OTU abundance tables through the following steps:
- Merging of paired-end reads.
- Quality filtering.
- Primer matching (optional but recommended).
- Dereplication.
- Classification by alignment to reference strain sequences.
- OTU clustering of unclassified reads with the UPARSE algorithm (97% id).
- Reclassification by alignment to reference strain sequences plus unclassified OTUs.
- Taxonomic annotation of unclassified OTUs against the SILVA database with an LCA approach (optional).
- Quantification of reference strain and unclassified OTU abundances.
No primer sequences: the primer matching step is skipped. Only recommended if primer sequences have already been removed and reads trimmed.
16S_pipeline.sh -input_f <input_folder> -output_f <output_folder> -db <path_to_SILVA_database.fasta>
Primer sequences present: the primer matching step is included. Recommended if primer sequences have not yet been removed.
16S_pipeline.sh -input_f ./data/ -output_f ./out -db <path_to_SILVA_database.fasta> -primerF <forward_primer> -primerR <reverse_primer> -threads <num_threads>
Example: using SILVA_128_SSURef_Nr99 database and the 515F-Y / 806RB primers for the V4 region:
16S_pipeline.sh -input_f ./data/ -output_f ./out -db SILVA_128_SSURef_Nr99_tax_silva_trunc.fasta -primerF GTGYCAGCMGCCGCGGTAA -primerR ATTAGAWACCCBNGTAGTCC -threads 10
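The reverse primer in this example is given as a reverse complement, as -primerR requires. The sketch below shows one way to derive it from a primer written 5' to 3'; the helper name revcomp is hypothetical, it assumes the `rev` utility is installed, and GGACTACNVGGGTWTCTAAT is assumed here to be the commonly published form of 806RB.

```shell
# Hypothetical helper: reverse-complement a primer, including IUPAC wildcards.
# Complement pairs: A/T, C/G, R/Y, K/M, B/V, D/H; S, W and N map to themselves.
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGTRYSWKMBDHVN' 'TGCAYRSWMKVHDBN'
}

# 806RB written 5'->3' (assumed), converted to the form passed to -primerR.
revcomp GGACTACNVGGGTWTCTAAT
```

The printed sequence should match the value passed to -primerR in the example above. The helper expects uppercase input.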
Mandatory options:
-input_f Path to input folder with demultiplexed raw reads fastq files.
Files must start with the sample identifier. The sample identifier is taken from the FASTQ file name by truncating at the first underscore or period.
R1 and R2 files for each sample must have identical file names apart from the read tag, e.g. 'SampleID_R1.fastq' and 'SampleID_R2.fastq'.
-output_f Path to output folder.
General options:
-threads [1-N] Number of threads used (default=1)
-help Show this help
Merging:
-pctid [0-100] Minimum percentage id of alignment (default=90)
-minoverlap [0-N] Discard pair if alignment is shorter than given value (default=16)
Quality filtering:
-maxee [0-N] Expected errors: discard reads with expected errors > maxee (default=0.1)
-minlength [0-N] Discard sequences with length < minlength (default=100)
Primer match: (skipped if the -primerF or -primerR option is missing)
-primerF Forward primer with IUPAC wildcard characters.
-primerR Reverse primer with IUPAC wildcard characters (reverse-complement needed).
-minprimfrac [0-1] Minimum fraction of the primer searched by cutadapt (default=1, i.e. the entire primer).
-maxmismatch [1-N] Number of mismatches allowed by cutadapt in each primer (default=0)
Clustering:
-minsize [0-1] Minimum sequence abundance to be considered (default=1, i.e. include singletons)
Taxonomical annotation: (skipped if -db option is missing)
-db Path to database for taxonomical annotation (SILVA db suggested: https://www.arb-silva.de/fileadmin/silva_databases/release_128/Exports/SILVA_128_SSURef_Nr99_tax_silva_trunc.fasta.gz)
-tax_id [0-1] Minimum identity for taxonomic search (default=0.90)
Defined community:
-ref Path to a fasta file of reference sequences for a defined community. If this option is given, only unclassifiable sequences will be de novo clustered.
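The file naming rule for -input_f is easy to get wrong, so it can help to preview which sample identifier each file would receive. The sketch below reproduces the stated rule (truncate at the first underscore or period) with shell parameter expansion; the helper name sample_id and the file names are hypothetical, and the script itself may implement the rule differently.

```shell
# Hypothetical helper: derive the sample identifier from a FASTQ file name
# by cutting at the first underscore or period, as described for -input_f.
sample_id() {
  name=$(basename "$1")
  printf '%s\n' "${name%%[_.]*}"
}

sample_id ./data/SampleA_R1.fastq    # SampleA
sample_id ./data/SampleA_R2.fastq    # SampleA
sample_id ./data/soil.rep1_R1.fastq  # soil
```

The last call shows a pitfall: files such as soil.rep1_R1.fastq and soil.rep2_R1.fastq would both map to the identifier soil, so sample names should not contain underscores or periods.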
The complete pipeline will produce the following output files:
Report:
report.txt
Report of the whole pipeline with basic statistics
Merging:
merged.fq
Fastq file with merged reads
merging.log
Logfile for the merging step
Filtering:
filtered.fa
Fasta file with quality filtered reads
filtered_primermatch.fa
Fasta file with reads matching the primers
filter.log
Logfile for the quality filtering step
De-replication:
uniques.fa
Fasta file with de-replicated sequences
dereplication.log
Logfile for the de-replication step
Clustering / Denoising:
otus_uparse.fa
Fasta file with OTU representative sequences
otutab_uparse.*
OTU table (3 available formats)
clustering.log
Logfile for the UPARSE clustering step
make_otutab_uparse.log
Logfile for the OTU table quantification step
otus_unoise.fa
Fasta file with zOTU sequences
otutab_unoise.*
zOTU table (3 available formats)
denoising.log
Logfile for the denoising step
make_otutab_unoise.log
Logfile for the zOTU table quantification step
Taxonomic annotation:
taxonomy_uparse_lca.txt
Taxonomic annotation of OTUs
taxonomy_unoise_lca.txt
Taxonomic annotation of zOTUs
taxsearch_uparse.tax
All hits to the taxonomic database for OTUs
taxsearch_unoise.tax
All hits to the taxonomic database for zOTUs
taxsearch_uparse.log
Logfile for the OTU taxonomic annotation step
taxsearch_unoise.log
Logfile for the zOTU taxonomic annotation step
Defined community:
initial_classification.txt
Initial classification of sequences
otutab_initial_classified.*
Initial OTU table (3 available formats)
unclassified_uniques.fa
Fasta file with dereplicated unclassified sequences
otus_unclassified.fa
Fasta file with unclassified OTU sequences
otutab_unclassified.txt
OTU table for unclassified sequences
final_references.fa
Reference strain sequences plus unclassified OTU sequences
final_classification.txt
Final classification of sequences
otutab_final_classified.*
Final OTU table (3 available formats)
taxsearch_unclassified.tax
All hits to the taxonomic database for unclassified OTUs
taxonomy_unclassified_lca.txt
Taxonomic annotation of unclassified OTUs
*.log
Each step produces a correspondingly named log file
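After a run, a quick check that the core files exist can catch a step that failed silently. This is a minimal sketch under two assumptions: the run was a default one (no -ref), and the listed files are written directly into the output folder. The helper name missing_outputs is hypothetical.

```shell
# Hypothetical helper: list core output files missing from an output folder.
# The file names come from the output list above; their location directly
# inside the output folder is an assumption.
missing_outputs() {
  for f in report.txt merged.fq filtered.fa uniques.fa otus_uparse.fa otus_unoise.fa; do
    [ -e "$1/$f" ] || printf '%s\n' "$f"
  done
}

# Prints nothing when all six files are present in ./out.
missing_outputs ./out
```

If a file is reported as missing, the log file of the corresponding step is the first place to look.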