
FluDB coding part alignment

Sergey Venev edited this page Jan 27, 2016 · 16 revisions

Available scripts with brief description

  • filter_fludb.py takes fasta files containing piles of influenza segments downloaded from fludb.org and filters out "corrupted" sequences. Corrupted in this case means sequences that are too short, have too many undefined nucleotides, etc. Filtered sequences are written to separate files, one for each segment.

  • align_coding_seq.py takes the filtered fasta file for each segment, identifies the coding part of each sequence in the file and aligns these coding sequences. The coding part is identified with primers that are stored in separate files for each segment, where a primer is merely a ~20-30 nt long piece of the consensus coding sequence taken from its 3' or 5' end. Some of the sequences are discarded during the alignment step due to deletions in the sequence body, poor primer matching, etc. The resulting alignments are written to a separate file for each segment.

  • align_8_segments.sh launches align_coding_seq.py 8 times in sequence, once for each influenza segment, with the corresponding primers.

  • subsample_msa_random.py takes a single MSA file and a subsample size N for the output. It generates a human-readable, html-formatted MSA with N sequences, including the sequence with the longest 3' UTR and the sequence with the longest 5' UTR (if UTRs are present in the alignment).

  • characterize_8_alignments.py takes the 8 MSAs, one for each influenza segment, and prints a brief statistical report. The report includes each MSA's size and length, the number and fraction of variable positions, the significantly variable positions, the most variable position, and gap info (gaps are allowed only near the 3' and 5' ends of the coding sequence). The script also calculates the consensus of the coding sequence and stores it in a separate fasta file (all 8 segments).

  • get_loci_of_interest.py takes the 8 MSAs for all influenza segments and a reference flu genome, and generates a list of interesting positions in the genome that must be tested for association with RNA structure down the road. The script extracts detailed codon-level and amino-acid-level information about all mutations present in the alignment. The extracted information can then be used to define the criteria for loci selection.
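To make the filtering step of filter_fludb.py more concrete, here is a minimal sketch of what such a filter might look like. The thresholds MIN_LENGTH and MAX_UNDEFINED_FRAC and the helper names read_fasta, is_corrupted and filter_records are assumptions made for this illustration only; the actual script may use different criteria and cut-offs.

```python
# Hypothetical cut-offs: this page does not state the real values.
MIN_LENGTH = 800            # assumed minimum acceptable segment length (nt)
MAX_UNDEFINED_FRAC = 0.01   # assumed maximum fraction of non-ACGT symbols


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def is_corrupted(seq, min_length=MIN_LENGTH,
                 max_undefined_frac=MAX_UNDEFINED_FRAC):
    """Flag sequences that are too short or carry too many undefined nt."""
    if not seq or len(seq) < min_length:
        return True
    undefined = sum(1 for nt in seq if nt not in "ACGT")
    return undefined / len(seq) > max_undefined_frac


def filter_records(records, **thresholds):
    """Keep only the (header, sequence) pairs that pass is_corrupted()."""
    return [(h, s) for h, s in records if not is_corrupted(s, **thresholds)]
```

Running filter_records once per segment file and writing the survivors back out would give one filtered fasta file per segment, as described in the first item of the list above.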

Trying to do some lab-journaling using the updated capabilities of the Python notebook ...

ipy-example

Extra notes

FluDB coding part alignment

Some notes on the get_aligned_pos(sequence,primer,direction) function, which aligns primers to the reference sequence. The function returns the coordinates of the start and stop positions of the putative coding sequence in the reference frame of the provided sequence. So, for a sequence of length L, the start position can have a negative coordinate, while the stop position can exceed L. Two illustrations accompany this.

For the left side:

Figure: left side: determine start position of the putative coding sequence

And for the right:

Figure: right side: determine stop position of the putative coding sequence and its length

It is important to get precise coordinates, as they are further used to construct the alignment of the putative coding sequences.
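The coordinate convention described above can be illustrated with a small sketch. This is not the actual get_aligned_pos implementation, whose internals are not shown on this page. It assumes a simple ungapped sliding match in which the primer may overhang either end of the sequence, and a half-open stop coordinate. The names best_primer_offset, putative_cds_bounds, pad_to_cds and the min_overlap parameter are hypothetical.

```python
def best_primer_offset(sequence, primer, min_overlap=4):
    """Return (offset, matches) for the best ungapped primer placement.

    offset is the coordinate of primer[0] in the frame of `sequence`.
    It is negative when the primer overhangs the left end, and
    offset + len(primer) exceeds len(sequence) when it overhangs the right.
    """
    L, P = len(sequence), len(primer)
    if L == 0 or P == 0:
        raise ValueError("sequence and primer must be non-empty")
    min_overlap = max(1, min(min_overlap, L, P))
    best_offset, best_matches = None, -1
    for offset in range(min_overlap - P, L - min_overlap + 1):
        lo, hi = max(offset, 0), min(offset + P, L)   # overlapping region
        matches = sum(sequence[i] == primer[i - offset] for i in range(lo, hi))
        if matches > best_matches:                    # first best wins ties
            best_offset, best_matches = offset, matches
    return best_offset, best_matches


def putative_cds_bounds(sequence, start_primer, stop_primer, min_overlap=4):
    """Start and (exclusive) stop of the coding part, in sequence coordinates."""
    start, _ = best_primer_offset(sequence, start_primer, min_overlap)
    stop_offset, _ = best_primer_offset(sequence, stop_primer, min_overlap)
    return start, stop_offset + len(stop_primer)


def pad_to_cds(sequence, start, stop):
    """Cut out the coding part, padding missing ends with '-' gaps."""
    L = len(sequence)
    body = sequence[max(start, 0):min(stop, L)]
    return "-" * max(0, -start) + body + "-" * max(0, stop - L)
```

With coordinates of this kind, a sequence that lacks the first few coding nucleotides gets a negative start and is padded with gaps on the left, so every padded coding sequence ends up the same length and can be stacked into the alignment. A production version would also need a minimum score threshold to discard sequences with poor primer matching.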