Albert-Ludwigs-Universität Freiburg

Lehrstuhl für Bioinformatik - Institut für Informatik - http://www.bioinf.uni-freiburg.de


Bioinformatics 1

WS 2021/2022
Exercise sheet 9: Multiple Sequence Alignment

Exercise 1: Progressive Alignment by Feng and Doolittle

Given the sequences S1 = CTCACA, S2 = CAC, S3 = GTAC and the following scoring function:

[Figure: scoring function]

We want to perform a progressive alignment following Feng and Doolittle. The required pairwise alignments were computed with the Needleman-Wunsch algorithm and are as follows:

[Figure: pairwise alignments]
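The Needleman-Wunsch scores behind such pairwise alignments can be recomputed with a short dynamic program. The sketch below is only an illustration: the sheet's scoring function is given as a figure and is not reproduced here, so the `match`, `mismatch` and `gap` values are placeholders that must be replaced by the actual scores, and a linear gap cost with similarity maximization is assumed. The name `nw_score` is hypothetical.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment similarity score (Needleman-Wunsch).

    The default scores are placeholders, NOT the sheet's scoring function.
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# Scores of S1 against S2 and S3 under the placeholder scoring
for name, seq in [("S2", "CAC"), ("S3", "GTAC")]:
    print(name, nw_score("CTCACA", seq))
```

Only the score is computed here; recovering the alignment itself would additionally need a traceback through the full matrix.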

We want to follow one step of the algorithm introduced in the lecture. The following guide trees are given in Newick format.
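A guide tree in Newick format, for instance `((S1, S3), S2)`, encodes the merge order through its nesting: the innermost parenthesized group is joined first. A minimal sketch for reading off that group, assuming trees that contain only leaf labels, commas and parentheses (no branch lengths); `first_cherry` is a hypothetical helper name.

```python
import re


def first_cherry(newick):
    """Return the leaf labels of the first innermost group '(X, Y)'."""
    # An innermost group contains no nested parentheses.
    match = re.search(r"\(([^()]+)\)", newick)
    if match is None:
        raise ValueError("no parenthesized group found")
    return [label.strip() for label in match.group(1).split(",")]


print(first_cherry("((S1, S3), S2)"))  # ['S1', 'S3']
```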

a) Starting with the guide tree ((S1, S3), S2), what would be the starting group1?

b) Use the Needleman-Wunsch algorithm to generate all pairwise alignments against group1 and calculate their respective similarity score.

c) Based on the previously calculated pairwise alignments, what are the possible choices for group2?

d) Calculate the sum-of-pairs scores for each of the possible group2 choices.

e) Which alignment will be chosen as group2 for the next step?

f) Based on what you have learned, what are the alignments and sum-of-pairs scores for the guide tree ((S2, S3), S1)?
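The sum-of-pairs score asked for in d) and f) adds up the column scores of every pair of rows in a multiple alignment. The following sketch shows the bookkeeping under stated assumptions: the `match`, `mismatch` and `gap` values are placeholders rather than the sheet's scoring function, a column pairing two gaps is assumed to score 0, and the example rows are a made-up toy alignment, not a solution to the exercise.

```python
from itertools import combinations


def sum_of_pairs(rows, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an alignment given as equal-length gapped rows."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows must have the same length")

    def pair_score(x, y):
        if x == "-" and y == "-":
            return 0  # assumption: gap against gap is neutral
        if x == "-" or y == "-":
            return gap
        return match if x == y else mismatch

    # Every unordered pair of rows contributes its column-wise score
    return sum(
        pair_score(x, y)
        for r1, r2 in combinations(rows, 2)
        for x, y in zip(r1, r2)
    )


toy_alignment = ["AC-", "A-C", "ACC"]  # hypothetical rows
print(sum_of_pairs(toy_alignment))  # -3 with the placeholder scores
```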

Exercise 2: Scoring Matrices

Determine the correct text by replacing the highlighted words with the correct anagrams:

Scoring matrices reflect the fact that amino acids with similar (A) isehcayhilccpmo properties can be more easily substituted than those without similar characteristics, since exchanging dissimilar amino acids is more likely to cause (B) rsistnupoid to the structure and function. This type of disruptive (C) uuttotsbnisi is less likely to be selected in evolution because it renders (D) iouanctfnlonn proteins.

PAM matrices, except PAM1, are derived from an (E) raniyeooutvl model. The increasing PAM numbers correlate with increasing PAM units and thus evolutionary (F) tsnascedi of protein sequences. For example, PAM250, which corresponds to about 20% amino acid (G) tyedniti, represents 250 mutations per 100 residues (a position could mutate several times). In theory, the number of (E) raniyeooutvl changes approximately corresponds to an expected (E) raniyeooutvl span of 2,500 million years. Thus, the PAM250 matrix is normally used for (H) neirtdgve sequences.

BLOSUM matrices are derived from direct observation of every possible amino acid (C) uuttotsbnisi in multiple sequence alignments. Instead of using the (I) earoopatitnlx function, the BLOSUM matrix numbers are actual percentage identity values of the sequences selected for constructing the matrices. For example, BLOSUM62 indicates that the sequences selected for constructing the matrix share an average identity value of 62%.

This is why the PAM matrices are used most often for reconstructing (J) ogctnihpyele trees. However, because of the mathematical (I) earoopatitnlx procedure used, the PAM values may be less realistic for (H) neirtdgve sequences.
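The PAM paragraph above notes that a position could mutate several times, which is why 250 changes per 100 residues do not erase all similarity. A toy sketch of that effect, assuming a hypothetical 4-letter alphabet in which every step keeps a residue with probability 0.99 and spreads the remaining 0.01 evenly over the other letters (this is not the real 20x20 PAM1 matrix): multiplying the one-step matrix with itself gives the probabilities after 250 steps.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """m raised to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    while power > 0:
        if power & 1:
            result = mat_mul(result, m)
        m = mat_mul(m, m)
        power >>= 1
    return result


# Toy one-step matrix: stay with probability 0.99, otherwise move to one
# of the 3 other letters with probability 0.01 / 3 each (assumed values).
step = [[0.99 if i == j else 0.01 / 3 for j in range(4)] for i in range(4)]
after_250 = mat_pow(step, 250)

# Probability that a position shows its original letter after 250 steps
print(round(after_250[0][0], 3))
```

With these toy numbers roughly 28% of positions still show their original letter after 250 steps, because repeated and reverse changes at the same position are counted without altering the final outcome. The figure for real amino acid data differs, since both the alphabet size and the change probabilities differ.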