What criteria was used to collect the HSDs in HSDatabase? #7

Open
zx0223winner opened this issue Aug 13, 2022 · 2 comments

Comments

@zx0223winner
Owner

zx0223winner commented Aug 13, 2022

Although there is no golden rule for distinguishing partial duplicates from more complete ones, candidate HSDs tend to differ in amino acid length by less than 50% and to have conserved domains of similar function.

To balance the sensitivity and accuracy of HSD detection, we improved the detection of duplicate genes and reduced the “snowball effect” by using a series of combo thresholds, spanning 90%_10aa to 90%_100aa through 50%_10aa to 50%_100aa. The combo threshold was built up from the following series of thresholds: E + (D + (C + (B + A))).

A = 90%_100aa+(90%_70aa+(90%_50aa+(90%_30aa+90%_10aa)))
B = 80%_100aa+(80%_70aa+(80%_50aa+(80%_30aa+80%_10aa)))
C = 70%_100aa+(70%_70aa+(70%_50aa+(70%_30aa+70%_10aa)))
D = 60%_100aa+(60%_70aa+(60%_50aa+(60%_30aa+60%_10aa)))
E = 50%_100aa+(50%_70aa+(50%_50aa+(50%_30aa+50%_10aa)))
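
The nesting above can be read as a merge order, with the innermost parentheses processed first. The sketch below lists the 25 thresholds in that order; the function name `combo_order` and the reading of the parentheses as "innermost first" are assumptions made for illustration, not part of HSDFinder itself.

```python
# Identity levels for groups A, B, C, D, E (A is innermost in E + (D + (C + (B + A)))).
IDENTITIES = [90, 80, 70, 60, 50]
# Length differences within a group, innermost (strictest) first.
LENGTH_DIFFS = [10, 30, 50, 70, 100]


def combo_order(identities=IDENTITIES, length_diffs=LENGTH_DIFFS):
    """Return threshold labels in the assumed merge order, strictest first."""
    return [f"{ident}%_{diff}aa" for ident in identities for diff in length_diffs]


order = combo_order()
print(order[:5])   # the five thresholds that make up group A
print(len(order))  # total number of thresholds
```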

@zx0223winner
Owner Author

A combination of thresholds was used to acquire a larger dataset of HSD candidates. The results of an all-against-all protein sequence similarity search using BLASTP (E-value cutoff of ≤1e-10) were filtered by requiring hits to fall within a given amino acid length difference and to exceed a given amino acid pairwise identity. HSD candidates were added one after another at different homology assessment metrics (i.e., HSDs identified at more relaxed thresholds were treated more strictly than those found using more conservative thresholds).
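
As a rough illustration of the filtering step, the sketch below applies one threshold (for example 90%_10aa) to a handful of made-up BLASTP hits. The `Hit` record, the `passes` function, the gene names, and the choice of inclusive comparisons (`>=`, `<=`) are all assumptions; protein lengths are taken from a separate dictionary because they are not part of the default tabular BLAST output.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    identity: float  # percent pairwise identity reported by BLASTP
    evalue: float


def passes(hit, lengths, min_identity, max_len_diff, max_evalue=1e-10):
    """Check one hit against an identity / length-difference threshold."""
    if hit.query == hit.subject:  # ignore self-hits
        return False
    len_diff = abs(lengths[hit.query] - lengths[hit.subject])
    return (
        hit.evalue <= max_evalue
        and hit.identity >= min_identity
        and len_diff <= max_len_diff
    )


# Hypothetical hits and protein lengths (in amino acids).
hits = [
    Hit("g1", "g2", 95.0, 1e-50),
    Hit("g1", "g3", 92.0, 1e-40),
    Hit("g1", "g1", 100.0, 0.0),
    Hit("g2", "g4", 85.0, 1e-30),
]
lengths = {"g1": 300, "g2": 305, "g3": 340, "g4": 300}

kept_90_10 = [h for h in hits if passes(h, lengths, 90, 10)]
kept_90_100 = [h for h in hits if passes(h, lengths, 90, 100)]
print([(h.query, h.subject) for h in kept_90_10])
print([(h.query, h.subject) for h in kept_90_100])
```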

For example, HSDs identified at a threshold of 90%_30aa were added to those identified at a threshold of 90%_10aa (denoted as “90%_30aa+90%_10aa”). Any redundant HSD candidate picked out at the more relaxed threshold (i.e., 90%_30aa) was removed if it was identical to, or contained the same gene copies as, an HSD from the stricter cut-off (i.e., 90%_10aa).
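
One possible reading of this redundancy rule is sketched below: each HSD is treated as a set of gene IDs, and a candidate from the relaxed threshold is dropped when it shares any gene with an HSD that has already been kept. The function `merge_thresholds` and the "any shared gene" criterion are assumptions for illustration; the actual script may define redundancy differently.

```python
def merge_thresholds(stricter, relaxed):
    """Add relaxed-threshold HSDs to stricter ones, skipping redundant candidates."""
    kept = [frozenset(hsd) for hsd in stricter]
    seen = set().union(*kept)  # every gene already assigned to a kept HSD
    for candidate in relaxed:
        candidate = frozenset(candidate)
        if candidate & seen:  # shares a gene copy with a kept HSD: redundant
            continue
        kept.append(candidate)
        seen |= candidate
    return kept


# Hypothetical HSDs found at 90%_10aa and 90%_30aa.
hsd_90_10 = [{"g1", "g2"}]
hsd_90_30 = [{"g1", "g2"}, {"g1", "g2", "g3"}, {"g7", "g8"}]

merged = merge_thresholds(hsd_90_10, hsd_90_30)
print([sorted(h) for h in merged])
```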

Moreover, any HSD candidates pinpointed at the combo threshold (90%_30aa+90%_10aa) were removed if the minimum gene copy length was less than half of the maximum gene copy length for that HSD, or if the HSD candidate had gene copies with incomplete conserved domains (i.e., different numbers of Pfam domains). After filtering the combo threshold (90%_30aa+90%_10aa), we added a more relaxed threshold, 90%_50aa (i.e., 90%_50aa+(90%_30aa+90%_10aa)), and then carried out the same HSD candidate removal/filtering process.
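
The two removal rules in this step can be sketched as a single check per HSD. The function `keep_hsd`, the input dictionaries, and the Pfam accessions below are placeholders chosen for illustration, and comparing only the number of domains per gene copy follows the wording above rather than any confirmed implementation.

```python
def keep_hsd(genes, lengths, pfam_domains):
    """Return True if an HSD candidate survives the length and domain filters."""
    sizes = [lengths[g] for g in genes]
    # Rule 1: the shortest copy must be at least half as long as the longest copy.
    if min(sizes) < max(sizes) / 2:
        return False
    # Rule 2: every copy must carry the same number of Pfam domains.
    domain_counts = {len(pfam_domains.get(g, ())) for g in genes}
    return len(domain_counts) == 1


# Hypothetical protein lengths (aa) and Pfam annotations.
lengths = {"a": 100, "b": 60, "c": 40, "d": 90}
pfam_domains = {
    "a": ["PF00001"],
    "b": ["PF00001"],
    "c": ["PF00001"],
    "d": ["PF00001", "PF00002"],
}

print(keep_hsd(["a", "b"], lengths, pfam_domains))  # similar lengths, same domain count
print(keep_hsd(["a", "c"], lengths, pfam_domains))  # shortest copy under half the longest
print(keep_hsd(["a", "d"], lengths, pfam_domains))  # different numbers of Pfam domains
```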

To minimize redundancy and to acquire a larger dataset of HSD candidates, we processed each selected species with the following combination of thresholds: E + (D + (C + (B + A))).

@zx0223winner
Owner Author

At the same time, since you have already mastered the usage of HSDFinder, if you are interested in detecting more duplicates in your fish genomes, or are worried about missing any important duplicate genes, I would suggest reading the criteria we used to collect duplicates in HSDatabase, #7.

To acquire more HSDs for each of your species, I will need you to re-run HSDFinder with different thresholds; right now you only have 90_10 for each of your species (e.g., Aven.hsd.species.txt). Here, 90_10 represents 90% amino acid identity within a 10 aa length difference. The complete set of 25 files for each of your species is:

90_10; 90_30; 90_50; 90_70; 90_100
80_10; 80_30; 80_50; 80_70; 80_100
70_10; 70_30; 70_50; 70_70; 70_100
60_10; 60_30; 60_50; 60_70; 60_100
50_10; 50_30; 50_50; 50_70; 50_100

You can do the batch work locally or run the thresholds one at a time online. In total, for your 14 species you will end up with 350 HSD files. Please name each file “species_name.number_number.txt”, as in the list below, and place each set of 25 HSD files in its own folder, one folder per fish species (14 folders in total). I have a custom script that can process all the files at once.

Arabidopsis_thaliana.50_100.txt
Arabidopsis_thaliana.50_10.txt
Arabidopsis_thaliana.50_30.txt
Arabidopsis_thaliana.50_50.txt
Arabidopsis_thaliana.50_70.txt
Arabidopsis_thaliana.60_100.txt
Arabidopsis_thaliana.60_10.txt
Arabidopsis_thaliana.60_30.txt
Arabidopsis_thaliana.60_50.txt
Arabidopsis_thaliana.60_70.txt
Arabidopsis_thaliana.70_100.txt
Arabidopsis_thaliana.70_10.txt
Arabidopsis_thaliana.70_30.txt
Arabidopsis_thaliana.70_50.txt
Arabidopsis_thaliana.70_70.txt
Arabidopsis_thaliana.80_100.txt
Arabidopsis_thaliana.80_10.txt
Arabidopsis_thaliana.80_30.txt
Arabidopsis_thaliana.80_50.txt
Arabidopsis_thaliana.80_70.txt
Arabidopsis_thaliana.90_100.txt
Arabidopsis_thaliana.90_10.txt
Arabidopsis_thaliana.90_30.txt
Arabidopsis_thaliana.90_50.txt
Arabidopsis_thaliana.90_70.txt
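
Before handing the folders over, it may help to check that each species folder contains all 25 files. The sketch below is only a convenience check built on the naming pattern shown above; the helper names `expected_files` and `missing_files` and the folder path in the usage line are hypothetical.

```python
from pathlib import Path

IDENTITIES = (90, 80, 70, 60, 50)
LENGTH_DIFFS = (10, 30, 50, 70, 100)


def expected_files(species):
    """List the 25 file names expected for one species."""
    return [
        f"{species}.{ident}_{diff}.txt"
        for ident in IDENTITIES
        for diff in LENGTH_DIFFS
    ]


def missing_files(folder, species):
    """Return the expected file names that are not present in the folder."""
    folder = Path(folder)
    return [name for name in expected_files(species) if not (folder / name).is_file()]


names = expected_files("Arabidopsis_thaliana")
print(len(names))
# Hypothetical folder; an absent folder simply reports every file as missing.
print(missing_files("fish_hsd/Arabidopsis_thaliana", "Arabidopsis_thaliana")[:3])
```

Running `missing_files` once per species folder should give an empty list for every species when all 350 files are in place.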
