The content of the .exonic_variant_function file is empty. #270

Open
leiwang567 opened this issue Jan 7, 2025 · 5 comments

Comments

@leiwang567

Hello, and thank you for taking the time to help. ANNOVAR is a very useful annotation tool, but I have run into a problem that I cannot solve. I previously used minimap2 and SyRI to align the whole genomes of chimpanzee and bonobo and to identify SVs between them. I am now using ANNOVAR to annotate the structural variations in the SyRI output files.
I built the annotation database myself from the latest chimpanzee genome sequence (.fasta) and annotation (.gff3) files, which produced pantro_refGeneMrna.fa and pantro_refGene.txt. When I ran annotate_variation.pl for gene-based annotation, the output file pantro-panpan.exonic_variant_function was empty, and the first column of pantro-panpan.variant_function was "intergenic" on every line. When I then ran table_annovar.pl with this single self-built RefSeq database, the output files were pantro-panpan.pantro_multianno.csv and pantro-panpan.refGene.invalid_input. The content of pantro-panpan.pantro_multianno.csv is shown in the image below. What could cause this, and is there a mistake in my procedure? I would be grateful for an answer, as this is very important to me.
[Image: screenshot of pantro-panpan.pantro_multianno.csv]

@kaichop
Contributor

kaichop commented Jan 7, 2025

This means that the variants are not annotated to any chromosome. There are many possible reasons, so you should manually check the pantro_refGene files to see what is wrong. For example, the chromosome name may be "1" instead of "chr1", or the locations (start-end) may be relative to the transcript rather than the assembly (chr1).

Without any details, I cannot tell where the issue is. Please read FAQ #1.
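
The chromosome-name check suggested above can be sketched with standard tools. This is a minimal sketch, not part of ANNOVAR: `missing_chroms` is a hypothetical helper, the demo files are made up, and it assumes the chromosome name sits in column 1 of the avinput file and in column 2 of the gene definition file (the genePredExt layout). If the gene file has an extra leading column, pass the right column number as the third argument.

```shell
# Hypothetical helper (not an ANNOVAR script): list chromosome names that occur
# in the avinput file (column 1) but never in the gene definition file.
# Assumption: the gene file keeps the chromosome in column 2 (genePredExt);
# override with a third argument if your file differs.
missing_chroms() {
    avinput=$1
    genefile=$2
    chrom_col=${3:-2}
    awk -F '\t' -v col="$chrom_col" '
        FILENAME == ARGV[1] { known[$col] = 1; next }  # first file: gene definitions
        !($1 in known)      { missing[$1] = 1 }        # second file: variants
        END { for (c in missing) print c }
    ' "$genefile" "$avinput" | sort
}

# Tiny made-up inputs that reproduce a "1" versus "chr1" mismatch.
demo_dir=$(mktemp -d)
printf 'NM_demo\tchr1\t+\t100\t900\t150\t850\t1\t100,\t900,\t0\tGENE1\tcmpl\tcmpl\t0,\n' > "$demo_dir/demo_refGene.txt"
printf '1\t200\t200\tA\tG\nchr1\t300\t300\tC\tT\n' > "$demo_dir/demo.avinput"

# Prints every chromosome name the gene file does not know about.
missing_chroms "$demo_dir/demo.avinput" "$demo_dir/demo_refGene.txt"
```

An empty result means every chromosome name in the variant file also appears in the gene file; any name that is printed cannot match a gene by name.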

@leiwang567
Author

I built the annotation database myself from the latest chimpanzee genome data. First, I converted PanTro.gff3 to pantro.gtf with the command “gffread PanTro.gff3 -T -o pantro.gtf”. Next, I converted the GTF file to a GenePred file with the command “gtfToGenePred -genePredExt pantro.gtf pantro_refGene.txt”. Finally, I generated pantro_refGeneMrna.fa with the command “retrieve_seq_from_fasta.pl --format refGene --seqfile PanTro.fna pantro_refGene.txt --out pantro_refGeneMrna.fa”. Here are all the files mentioned above. I look forward to your reply.
[Images: four screenshots of the files mentioned above]

@leiwang567
Author

I'm sorry to bother you again, but your reply is very important to me. I'm very confused about the issue mentioned above and don't know where to start. I sincerely hope you can offer some suggestions for correcting or improving it. Best regards!

@kaichop
Contributor

kaichop commented Jan 9, 2025

The pantro_refGeneMrna.fa file looks fine to me. The refGene.txt file also looks okay to me.

If you send me the files (or just the first few hundred genes), I can check them further to see what the issue is. One possibility is that many genes do not have a correct ORF (in your figure, that gene has the warning), so there is no output, but I need to see the file to know the percentage.

Also, what command did you run, and what LOG/NOTICE messages were printed after you ran it? I need to see them to advise where things went wrong, as I mentioned in FAQ #1.
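
As a rough first look at the ORF question, the CDS status columns of a genePredExt file can be tallied. This is only a proxy and an assumption on my part: it does not reproduce ANNOVAR's own ORF check, `cds_status_summary` is a hypothetical helper, and the demo records are invented. It assumes the 15-column genePredExt layout, where columns 13 and 14 are cdsStartStat and cdsEndStat with values such as `cmpl`, `incmpl`, `unk`, or `none`; the optional second argument shifts the columns if the file has extra leading columns.

```shell
# Hypothetical helper: count transcripts whose CDS start and end are both
# marked complete ("cmpl") in a genePredExt file.
# Assumption: cdsStartStat/cdsEndStat are columns 13 and 14; pass an offset
# (for example 1) as the second argument if the file has a leading extra column.
cds_status_summary() {
    offset=${2:-0}
    awk -F '\t' -v off="$offset" '
        {
            if ($(13 + off) == "cmpl" && $(14 + off) == "cmpl") complete++
            else other++
        }
        END { printf "complete=%d other=%d total=%d\n", complete, other, NR }
    ' "$1"
}

# Made-up records: one complete CDS, one non-coding, one incomplete CDS start.
demo_dir=$(mktemp -d)
printf 'NM_a\tchr1\t+\t100\t900\t150\t850\t1\t100,\t900,\t0\tG1\tcmpl\tcmpl\t0,\n'  > "$demo_dir/demo_refGene.txt"
printf 'NR_b\tchr1\t-\t100\t900\t900\t900\t1\t100,\t900,\t0\tG2\tnone\tnone\t-1,\n' >> "$demo_dir/demo_refGene.txt"
printf 'XM_c\tchr2\t+\t10\t50\t10\t50\t1\t10,\t50,\t0\tG3\tincmpl\tcmpl\t0,\n'     >> "$demo_dir/demo_refGene.txt"

cds_status_summary "$demo_dir/demo_refGene.txt"
```

A large `other` share would be consistent with the possibility raised above, but the warnings in the ANNOVAR output remain the authoritative source.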

@leiwang567
Author

Since my input file is too large, extracting only part of the content cannot ensure consistency. Therefore, please forgive me for directly sending you the links to the data sources. Could you please help me test them? The links for the chimpanzee genome and the .gff3 format annotation file are as follows:
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/028/858/775/GCF_028858775.2_NHGRI_mPanTro3-v2.0_pri/GCF_028858775.2_NHGRI_mPanTro3-v2.0_pri_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/028/858/775/GCF_028858775.2_NHGRI_mPanTro3-v2.0_pri/GCF_028858775.2_NHGRI_mPanTro3-v2.0_pri_genomic.gff.gz

First, I built the annotation database, named pantrodb. After decompressing the files above, I converted the GFF file to a GTF file with the following command:
gffread pantro.gff3 -T -o pantro.gtf
I then converted the GTF file to a GenePred file:
gtfToGenePred -genePredExt pantro.gtf pantro_refGene.txt
Finally, I generated the pantro_refGeneMrna.fa file:
retrieve_seq_from_fasta.pl --format refGene --seqfile PanTro.fna pantro_refGene.txt --out pantro_refGeneMrna.fa
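
After a conversion chain like this, a quick structural check of the GenePred file can catch obvious problems before annotation. The following is a minimal sketch under stated assumptions: `check_genepred` is a hypothetical helper, the demo records are invented, and the checks follow the 15-column genePredExt layout (txStart and txEnd in columns 4 and 5, cdsStart and cdsEnd in columns 6 and 7, exonCount in column 8, exonStarts in column 9). It does not replicate how ANNOVAR parses the file.

```shell
# Hypothetical helper: basic structural checks on a genePredExt file.
# Counts lines without 15 fields, lines with inconsistent coordinates, and
# lines whose exonCount disagrees with the number of exon starts.
check_genepred() {
    awk -F '\t' '
        NF != 15 { wrong_fields++; next }
        {
            # txStart must be below txEnd, and the CDS must lie inside the transcript.
            if (($4 + 0) >= ($5 + 0) || ($6 + 0) < ($4 + 0) || ($7 + 0) > ($5 + 0))
                bad_coords++
            # exonStarts ends with a trailing comma; drop it before counting.
            starts = $9
            sub(/,$/, "", starts)
            if (split(starts, parts, ",") != ($8 + 0))
                exon_mismatch++
        }
        END {
            printf "lines=%d wrong_field_count=%d bad_coords=%d exon_count_mismatch=%d\n",
                   NR, wrong_fields, bad_coords, exon_mismatch
        }
    ' "$1"
}

# Made-up records: one valid, one with txStart > txEnd, one with a wrong
# exonCount, and one with only 14 fields.
demo_dir=$(mktemp -d)
printf 'NM_a\tchr1\t+\t100\t900\t150\t850\t2\t100,500,\t300,900,\t0\tG1\tcmpl\tcmpl\t0,1,\n' > "$demo_dir/demo_refGene.txt"
printf 'NM_b\tchr1\t-\t900\t100\t900\t100\t1\t900,\t100,\t0\tG2\tcmpl\tcmpl\t0,\n'         >> "$demo_dir/demo_refGene.txt"
printf 'NM_c\tchr2\t+\t10\t50\t10\t50\t2\t10,\t50,\t0\tG3\tcmpl\tcmpl\t0,\n'               >> "$demo_dir/demo_refGene.txt"
printf 'NM_d\tchr2\t+\t10\t50\t10\t50\t1\t10,\t50,\t0\tG4\tcmpl\tcmpl\n'                   >> "$demo_dir/demo_refGene.txt"

check_genepred "$demo_dir/demo_refGene.txt"
```

All counters other than `lines` should be zero for a well-formed file; nonzero values point to records worth inspecting by hand.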

Next, the annotation itself. I previously used SyRI to identify structural variations between the chimpanzee and bonobo genomes, and I used its output file syri.vcf as the input for annotation.
I have uploaded syri.vcf as a compressed archive named syri.vcf.tar.gz. The link is https://github.com/leiwang567/chimpanzee-data/syri.vcf.tar.gz

I used the following command to convert the .vcf file to an .avinput file:
convert2annovar.pl -format vcf4 -allsample -withfreq -includeinfo syri.vcf -outfile syri_pantro.avinput
This step printed the following messages:
[Image: screenshot of the messages printed by convert2annovar.pl]
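
Before conversion, it can help to know what kinds of ALT alleles the VCF contains, because the VCF format allows symbolic alleles such as `<INV>` that carry no sequence. The sketch below only counts them. Whether the syri.vcf in this thread contains such records is an assumption that has not been confirmed here, `alt_type_summary` is a hypothetical helper, and the demo VCF is invented.

```shell
# Hypothetical helper: count VCF records whose ALT field (column 5) is a
# symbolic allele (starts with "<") versus an explicit sequence.
alt_type_summary() {
    awk -F '\t' '
        /^#/      { next }              # skip meta and header lines
        $5 ~ /^</ { symbolic++; next }  # symbolic allele such as <INV>
                  { sequence++ }        # explicit ALT sequence
        END { printf "symbolic=%d sequence=%d\n", symbolic, sequence }
    ' "$1"
}

# Made-up VCF with one symbolic record and two sequence-resolved records.
demo_dir=$(mktemp -d)
printf '##fileformat=VCFv4.3\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$demo_dir/demo.vcf"
printf 'chr1\t100\tSNP1\tA\tG\t.\tPASS\t.\n'           >> "$demo_dir/demo.vcf"
printf 'chr1\t500\tINV1\tN\t<INV>\t.\tPASS\tEND=900\n' >> "$demo_dir/demo.vcf"
printf 'chr2\t700\tDEL1\tACGT\tA\t.\tPASS\t.\n'        >> "$demo_dir/demo.vcf"

alt_type_summary "$demo_dir/demo.vcf"
```

Comparing these counts with the number of lines in the resulting .avinput file shows how many records survived the conversion.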

After that, I ran table_annovar.pl for annotation with the following command:
table_annovar.pl ./syri_pantro.avinput pantrodb/ -buildver pantro -out pantro-panpan -remove -protocol refGene -operation g -nastring . -csvout > pantro-panpan.stdout.log 2> pantro-panpan.stderr.log
The pantro-panpan.stderr.log file is at https://github.com/leiwang567/chimpanzee-data/pantro-panpan.stderr.log
Part of the pantro-panpan.pantro_multianno.csv file is shown below.
[Image: screenshot of part of pantro-panpan.pantro_multianno.csv]

I then tried the other script, annotate_variation.pl, with the following command:
annotate_variation.pl --geneanno --buildver pantro -dbtype refGene syri_pantro.avinput --outfile pantro-panpan ../pantrodb/
The output file pantro-panpan.exonic_variant_function is empty. Part of the pantro-panpan.variant_function file is shown below; the first column is "intergenic" on every line.
[Image: screenshot of part of pantro-panpan.variant_function]
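
To confirm that every line really is "intergenic" rather than judging from a screenshot, the first column of the .variant_function file can be tallied. This is a minimal sketch: `region_summary` is a hypothetical helper and the demo file is invented; it relies only on the first tab-separated column holding the region type.

```shell
# Hypothetical helper: count how often each region type appears in the first
# column of a .variant_function file, printed as "type=count".
region_summary() {
    cut -f1 "$1" | sort | uniq -c | awk '{ print $2 "=" $1 }'
}

# Made-up .variant_function content: three intergenic lines and one exonic line.
demo_dir=$(mktemp -d)
printf 'intergenic\tNONE(dist=NONE),GENE1(dist=500)\tchr1\t100\t100\tA\tG\n' > "$demo_dir/demo.variant_function"
printf 'intergenic\tGENE1(dist=40),GENE2(dist=900)\tchr1\t950\t950\tC\tT\n' >> "$demo_dir/demo.variant_function"
printf 'exonic\tGENE2\tchr1\t2000\t2000\tG\tA\n'                             >> "$demo_dir/demo.variant_function"
printf 'intergenic\tGENE2(dist=10),NONE(dist=NONE)\tchr1\t5000\t5000\tT\tC\n' >> "$demo_dir/demo.variant_function"

region_summary "$demo_dir/demo.variant_function"
```

On a real file, a single `intergenic=...` line in the output would confirm the observation for the whole file.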

The log file pantro-panpan.log is at https://github.com/leiwang567/chimpanzee-data/pantro-panpan.log

The annotation results show no exonic variants at all, which I believe is incorrect. I would appreciate any suggestions for improvement after you have tested the data. Thank you very much! Best regards!
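
One way to test the expectation that some variants should fall inside genes, independently of ANNOVAR, is to count how many variant intervals overlap any transcript interval on the same chromosome. This is a rough sketch under stated assumptions: `count_overlaps` is a hypothetical helper, the demo files are invented, the gene file is assumed to follow the genePredExt layout (chromosome, txStart, txEnd in columns 2, 4, 5, with a 0-based start), and the avinput coordinates are 1-based and inclusive. The comparison is quadratic, so it is meant for a subset of the data rather than whole genomes.

```shell
# Hypothetical helper: count variants that overlap at least one transcript
# interval with the same chromosome name.
# Assumptions: genePredExt columns (chrom=2, txStart=4 0-based, txEnd=5) and
# avinput columns (chrom=1, start=2, end=3, 1-based inclusive).
count_overlaps() {
    avinput=$1
    genefile=$2
    awk -F '\t' '
        FILENAME == ARGV[1] {            # first file: transcripts
            n++
            chrom[n] = $2; tx_start[n] = $4 + 0; tx_end[n] = $5 + 0
            next
        }
        {                                # second file: variants
            total++
            for (i = 1; i <= n; i++) {
                if (chrom[i] == $1 && ($2 + 0) <= tx_end[i] && ($3 + 0) > tx_start[i]) {
                    hit++
                    break
                }
            }
        }
        END { printf "variants=%d overlapping_a_transcript=%d\n", total, hit }
    ' "$genefile" "$avinput"
}

# Made-up inputs: one transcript on chr1 and three variants, only one inside it.
demo_dir=$(mktemp -d)
printf 'NM_a\tchr1\t+\t100\t900\t150\t850\t1\t100,\t900,\t0\tG1\tcmpl\tcmpl\t0,\n' > "$demo_dir/demo_refGene.txt"
printf 'chr1\t200\t250\t0\t-\n'     > "$demo_dir/demo.avinput"
printf 'chr1\t1000\t1100\t0\t-\n'  >> "$demo_dir/demo.avinput"
printf 'chr2\t200\t250\t0\t-\n'    >> "$demo_dir/demo.avinput"

count_overlaps "$demo_dir/demo.avinput" "$demo_dir/demo_refGene.txt"
```

If this count is zero on real data, the mismatch lies in the chromosome names or coordinates of the input files rather than in the annotation step.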
