The content of the .exonic_variant_function file is empty. #270

leiwang567 · 2025-01-07T06:23:51Z

Hello, thank you very much for taking the time to help me with my confusion. I have to admit that ANNOVAR is a very useful annotation tool. However, I am currently facing a tricky problem and I am seeking your help. Previously, I used minimap2 and SyRI to perform sequence alignment and SV identification on the whole genomes of chimpanzees and bonobos. Now, I am using ANNOVAR to annotate the structural variations in the output files from SyRI.
For the annotation database, I used the latest chimpanzee genome sequence .fasta and .gff3 annotation files to build it myself, resulting in pantro_refGeneMrna.fa and pantro_refGene.txt. When I used annotate_variation.pl for gene-based annotation, the result file pantro-panpan.exonic_variant_function was empty, and the first column of the pantro-panpan.variant_function file was all intergenic. Moreover, when I used the table_annovar.pl script to re-annotate with a single annotation database (the RefSeq database I built myself), the result files were pantro-panpan.pantro_multianno.csv and pantro-panpan.refGene.invalid_input. The content of the pantro-panpan.pantro_multianno.csv file is shown in the image below. I am very puzzled about this situation. What could be the reason for the above situation? Or are there any mistakes in my operation process? I sincerely hope for your answer, as this is very important to me!

kaichop · 2025-01-07T14:35:22Z

This means that the variants are not annotated to any chromosome. It could be due to many reasons, so you want to manually check pantro_refGene files to see what's wrong, for example, "1" instead of "chr1" is used as chromosome name, or that the location (start-end) is based on transcript rather than assembly (chr1), etc.
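The chromosome-name check suggested above can be sketched as follows. This is a minimal illustration, not part of ANNOVAR: the function names are hypothetical, and the column positions are assumptions (chromosome in the first column of the .avinput file; in the gene file, the third column if a leading UCSC-style "bin" column is present, otherwise the second column, as in raw `gtfToGenePred -genePredExt` output).

```python
from typing import Dict, Iterable, List, Set


def avinput_chroms(lines: Iterable[str]) -> Set[str]:
    """Collect chromosome names from the first column of .avinput lines."""
    names: Set[str] = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        names.add(line.split()[0])
    return names


def genepred_chroms(lines: Iterable[str]) -> Set[str]:
    """Collect chromosome names from genePredExt-style lines.

    Assumption: 16 or more tab-separated columns means a leading "bin"
    column (chromosome at index 2); exactly 15 columns means no "bin"
    column (chromosome at index 1). Shorter lines are skipped.
    """
    names: Set[str] = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 15:
            continue
        chrom_index = 2 if len(fields) >= 16 else 1
        names.add(fields[chrom_index])
    return names


def compare_names(av: Set[str], gp: Set[str]) -> Dict[str, List[str]]:
    """Report which chromosome names are shared and which appear on one side only."""
    return {
        "shared": sorted(av & gp),
        "only_in_avinput": sorted(av - gp),
        "only_in_refgene": sorted(gp - av),
    }


# Possible usage with the file names from this thread:
# with open("syri_pantro.avinput") as a, open("pantrodb/pantro_refGene.txt") as g:
#     print(compare_names(avinput_chroms(a), genepred_chroms(g)))
```

If the "shared" list comes back empty, the two files use different naming schemes for the same sequences, which would be consistent with every variant being reported as intergenic.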

Without providing any details at all, I cannot tell where the issue is. Please read FAQ #1.

leiwang567 · 2025-01-08T03:10:37Z

For the annotation library, I used the latest chimpanzee genome data to build it myself. First, I converted the PanTro.gff3 file to pantro.gtf using the command “gffread PanTro.gff3 -T -o pantro.gtf”. Next, I converted the GTF file to a GenePred file using the command “gtfToGenePred -genePredExt pantro.gtf pantro_refGene.txt”. Finally, I generated the pantro_refGeneMrna.fa file using the command “retrieve_seq_from_fasta.pl --format refGene --seqfile PanTro.fna pantro_refGene.txt --out pantro_refGeneMrna.fa”. Here are all the files mentioned above. Looking forward to your reply.

leiwang567 · 2025-01-09T02:02:55Z

I'm sorry to bother you again, but your reply is very important to me. I'm very confused about the issue mentioned above and don't know where to start. I sincerely hope you can offer some suggestions for correcting or improving it. Best regards!

kaichop · 2025-01-09T19:51:53Z

The pantro_refGeneMrna.fa file looks fine to me. The refGene.txt file also looks okay to me.

If you just send me the files (or just the first few hundred genes), then perhaps I can check them further to see what the issue is. One possibility is that many genes may not have the correct ORF (in your figure, that gene has the warning), so there is no output, but I need to see the file to know the percentage.

Also, what command did you run, and what was the LOG/NOTICE message after you ran it? I need to see it to advise where things went wrong, as I mentioned in FAQ #1.

leiwang567 · 2025-01-10T04:09:09Z

Since my input file is too large, extracting only part of the content cannot ensure consistency. Therefore, please forgive me for directly sending you the links to the data sources. Could you please help me test them? The links for the chimpanzee genome and the .gff3 format annotation file are as follows:
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/028/858/775/GCF_028858775.2_NHGRI_mPanTro3-v2.0_pri/GCF_028858775.2_NHGRI_mPanTro3-v2.0_pri_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/028/858/775/GCF_028858775.2_NHGRI_mPanTro3-v2.0_pri/GCF_028858775.2_NHGRI_mPanTro3-v2.0_pri_genomic.gff.gz

First, the construction of the annotation library, named pantrodb. After decompressing the files above, I used the following command to convert the GFF file to a GTF file.
gffread pantro.gff3 -T -o pantro.gtf
I then used the following command to convert the GTF file to a GenePred file.
gtfToGenePred -genePredExt pantro.gtf pantro_refGene.txt
Finally, I generated the pantro_refGeneMrna.fa file.
retrieve_seq_from_fasta.pl --format refGene --seqfile PanTro.fna pantro_refGene.txt --out pantro_refGeneMrna.fa

Next, for the annotation part. Previously, I used SyRI to identify structural variations between the chimpanzee and bonobo genomes, and used the result file syri.vcf as the input file for annotation.
Here, I have uploaded the compressed package of syri.vcf named syri.vcf.tar.gz. The link is https://github.com/leiwang567/chimpanzee-data/syri.vcf.tar.gz
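Because the input describes structural variants, it may help to know how the ALT column of the VCF is written before conversion. The sketch below tallies ALT alleles by kind; the function name is hypothetical. It assumes standard tab-separated VCF records with ALT in the fifth column, and treats anything in angle brackets (for example `<INV>`) as a symbolic allele. Whether symbolic alleles are what ends up in the .invalid_input file in this case is an assumption that would need to be confirmed against the actual files.

```python
from collections import Counter
from typing import Iterable


def alt_allele_kinds(vcf_lines: Iterable[str]) -> Counter:
    """Count ALT alleles as 'symbolic', 'sequence', or 'other'."""
    counts: Counter = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a complete VCF record
        for alt in fields[4].split(","):
            if alt.startswith("<") and alt.endswith(">"):
                counts["symbolic"] += 1   # e.g. <INV>, <DUP>
            elif alt and set(alt.upper()) <= set("ACGTN"):
                counts["sequence"] += 1   # explicit bases
            else:
                counts["other"] += 1      # ".", breakends, etc.
    return counts


# Possible usage with the file name from this thread:
# with open("syri.vcf") as handle:
#     print(alt_allele_kinds(handle))
```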

Use the following command to convert the .vcf file to an .avinput file.
convert2annovar.pl -format vcf4 -allsample -withfreq -includeinfo syri.vcf -outfile syri_pantro.avinput
When running this step, the prompt content is as follows.

After that, use table_annovar.pl for annotation with the following command.
table_annovar.pl ./syri_pantro.avinput pantrodb/ -buildver pantro -out pantro-panpan -remove -protocol refGene -operation g -nastring . -csvout > pantro-panpan.stdout.log 2> pantro-panpan.stderr.log
The link for the pantro-panpan.stderr.log file is https://github.com/leiwang567/chimpanzee-data/pantro-panpan.stderr.log
The content of part of the pantro-panpan.pantro_multianno.csv file is as follows.

After that, I chose to use another script annotate_variation.pl for annotation. The command is as follows.
annotate_variation.pl --geneanno --buildver pantro -dbtype refGene syri_pantro.avinput --outfile pantro-panpan ../pantrodb/
The result file pantro-panpan.exonic_variant_function is empty. Part of the content of the pantro-panpan.variant_function file is as follows. The first column of the file is entirely "intergenic".
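The observation that the first column is entirely "intergenic" can be confirmed over the whole file rather than by eye. A minimal sketch, assuming the .variant_function file is tab-separated with the region label in the first column; the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable


def region_counts(lines: Iterable[str]) -> Counter:
    """Tally the region label found in the first tab-separated column."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        label = line.rstrip("\n").split("\t", 1)[0]
        counts[label] += 1
    return counts


# Possible usage with the file name from this thread:
# with open("pantro-panpan.variant_function") as handle:
#     print(region_counts(handle).most_common())
```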

The link for the log file pantro-panpan.log is https://github.com/leiwang567/chimpanzee-data/pantro-panpan.log

I believe that the annotation results do not show any variants related to exons, which I think is incorrect. Therefore, I would appreciate it if you could provide me with some suggestions for improvement after testing. Thank you very much! Best regards!
