
Sequence File Formats

Ricky Woo edited this page Sep 20, 2017 · 7 revisions

1. TAB-separated GTF files

1.1 Columns

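  <table>
    <thead>
      <tr>
        <th>column-number</th>
        <th>content</th>
        <th>values/format</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>1</td>
        <td>chromosome name</td>
        <td>chr{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,M}</td>
      </tr>
      <tr>
        <td>2</td>
        <td>annotation source</td>
        <td>{ENSEMBL,HAVANA}</td>
      </tr>
      <tr>
        <td>3</td>
        <td>feature-type</td>
        <td>{gene,transcript,exon,CDS,UTR,start_codon,stop_codon,Selenocysteine}</td>
      </tr>
      <tr>
        <td>4</td>
        <td>genomic start location</td>
        <td>integer-value (1-based)</td>
      </tr>
      <tr>
        <td>5</td>
        <td>genomic end location</td>
        <td>integer-value</td>
      </tr>
      <tr>
        <td>6</td>
        <td>score (not used)</td>
        <td>.</td>
      </tr>
      <tr>
        <td>7</td>
        <td>genomic strand</td>
        <td>{+,-}</td>
      </tr>
      <tr>
        <td>8</td>
        <td>genomic phase (for CDS features)</td>
        <td>{0,1,2,.}</td>
      </tr>
      <tr>
        <td>9</td>
        <td>additional information as key-value pairs</td>
        <td>see below</td>
      </tr>
    </tbody>
  </table>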
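
As a rough illustration of the column layout above, the following Python sketch splits one GTF line into its nine TAB-separated fields. The names `GtfRecord`, `parse_gtf_line` and `feature_length` are hypothetical and not part of any library. The length calculation assumes the usual GTF convention that the end coordinate is inclusive.

```python
from typing import NamedTuple, Optional


class GtfRecord(NamedTuple):
    """One GTF line; field names are illustrative, not an official schema."""
    seqname: str
    source: str
    feature: str
    start: int            # 1-based
    end: int              # assumed inclusive
    score: str            # not used, normally "."
    strand: str           # "+" or "-"
    phase: Optional[int]  # 0, 1, 2, or None when the column holds "."
    attributes: str       # raw column 9, parsed separately


def parse_gtf_line(line: str) -> GtfRecord:
    """Split a single GTF line on TABs and convert the numeric columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 TAB-separated columns, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, phase, attributes = fields
    if strand not in {"+", "-"}:
        raise ValueError(f"unexpected strand: {strand!r}")
    return GtfRecord(
        seqname, source, feature, int(start), int(end), score, strand,
        None if phase == "." else int(phase), attributes,
    )


def feature_length(record: GtfRecord) -> int:
    """Number of bases covered, counting both start and end positions."""
    return record.end - record.start + 1
```

The parser expects real TAB characters between columns, so lines copied from a rendered page may need their runs of spaces converted back to TABs first.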

1.2 Column 9

Mandatory key:value pairs

  <table>
    <thead>
      <tr>
        <th>
          key name
        </th>
        <th>
          value format
        </th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>
          gene_id
        </td>
        <td>
          ENSGXXXXXXXXXXX *
        </td>
      </tr>
      <tr>
        <td>
          transcript_id
        </td>
        <td>
          ENSTXXXXXXXXXXX *
        </td>
      </tr>
      <tr>
        <td>
          gene_type
        </td>
        <td>
          <a href="gencode_biotypes.html">list of biotypes</a>
        </td>
      </tr>
      <tr>
        <td>
          gene_status
        </td>
        <td>
          {KNOWN, NOVEL, PUTATIVE}
        </td>
      </tr>
      <tr>
        <td>
          gene_name
        </td>
        <td>
          string
        </td>
      </tr>
      <tr>
        <td>
          transcript_type
        </td>
        <td>
          <a href="gencode_biotypes.html">list of biotypes</a>
        </td>
      </tr>
      <tr>
        <td>
          transcript_status
        </td>
        <td>
          {KNOWN, NOVEL, PUTATIVE}
        </td>
      </tr>
      <tr>
        <td>
          transcript_name
        </td>
        <td>
          string
        </td>
      </tr>
      <tr>
        <td>
          exon_number
        </td>
        <td>
          indicates the biological position of the exon in the transcript
        </td>
      </tr>
      <tr>
        <td>
          exon_id
        </td>
        <td>
          ENSEXXXXXXXXXXX *
        </td>
      </tr>
      <tr>
        <td>
          level
        </td>
        <td>
          1 (verified loci),<br />
          2 (manually annotated loci),<br />
          3 (automatically annotated loci)
        </td>
      </tr>
    </tbody>
  </table> 
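
Column 9 can be turned into a lookup structure with a small parser. The sketch below is one possible approach, not a reference implementation: `parse_attributes`, `missing_keys` and `REQUIRED_KEYS` are hypothetical names. Values are collected into lists because some keys (for example `tag`) may occur more than once on a line. As an assumption, `REQUIRED_KEYS` only lists the keys that appear on every row of the example in section 1.4, so `exon_number` and `exon_id` are left out.

```python
import re

# One `key value;` pair: the value is either a quoted string or a bare token
# such as the number in `level 2;`.
_PAIR = re.compile(r'\s*(\S+)\s+(?:"([^"]*)"|([^";\s]+))\s*;')

# Assumption: keys seen on every example row of this page.
REQUIRED_KEYS = (
    "gene_id", "transcript_id", "gene_type", "gene_status", "gene_name",
    "transcript_type", "transcript_status", "transcript_name", "level",
)


def parse_attributes(text: str) -> dict[str, list[str]]:
    """Map each key in column 9 to the list of values given for it."""
    attrs: dict[str, list[str]] = {}
    for match in _PAIR.finditer(text):
        key, quoted, bare = match.groups()
        value = quoted if quoted is not None else bare
        attrs.setdefault(key, []).append(value)
    return attrs


def missing_keys(attrs: dict[str, list[str]], required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent from a parsed attribute map."""
    return [key for key in required if key not in attrs]
```

A caller would typically pass the ninth column of a line to `parse_attributes` and treat a non-empty result from `missing_keys` as a malformed record.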

Optional fields

  <table>
    <thead>
      <tr>
        <th>
          key name
        </th>
        <th>
          value format
        </th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>
          tag
        </td>
        <td>
          part of a special set [*]: &#160;{pseudo_consens,CCDS,seleno};<br />
          or annotation remarks ["cds_start_NF", "mRNA_end_NF", etc.]<br />
          <a href="gencode_tags.html">list of tags</a>
        </td>
      </tr>
      <tr>
        <td>
          ccdsid
        </td>
        <td>
          official CCDS id [*]; &#160;CCDS*
        </td>
      </tr>
      <tr>
        <td>
          havana_gene
        </td>
        <td>
          gene-id in the havana db [0,1];&#160; OTTHUMG*
        </td>
      </tr>
      <tr>
        <td>
          havana_transcript
        </td>
        <td>
          transcript-id in the havana db [0,1] ; &#160;OTTHUMT*
        </td>
      </tr>
      <tr>
        <td>
          protein_id
        </td>
        <td>
          ENSPXXXXXXXXXXX [0,1] (Ensembl protein id of protein coding transcript)
        </td>
      </tr>
      <tr>
        <td>
          ont
        </td>
        <td>
          pseudogene (or other) ontology ids [*]; &#160;{PGO:0000004 and others}
        </td>
      </tr>
      <tr>
        <td>
          transcript_support_level
        </td>
        <td>
          transcripts are scored according to how well mRNA and EST alignments match over their full length [0,1]<br />
          1 (all splice junctions of the transcript are supported by at least one non-suspect mRNA),<br />
          2 (the best supporting mRNA is flagged as suspect or the support is from multiple ESTs),<br />
          3 (the only support is from a single EST),<br />
          4 (the best supporting EST is flagged as suspect),<br />
          5 (no single transcript supports the model structure),<br />
          NA (the transcript was not analyzed)
        </td>
      </tr>
    </tbody>
  </table>
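
The `transcript_support_level` values above lend themselves to a simple filter. The following sketch assumes the value has already been extracted from column 9 as a string (or is `None` when the key is absent). The function names and the default cutoff of 2 are illustrative choices, not part of the format.

```python
from typing import Optional


def tsl_rank(value: Optional[str]) -> Optional[int]:
    """Return 1..5 for a scored transcript, or None if the key is absent or 'NA'."""
    if value is None or value == "NA":
        return None
    rank = int(value)
    if not 1 <= rank <= 5:
        raise ValueError(f"unexpected transcript_support_level: {value!r}")
    return rank


def is_well_supported(value: Optional[str], max_rank: int = 2) -> bool:
    """True when the transcript was analyzed and scored at max_rank or better.

    Lower numbers mean stronger support; max_rank=2 is an arbitrary example cutoff.
    """
    rank = tsl_rank(value)
    return rank is not None and rank <= max_rank
```

Transcripts that were not analyzed (`NA`) are treated as unsupported here; a pipeline that prefers to keep them would need to handle the `None` case differently.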

1.4 Example GTF File

chr21   HAVANA  transcript      10862622        10863067        .       +       .       gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
chr21   HAVANA  exon    10862622        10862667        .       +       .       gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
chr21   HAVANA  CDS     10862622        10862667        .       +       0       gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
chr21   HAVANA  start_codon     10862622        10862624        .       +       0       gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
chr21   HAVANA  exon    10862751        10863067        .       +       .       gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
chr21   HAVANA  CDS     10862751        10863064        .       +       2       gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
chr21   HAVANA  stop_codon      10863065        10863067        .       +       0       gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
chr21   HAVANA  UTR     10863065        10863067        .       +       .       gene_id "ENSG00000169861"; transcript_id "ENST00000302092"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "IGHV1OR15-5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "IGHV1OR15-5-001"; level 2; havana_gene "OTTHUMG00000074130"; havana_transcript "OTTHUMT00000157419";
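
The phase values in the two CDS rows above can be checked arithmetically. The sketch below assumes the common definition of phase as the number of bases to skip at the start of a CDS segment before the next complete codon begins, and assumes segments are visited in transcription order (ascending start for the `+` strand, as here). `next_phase` is a hypothetical helper name.

```python
def next_phase(cds_length: int, phase: int) -> int:
    """Phase of the following CDS segment, given this segment's length and phase."""
    return (3 - (cds_length - phase) % 3) % 3


# CDS coordinates taken from the example rows (1-based, end assumed inclusive).
cds_segments = [(10862622, 10862667), (10862751, 10863064)]

phase = 0  # the first CDS segment starts on a codon boundary
phases = []
for start, end in cds_segments:
    phases.append(phase)
    phase = next_phase(end - start + 1, phase)

print(phases)  # [0, 2]
```

The first segment covers 46 bases, which is 15 codons plus one leftover base, so the second segment needs two more bases to finish that codon and gets phase 2, matching column 8 of the example. After the second segment the running phase returns to 0, consistent with the stop codon following directly at 10863065.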

A bioinformatics wiki for the course BI462.
