AgamP3.4/Notes on genebuild

From VectorBase Help System

AgamP3.4 was released at VectorBase in July 2007. The same annotation is visible at Ensembl from release 45.

The gene, transcript and protein identifiers are now in VectorBase format, instead of Ensembl format. Please see the separate document for details and guidance on mapping old identifiers to new ones.

Changes from previous gene set AgamP3.3

Significant differences from the previous annotation include:

  • Many more manually-appraised gene models, including most models on chromosome arm 2L.
  • Better identification of repeats (especially transposons) leading to a reduction in models that may be transposon-derived.
  • Improved handling of community-provided annotation.
  • Use of selected VectorBase transcript models from Aedes aegypti as an additional evidence source.
  • Improvements to protein-based models due to better parameterization of GeneWise.

Details of genebuild

The AgamP3.4 gene annotation was prepared by combining sets of transcript models made by different approaches.

  1. Manually-curated models (including alternative transcripts).
  2. Models built with GeneWise using Anopheles proteins (from public databases or contributed directly by Anopheles researchers) were given EST-based extensions, and merged to give a non-redundant set (allowing alternative transcripts).
  3. Models built with GeneWise using arthropod proteins were given EST-based extensions where possible, and merged to give a non-redundant set (not allowing alternative transcripts). The proteins came primarily from Drosophila (FlyBase 4.3 set) and Aedes (AaegL1.1 entries that have EST support), plus selected Arthropoda entries from UniProt.
  4. EST-based models built solely from A. gambiae ESTs using the ClusterMerge algorithm.
  5. Protein-based models built with GeneWise using other Metazoa entries from UniProt were given EST-based extensions (rarely possible), and merged to give a non-redundant set (not allowing alternative transcripts).
  6. SNAP ab initio predictions that have an identifiable Pfam domain but do not overlap with repeats.
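
As an illustration of the Set 6 criterion only, the filter can be sketched as below. This is a minimal sketch, not the pipeline's actual code: the `Interval` type, the `keep_prediction` function and the input shapes are hypothetical, and "overlap" is assumed to mean any shared coordinate on the same sequence region.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    """A genomic span; coordinates are 1-based and inclusive (assumed)."""
    seq_region: str
    start: int
    end: int


def intersects(a: Interval, b: Interval) -> bool:
    """True if the two spans share at least one base on the same region."""
    return a.seq_region == b.seq_region and a.start <= b.end and b.start <= a.end


def keep_prediction(locus: Interval, pfam_domains: List[str],
                    repeats: List[Interval]) -> bool:
    """Keep an ab initio prediction only if it has at least one Pfam domain
    and does not overlap any annotated repeat (hypothetical helper)."""
    has_domain = len(pfam_domains) > 0
    hits_repeat = any(intersects(locus, r) for r in repeats)
    return has_domain and not hits_repeat


# Example input, invented for illustration.
repeats = [Interval("2L", 200, 300)]
kept = keep_prediction(Interval("2L", 100, 150), ["PF00069"], repeats)
dropped_repeat = keep_prediction(Interval("2L", 250, 400), ["PF00069"], repeats)
dropped_no_domain = keep_prediction(Interval("2L", 100, 150), [], repeats)
```

Here only the first prediction would be kept: the second overlaps a repeat and the third has no identifiable domain.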

The final gene set was produced by progressively adding models from the different approaches. First, Set 1 and Set 2 transcripts were combined, giving priority to manually-curated models in cases of conflict. Set 3 genes were then added, but only where they did not overlap a Set 1 or Set 2 model. Genes from Sets 4, 5 and 6 were then added in turn in the same way, each only where it did not overlap a higher-priority model.
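
The priority scheme described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the Ensembl or VectorBase implementation: the `Model` type and `progressive_merge` function are hypothetical, "overlap" is assumed to mean same-strand coordinate overlap on the same sequence region, and the combination of Sets 1 and 2 is reduced to dropping any Set 2 model that overlaps a manually-curated one.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Model:
    """A gene model reduced to its genomic span (hypothetical type)."""
    name: str
    seq_region: str   # e.g. chromosome arm "2L"
    start: int        # 1-based, inclusive (assumed)
    end: int          # inclusive
    strand: int = 1


def overlaps(a: Model, b: Model) -> bool:
    """Assumed overlap rule: same region, same strand, shared coordinates."""
    return (a.seq_region == b.seq_region and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)


def progressive_merge(tiers: List[List[Model]]) -> List[Model]:
    """Add tiers in priority order; a model is accepted only if it does not
    overlap any model accepted from a higher-priority tier."""
    accepted: List[Model] = []
    for tier in tiers:
        higher = list(accepted)  # snapshot: same-tier models may overlap
        for model in tier:
            if not any(overlaps(model, h) for h in higher):
                accepted.append(model)
    return accepted


# Example tiers, invented for illustration (highest priority first).
tiers = [
    [Model("manual_1", "2L", 100, 500)],
    [Model("anoph_1", "2L", 400, 900), Model("anoph_2", "2L", 1000, 1500)],
    [Model("arth_1", "2L", 1400, 1800), Model("arth_2", "3R", 100, 500)],
]
final_names = [m.name for m in progressive_merge(tiers)]
```

Taking a snapshot of the higher-priority models before each tier means that models within one tier never exclude each other, which leaves room for the alternative transcripts allowed in Sets 1 and 2.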

In addition, tRNA genes were predicted using the program tRNAscan-SE, and a small number of miRNA genes were predicted by homology with miRBase entries.

References:

GeneWise and its use within the Ensembl system for gene model annotation.

  • E. Birney et al., Genome Res. 2004 14:988-95
  • V. Curwen et al., Genome Res. 2004 14:942-50

EST alignment to genomes using Exonerate.

  • G. Slater et al., BMC Bioinformatics. 2005 6:31

ClusterMerge algorithm.

  • E. Eyras et al., Genome Res. 2004 14:976-87

Ab initio gene finding by SNAP.

  • I. Korf, BMC Bioinformatics. 2004 5:59

tRNAscan-SE.

  • T. Lowe et al., Nucleic Acids Res. 1997 25:955-64

miRBase.
