Wikis for genome (re)annotation

Steven Salzberg (who is nominated for the Franklin award at bioinformatics.org) has an opinion piece in Genome Biology proposing wiki technology to help solve the problem of genome annotations getting out of date.

The problem comes down to how annotations are banked. Some people regard GenBank as the gold-standard master for annotations, but it only provides a bank for the sequences, which must still be verified and curated. The RefSeq database provides that curation for a select number of organisms, and cannot keep up with the volume of genomes and annotations. In addition, there are no good ways to systematically improve annotations that are deposited in GenBank. For example, suppose a relatively naive gene calling algorithm was used on a genome for its publication, and improved programs have since been written. One would prefer to have the best gene calls available, even if both sets are only computational predictions with no experimental verification. But these may only be submitted as Third Party Annotations (TPA), and they need to be published in a peer-reviewed journal. In addition, annotations that can’t be submitted to TPA include

“Annotation that has arisen from an automated tool, such as GeneMark, tRNA scan or ORF finder, where no further evidence, experimental or otherwise, is presented for the annotation.”

So external sources have to serve as the repository of these annotations, leading people to create separate sites to maintain their data. This issue is part of the motivation for our site as well as one I created to represent genome re-annotation (and the initial annotation).

This idea of wikis for gene/genome information has been discussed recently, including in some letters (1, 2) to Nature, on the nodalpoint site (1, 2), and by the Holmes group as a possible part of their AJAX-enabled GBrowse project.

We would benefit from such a system, as we’ve reannotated a number of published genomes that are poorly or completely unannotated and are basically orphaned if they aren’t part of a model organism database (MOD). It is also hard to correct annotations in GenBank, as systematic reannotations are not necessarily accepted without a peer-reviewed publication.

The problem right now is that there is too much room for things to get lost. The sequencing centers are typically the ones annotating the genome; their annotations are deposited in GenBank and taken as the standard, and it is very difficult for individual (and perhaps not informatics-savvy) researchers to correct the annotation for their genes of interest. The technology for this exists, but the infrastructure is not in place to streamline their contributions. Often the manually curated and automated predictions are merged by the annotation team in a semi-automated way that depends on the maturity (accuracy) of the genome sequence and the amount of existing annotation. Improving the C. elegans genome annotation is probably going to need to be a lot different from the first round of annotation of the elephant genome.

Of course, letting everyone post gene predictions and annotations may not work. There will need to be gatekeepers to verify that submitted information is valid. These could be parsers written against an agreed-upon set of rules on how to specify non-canonical splice sites, partial genes, pseudogenes, RNA editing, and other fun special cases. The work of projects like the Sequence Ontology has made formalizing these sorts of rules and grammars much more straightforward, so that helper scripts can validate annotations.
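
To make the gatekeeper idea a bit more concrete, here is a rough sketch of what one such helper script might check. Everything in it is an assumption for illustration: the function name validate_gene_model, the allow_noncanonical flag, and the input format (forward-strand exons as 1-based inclusive coordinates) are all made up here, and it does not use the Sequence Ontology or any real submission format. It only checks that exon coordinates are sane and that each intron starts with GT and ends with AG unless the submitter has explicitly declared the splice sites to be non-canonical.

```python
from typing import List, Tuple

# Canonical donor/acceptor dinucleotides on the forward strand (GT-AG rule).
CANONICAL_SPLICE = ("GT", "AG")


def validate_gene_model(sequence: str,
                        exons: List[Tuple[int, int]],
                        allow_noncanonical: bool = False) -> List[str]:
    """Return a list of problems found; an empty list means the model passed.

    Hypothetical helper: exons are (start, end) pairs, 1-based and inclusive,
    on the forward strand of `sequence`.
    """
    problems = []
    seq = sequence.upper()

    # Every exon must lie inside the sequence and have start <= end.
    for start, end in exons:
        if start < 1 or end > len(seq) or start > end:
            problems.append(f"exon {start}-{end} has invalid coordinates")
    if problems:
        return problems

    # Walk neighbouring exons in coordinate order and inspect each intron.
    ordered = sorted(exons)
    for (s1, e1), (s2, e2) in zip(ordered, ordered[1:]):
        if s2 <= e1:
            problems.append(f"exons {s1}-{e1} and {s2}-{e2} overlap")
            continue
        if s2 == e1 + 1:
            problems.append(f"exons {s1}-{e1} and {s2}-{e2} have no intron between them")
            continue
        # 1-based inclusive intron (e1+1 .. s2-1) as a 0-based slice.
        intron = seq[e1:s2 - 1]
        donor, acceptor = intron[:2], intron[-2:]
        if (donor, acceptor) != CANONICAL_SPLICE and not allow_noncanonical:
            problems.append(
                f"intron {e1 + 1}-{s2 - 1} has non-canonical splice sites "
                f"{donor}..{acceptor}"
            )
    return problems
```

A wiki could run something like this on each submitted gene model and only pass it along to a human curator if the list comes back empty. Partial genes, pseudogenes, reverse-strand models and RNA editing would each need their own agreed-upon rules, which this sketch does not attempt.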