Wikis for genome (re)annotation

Steven Salzberg (who is nominated for the Franklin award at bioinformatics.org) has an opinion piece in Genome Biology proposing wiki technology to help solve the problem of genome annotations getting out of date.

The problem comes down to how annotations are banked. Some people regard GenBank as the gold-standard master for annotations, but it only provides a bank for the sequences, which must still be verified and curated. The RefSeq database provides that for a select number of organisms, and cannot keep up with the volume of genomes and annotations. In addition, there are no good ways to systematically improve annotations that are deposited in GenBank. For example, suppose a relatively naive gene-calling algorithm was used on a genome for its publication, and improved programs have since been written. One would prefer to have the best gene calls available, even if both are only computational predictions with no experimental verification. But these may only be submitted as Third Party Annotations (TPA), and they need to appear in a peer-reviewed publication. In addition, annotations that can’t be submitted to TPA include:

“Annotation that has arisen from an automated tool, such as GeneMark, tRNA scan or ORF finder, where no further evidence, experimental or otherwise, is presented for the annotation.”

So external sources have to serve as the repository of these annotations, leading people to create separate sites to maintain their data. This issue is part of the motivation for our site as well as one I created to represent genome re-annotation (and the initial annotation).

This idea of wikis for gene/genome information has been discussed recently, including in some letters (1, 2) to Nature, on the nodalpoint site (1, 2), and by the Holmes group as a possible part of their AJAX-enabled GBrowse project.

We would benefit from such a system, as we’ve reannotated a number of published but poorly or completely unannotated genomes that are basically orphaned if they aren’t part of a model organism database (MOD). It is also hard to correct annotations in GenBank, as systematic reannotations are not necessarily accepted without a peer-reviewed publication.

The problem right now is that there is too much room for things to get lost. The sequencing centers are typically the ones annotating the genome; these annotations are deposited in GenBank and taken as the standard, and it is very difficult for individual (and perhaps not informatics-savvy) researchers to correct the annotation for their genes of interest. The framework for this technology exists, but the infrastructure is not in place to streamline their contributions. Often the manually curated and automated predictions are merged by the annotation team in a semi-automated way that depends on the maturity (accuracy) of the genome sequence and the amount of existing annotation. Improving the C. elegans genome annotation is probably going to need to be a lot different from the first round of annotation of the elephant genome.

Of course, letting everyone post gene predictions and annotations may not work. There will need to be gatekeepers to verify that submitted information is valid. Some of these gatekeepers could be parsers, written with an agreed-upon set of rules on how to specify non-canonical splice sites, partial genes, pseudogenes, RNA editing, and other fun special cases. The work of projects like the Sequence Ontology has made formalizing these sorts of rules and grammars much more straightforward, so that helper scripts can validate annotations.
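
As a rough sketch of what such a helper script might look like, here is a small Python validator for a submitted gene model. Everything in it is an assumption made for illustration: the GeneModel class, the validate function, the noncanonical_ok flag, and the single forward-strand GT..AG rule are hypothetical and are not taken from the Sequence Ontology or any existing tool.

```python
from dataclasses import dataclass

# Hypothetical rule: the only intron boundary accepted without an explicit
# flag is GT..AG (donor..acceptor), checked on the forward strand only.
CANONICAL_SPLICE_SITES = {("GT", "AG")}


@dataclass
class GeneModel:
    """A submitted gene prediction; exons are 1-based inclusive (start, end)."""
    name: str
    exons: list
    noncanonical_ok: bool = False  # submitter explicitly declares unusual sites


def validate(model, sequence):
    """Return a list of problems; an empty list means the model passes."""
    problems = []
    exons = sorted(model.exons)

    # Rule 1: every exon must lie inside the sequence and have start <= end.
    for start, end in exons:
        if start < 1 or end > len(sequence) or start > end:
            problems.append(f"exon {start}-{end} has invalid coordinates")
    if problems:
        return problems

    # Rule 2: consecutive exons must be separated by an intron whose first
    # and last two bases are canonical, unless the submitter flagged them.
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        if next_start <= prev_end + 1:
            problems.append(f"exons overlap or abut at {prev_end}/{next_start}")
            continue
        intron = sequence[prev_end:next_start - 1]  # 0-based slice of intron
        sites = (intron[:2].upper(), intron[-2:].upper())
        if sites not in CANONICAL_SPLICE_SITES and not model.noncanonical_ok:
            problems.append(
                f"non-canonical splice sites {sites[0]}..{sites[1]} "
                f"after exon ending at {prev_end}"
            )
    return problems
```

A wiki front end could run something like validate(GeneModel("geneA", [(1, 6), (15, 20)]), sequence) on each submission and show the returned problems to the contributor instead of silently rejecting the edit. Partial genes, pseudogenes, reverse-strand models, and RNA editing would each need their own agreed-upon rule.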

5 thoughts on “Wikis for genome (re)annotation”

  1. Several organism databases are working on using wikis for annotation. The link on my name is for EcoliWiki, a prototype of a community annotation system we are setting up for E. coli as part of the new EcoliHub project. SGD is adding a wiki to replace its previous community annotation system; DictyBase, Xanthusbase, and Wormbase all have them.

    We also have created a wiki-based browser for the Gene Ontology at http://gowiki.tamu.edu.

    As noted in the post, there are a number of challenges to be met in using wikis for annotation. What will be interesting is to see how the wikis themselves can be used as a mechanism to create and refine protocols for community annotation via these new web media.

  2. The analysis of the 12 Drosophila genomes was done (is being done?) by wiki. I’m not sure if this wiki will be used now that the genome papers have been submitted for publication, but it could be a useful place to submit reannotations by different groups.

  3. These are all nice things, but setting up a wiki with downloadable data is not the same as a wiki for annotation of a genome. I don’t want users to have to be experts in informatics to be able to make an improvement to the gene set. But I’m not really designing or developing a genome wiki so I’ll stop complaining!

    The wikis that the MODs have set up are nice starts, but I don’t think that just installing MediaWiki really changes the way that information flows from researchers into a centralized DB and back out to the community. I realize we are just at the beginning here, but I’d like to see ways that gene and genome pages are made up of dynamic content that comes from databases of published data, curated content, and automated analyses.

    While I’m dreaming, what happens when we want to move away from a single-gene-oriented focus? What if we want to assign function or attributes to a gene cluster? How can we build systems that aggregate all the associated information from the different underlying gene records? Maybe everything emits RSS and we build giant aggregators. LSID enabled of course… =)

  4. This is an excellent discussion and I’m glad to see these points being pursued further. The wiki sites of some organism databases are a good start, and I hope this will spread. Jason Stajich makes a good point with the argument that users shouldn’t have to be bioinformaticists in order to improve a gene set. This has been a repeated problem over the years for many genomes: users who are experts on a particular gene (but not necessarily on bioinformatics methods) cannot fix “their” gene when a new genome appears with that gene mis-annotated.
    In addition to fixing annotation, though, my article was pointing out that we need a way to maintain and distribute genome annotation. Today, I would argue, a large majority of scientists use GenBank/EMBL/DDBJ as their “go to” resource for genes for any species – I go there myself, all the time. I’m hoping that a new type of resource – perhaps a wiki – will emerge to provide up-to-date annotation. Perhaps it would need a gatekeeper, but Wikipedia has worked pretty amazingly well with very minimal gatekeeping.

  5. I agree, content aggregation is going to be key. The other component that I think will be critical is formal semantics. These need to be captured from user input without exposing the user to the underlying complexity, yet they need to be integrated into a formal framework (such as RDF) that is computable, i.e., can be reasoned over, for example using semantic queries formulated in SPARQL (there are many examples on the web), or using UIs such as Swoogle. There are efforts underway to marry the semantic web to wikis, such as Semantic MediaWiki, which aim to have the best of both worlds, i.e., a wiki interface for editing data or annotation, and an RDF interface for machines (check out the Export RDF feature). One of the things I noted in the genome wiki examples is that the user is not supposed to edit the (machine-generated) formally structured parts, but instead uses “Notes” sections. I believe this needs to be integrated, not separated. The WikiProteins demo that apparently has been mentioned in a news article in Nature may provide some inspiration as to how this might work. Exposing as RDF the obvious metadata that is now embedded in HTML, using microformats or not, may become less and less difficult; for example, browser plugins such as PiggyBank use generic and custom screen scrapers to enable faceted browsing on (almost) ordinary web pages (anyone up for writing a custom PiggyBank screen scraper for a genome or taxonomic database?). In addition, there are emerging W3C standards such as GRDDL that might help in the transition. Just yesterday I listened to an interesting talk in this regard by Harry Halpin, a co-chair of the GRDDL working group.

    BTW, interestingly, very similar ideas about communal content editing, and portals that enable it, are being discussed in the biodiversity field, for example in the highly ambitious Encyclopedia of Life project (see, for example, the draft technical vision).

    As an idea of where this might all lead to, imagine ordinary users could program the web of genome annotations, the tree of life, and ontologies of gene function, anatomy, and disease, using an interface like Yahoo Pipes, just as it allows them now to program the RSS-enabled web.

    As for community annotation of genomes, has anyone who reads this tried to poke around on SEED? I’ve used this extensively when I reconstructed (part of) the metabolic network of T. maritima in the context of a graduate class at UCSD taught by B. Palsson, and I found it the most useful of all genome annotation sources for Thermotoga.
