(re)Annotating GenBank

Tom Bruns, Martin Bidartondo, and 250 others sent a letter to Science describing the current problems with fixing annotation in GenBank. There is an entertaining accompanying news article that interviews several people about the problem of updating the annotation and species assigned to sequences in the database. This is a particular problem for mycologists because many fungi found through metagenomic approaches are identified only by their molecular sequences, so a sequence associated with the wrong species makes it difficult to study community composition. The problem is not limited to fungi by any means, but recent reports find that as many as 20% of fungal Internal Transcribed Spacer (ITS) sequences are attributed to the wrong species.

There’s a nice quote in the news article from Steven Salzberg talking about the difficulties in getting sequences, especially from big centers, updated. I’m sure he is thinking of many examples, like reclassifying some Drosophila sequence traces.

The issue stems from the rules of engagement for GenBank. The databases are set up so that only the original submitting author (or their designated proxy) may update a sequence record. This is because the database is an archive of sequences, and it would rightly be a bad idea if anyone could monkey around with the primary data. The Third Party Annotation (TPA) database is intended to be a mechanism for adding annotation (such as gene annotation to a genome), but submissions must be backed up by experimental data. This experimental data requirement can be difficult to meet. How do you reconcile an improper species designation for an ITS sequence from a clone library from a soil sample? And how do you know it is wrong? Through phylogenetic comparison, which may include some closely related species, but computational analysis alone is not enough to submit a record to TPA. As more work is done to “barcode” species like fungi, and as automated methods are applied to build phylogenies and detect inconsistencies, it will be important to correctly associate sequences with organisms.
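To make the idea of automatically detecting inconsistencies concrete, here is a toy Python sketch. It is not a phylogenetic analysis: it swaps in a much cruder nearest-neighbour check on pre-aligned sequences, and flags any record whose species label matches none of its close neighbours. The accessions, species names, sequences, and the `k` and `min_identity` thresholds are all made up for illustration.

```python
from collections import Counter


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def flag_suspect_labels(records, k=2, min_identity=0.9):
    """Flag records whose species label matches none of their close neighbours.

    records: dict mapping accession -> (species_label, aligned_sequence).
    Returns a list of (accession, current_label, neighbour_consensus_label).
    """
    suspects = []
    for acc, (species, seq) in records.items():
        # Score every other record by identity to this one, best first.
        scored = sorted(
            (
                (identity(seq, other_seq), other_species)
                for other_acc, (other_species, other_seq) in records.items()
                if other_acc != acc
            ),
            reverse=True,
        )
        # Keep the labels of the top-k neighbours that are similar enough.
        close = [sp for score, sp in scored[:k] if score >= min_identity]
        if close and species not in close:
            consensus = Counter(close).most_common(1)[0][0]
            suspects.append((acc, species, consensus))
    return suspects


# Hypothetical toy data: TOY005 carries a "Species B" label but is
# identical in sequence to a "Species A" record.
records = {
    "TOY001": ("Species A", "ACGTACGTACGTACGTACGT"),
    "TOY002": ("Species A", "ACGTACGTACGTACGTACGA"),
    "TOY003": ("Species B", "TTGCATGCATTTGCATGCAA"),
    "TOY004": ("Species B", "TTGCATGCATTTGCATGCAT"),
    "TOY005": ("Species B", "ACGTACGTACGTACGTACGT"),
}

for acc, label, consensus in flag_suspect_labels(records):
    print(f"{acc}: labelled {label!r}, but nearest neighbours are {consensus!r}")
```

A check like this can only tell you that something looks inconsistent. It cannot tell you which record is the wrong one, and under the current rules a flag like this would not be enough on its own to get the record changed.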

Similarly, if you want to correct a gene model that someone else annotated, you must provide experimental evidence; simply aligning ESTs (already sequenced and in GenBank) to identify missing or improperly called exons is not enough, even though the original models might have come from a sequencing center that used only computational predictions to annotate! The TPA:Inferential category is supposed to be a place for this kind of thing, but we have still had problems updating mis-annotated genes when publishing a paper on a gene family (ironically, one from a genome project that we are also part of).

While a wiki is not really what I would propose to solve this problem, what is happening instead is the creation of separate databases that link to and reannotate the primary data stored in GenBank. Such databases and projects, like the European LSU database, have been ongoing for a while. Hopefully the mycological herbaria will also become centers of this work, where vouchered specimens or cultures, images, micrographs, and DNA sequences can all be maintained and used to fully describe a specimen.

