The Hyphal Tip

Digesting the fungal genomes


A word about databases

Posted on July 27th, 2008 by Jason Stajich · 1 Comment

Report concludes that a fungal genome database is of “the highest priority”.

This is the title as listed in PubMed for this article from Future Medicine about the AAM report on charting future needs and avenues of research on the fungal kingdom.

A comprehensive database for information about fungi, starting at least with systematic collections of genomic and transcript data, is highlighted as a major need.  Really, any sort of new database effort should strive to be more comprehensive and include genetic and population data (alleles, strains) and information like protein-protein and protein-nucleic acid interactions (as Pedro mentioned). But on top of that, it needs to be comparative so that information from systems that serve as great models can be transferred to other fungal systems that are being studied for their role as pathogens or for their interactions in the environment.

Affordable next-gen sequencing will allow us to obtain genome and transcript sequence for basically all species or strains of interest.  Researchers with no bioinformatics support in their lab will likely be able to outsource this to a company or campus core facility.  But how can they easily map the collective information about genes, proteins, and pathways onto this new data?  And how can that mapping be a dynamic system that updates as new information is published and curated in other systems?

I think this has to be the future beyond setting up an SGD, CGD, etc. for every system.  The individual databases are useful for a large enough community where there are curators (and funding), but we will have to move to a more modular system in the future (aspects of which are in GMOD) that can have both an individual focus on a specific species/clade and a more comprehensive view that is comparable across the kingdom.  There are 100+ fungal genomes, but the community size for some of them is in the dozens of labs or fewer. How can they take advantage of the new resources without an existing infrastructure of curators?  Their systems serve an important need in a research aim, but how can discoveries there make their way back into the datastream of other systems?

I see several ways one would interact with a system that provided single-genome tools as well as a framework for comparative information.  At a gene level, one might be looking for all information about a specific gene, based on sequence similarity searches, or starting with a cloned gene in one species. This would be something akin to Phylofacts or precomputed Orthogroups for defining a gene, but with more information about function, linked in from all sources.  So a comparative resource, but also one tapping into curated and literature-mined data.

At a genome level, one might want to do whole genome comparisons of gene content, either from evolutionarily defined gene families (gene family size change) or at a functional level.  To start out with, each gene/protein would already need a systematic functional mapping.  This could be as simple as running InterProScan on every protein, then expanded to find Orthogroups (or OrthoMCL orthologs) and transfer function from model systems, and finally, even more advanced, refined with better classification from tools like SIFTER.
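As a rough illustration of the middle step (transferring function from model systems through orthogroups, and counting family sizes per genome), here is a minimal Python sketch. Everything in it is an assumption for illustration: the orthogroup IDs, species labels, gene IDs, and function strings are made up, and a real system would read orthogroups from something like OrthoMCL output rather than a hard-coded dictionary.

```python
from collections import Counter, defaultdict

# Hypothetical orthogroups: group ID -> list of (species, gene ID) members.
ORTHOGROUPS = {
    "OG0001": [("model_yeast", "model_g1"), ("new_isolate", "iso_g101")],
    "OG0002": [("model_yeast", "model_g2"), ("new_isolate", "iso_g202"),
               ("new_isolate", "iso_g203")],
    "OG0003": [("new_isolate", "iso_g300")],  # no model-system member
}

# Hypothetical curated functions for genes in a model system.
CURATED = {
    "model_g1": "actin-like",
    "model_g2": "sugar transporter",
}


def transfer_function(orthogroups, curated):
    """Map each uncurated gene to (function, source gene, orthogroup).

    A gene only receives a function if its orthogroup contains at least
    one curated gene; the first curated member is used as the source.
    """
    transferred = {}
    for group_id, members in orthogroups.items():
        sources = [gene for _, gene in members if gene in curated]
        if not sources:
            continue  # nothing to transfer from in this group
        source_gene = sources[0]
        for _, gene in members:
            if gene not in curated:
                transferred[gene] = (curated[source_gene], source_gene, group_id)
    return transferred


def family_sizes(orthogroups):
    """Count members per species in each orthogroup (gene family size)."""
    sizes = defaultdict(Counter)
    for group_id, members in orthogroups.items():
        for species, _ in members:
            sizes[group_id][species] += 1
    return sizes


if __name__ == "__main__":
    for gene, (function, source, group) in sorted(
            transfer_function(ORTHOGROUPS, CURATED).items()):
        print(f"{gene}\t{function}\tfrom {source} via {group}")
    for group, counts in sorted(family_sizes(ORTHOGROUPS).items()):
        print(group, dict(counts))
```

Keeping the source gene and orthogroup alongside each transferred function means the annotation can be traced back and refreshed when the model-system record changes.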

Interlinked with these orthologous and paralogous gene sets would be anchors for analyses of chromosomal synteny and even comparative assembly, including tools like Mercator.  Certainly tools for all of this exist, but making them more pluggable for different sets of species would be an important additional component.

At a utility level, gene annotation and functional mapping of all this information should be possible. I would imagine a researcher could upload the sequence assembly they received from the core facility, and the system would generate multiple gene predictions, annotate the genes, and link these genes within the known orthogroups of the system (keeping these genes private if desired).  Presumably this sort of thing would be easier as a standalone in-house tool for the researcher, but web services could also be the place for this.

For fungal-sized genomes this amount of data is not too extreme.  Things like a genome browser, BLAST, etc. should all work out of the box based on the basic builds.

On the DIY and community annotation front, there would also need to be a layer of community-derived annotation that could be layered on all these systems.  I would imagine this to be both for gene structure annotation (genome annotation) and functional annotation (protein X does Y based on experiment Z, here is the journal reference).  I think aspects of this would be visible and auditable (tracked), but maybe not blessed as official until a curator could oversee these inputs. In my mind, whether this is in a Wiki per se or just a new system that allows community input is less important than having it be a) structured (not a bunch of free text), b) tracked and versionable, and c) easy for researchers to input so that the knowledge is captured, even if it has to be reorganized later on.
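To make the three requirements concrete (structured, tracked and versionable, not official until a curator signs off), here is a minimal Python sketch of what one such record and its history might look like. The class names, field names, and example values are all hypothetical and are not taken from GMOD or any existing system.

```python
from dataclasses import dataclass, field, replace
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class FunctionalAnnotation:
    """One structured claim: protein X does Y based on experiment Z."""
    gene_id: str
    function: str
    evidence: str       # the experiment supporting the claim
    reference: str      # journal reference for the experiment
    submitter: str
    submitted_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    approved_by: Optional[str] = None  # set only once a curator signs off


class AnnotationHistory:
    """Append-only version list for one gene's community annotation."""

    def __init__(self) -> None:
        self._versions: List[FunctionalAnnotation] = []

    def submit(self, annotation: FunctionalAnnotation) -> int:
        """Record a new version and return its 1-based version number."""
        self._versions.append(annotation)
        return len(self._versions)

    def approve(self, version: int, curator: str) -> int:
        """Append a curator-approved copy; earlier versions stay intact."""
        approved = replace(self._versions[version - 1], approved_by=curator)
        return self.submit(approved)

    def versions(self) -> List[FunctionalAnnotation]:
        """Full audit trail, oldest first."""
        return list(self._versions)

    def official(self) -> Optional[FunctionalAnnotation]:
        """Latest curator-approved version, or None if nothing is blessed."""
        for annotation in reversed(self._versions):
            if annotation.approved_by is not None:
                return annotation
        return None
```

Because approval appends a new version rather than editing the old one, the community submission stays visible and auditable even after a curator has reviewed it.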

Seems like a lot of work to be done, but really many of these things already exist through what the GMOD project has built.  There are many loose ends and software that doesn’t fully meet these needs, but I think the important concept is that these are all general solutions that will be of benefit to most communities, not just the fungal ones.  One lingering question I always have when approaching genomic data that will be dynamic: what, if any, of this makes its way into GenBank?  How is this sort of thing banked so that it can be captured, and does the improved functional or gene structure annotation ever make its way into the repository databases to correct and improve what has already been submitted there?

Categories: opinion

1 response so far ↓


  • July 31, 2008 at 8:31 pm Pedro Beltrao
    STRING actually comes close to this concept of a multi-species database. I assume they can just add genomes to STRING and get all the annotations and predictions transferred to it (it could be a bit more open ;). One problem would be how to combine the current single-species databases (e.g. SGD) with the multi-species database so that information that is added by curators or large datasets to the model species is accessible to other species. This multi-species database could also serve as a platform. Say, for example, that someone develops an excellent predictor for some feature: it would be great to be able to plug it in to the database and let people use it. That would create a little market for predictors that could be used by anyone else. Something similar to what Cytoscape is doing at network level analysis, but genome-centric. (I am ignoring computation time costs :)
  • July 31, 2008 at 8:36 pm Deepak
    Pedro, that's the key. The database must become a service, otherwise you're just limiting yourself
  • July 31, 2008 at 10:35 pm Lars Juhl Jensen
    The issue of computation time is precisely the problem. To do proper transfer of predictions you need to detect orthology, which requires complete genome comparisons to be done. In other words, to have a web service that transfers annotation to one gene in one genome, that service would need as input the entire genome that the gene is a part of. After that, one would need to do hours if not days of computation on running BLAST searches. This is why we make heavy use of precomputed results in STRING.
  • July 31, 2008 at 10:43 pm Lars Juhl Jensen
    Concerning the possibility of plugging predictors into genome browsers, this is pretty much the idea of the Distributed Annotation System (DAS). It allows you to set up any predictor that you make with an interface that enables people to view its predictions within, for example, the Ensembl genome browser. Still, it obviously requires that your prediction service is fast enough to be able to handle the load.
  • July 31, 2008 at 10:46 pm Deepak
    Lars, what are the bottlenecks outside of computational performance of the prediction service? Are there any you could point me to?
  • July 31, 2008 at 11:08 pm Todd Harris
    It's clear at this point that there won't be funding for a database of every organism (or even groups of closely related species). How can we recreate what MODs have done on a smaller scale? First, we need simple, out-of-the-box tools that allow even the smallest labs to easily parse, analyze, and present/host common data types. Bio DBs need to implement web services as Deepak points out, and we need standard frameworks for presentation. And as Lars notes, we need precomputation to unify resources.
  • July 31, 2008 at 11:54 pm Lars Juhl Jensen
    Deepak, do you mean bottlenecks with respect to making a web service or bottlenecks related to integration of multiple genomes in general?
  • July 31, 2008 at 11:57 pm Deepak
    Making a web service
  • August 1, 2008 at 12:09 am Lars Juhl Jensen
    The bottlenecks depend largely on how the data are distributed. If all the data relevant to a given service are already gathered on a single server, then I see few bottlenecks. But if the data are distributed, then I would expect major trouble in terms of network latency and in some cases bandwidth. I will give a couple of examples as separate comments (due to FriendFeed length restrictions).
  • August 1, 2008 at 12:10 am Deepak
    Thanks!!!
  • August 1, 2008 at 12:15 am Lars Juhl Jensen
    Protein interaction data: Imagine that the interaction databases (BioGRID, IntAct, MINT, DIP, etc.) each made their interactions available via web services. To get the relevant set of interactions, I would now have to query each of these databases with the gene of interest as well as query each of them with the orthologous genes in every other genome. This can easily amount to thousands of web service requests just to fetch one kind of data.
  • August 1, 2008 at 12:18 am Lars Juhl Jensen
    Text mining: Unless the corpus of text to be mined resides on the same server as the web service that performs the text mining, you would have to send all of Medline across the network to do a query. Even if it resides locally, you need precomputed results to get decent speed - at least in the form of an index.
  • August 1, 2008 at 12:21 am Deepak
    Got it. That's a pretty clear example. But if you had the appropriate infrastructure and the ability to scale it, in theory, you could achieve that, correct? The datasets and algorithms are large, but not prohibitively large.
  • August 1, 2008 at 12:22 am Deepak
    I completely agree on the index bit. Would be nice to have something spidering the life science web
  • August 1, 2008 at 12:24 am Lars Juhl Jensen
    Orthology detection: This requires that all genomes are stored (or at least cached) on a single server, as one would otherwise have to send entire genomes around for every web service request. Moreover, precomputation is clearly needed to get a response time that is not on the scale of hours or days. Just for the record, I have burned on the order of 60 CPU years on precomputing alignments for the upcoming version 8 of STRING.
  • August 1, 2008 at 12:26 am Deepak
    Lars ... we should talk about this offline. I have a million questions
  • August 1, 2008 at 12:31 am Lars Juhl Jensen
    I think the trick is to not try to make the services too "atomic". One would need some sort of caching/precomputing meta-services to make this scale. For example, one could imagine having an orthology metaserver that gathers the sequences from genome database web services and automatically updates the precomputed results when the genome annotation changes. Similarly, a meta-service could gather all the relevant interaction data, assign quality scores, and transfer them by orthology.
  • August 1, 2008 at 6:57 am Deepak
    Caching/pre-computation will be critical if you want performance. I can't see it working any other way, even if you have all the iron in the world
  • August 1, 2008 at 8:03 am Todd Harris
    Computation is handled locally. For example, a lab predicts ortholog/paralog assignments for their genome against genomes of interest. Unified IDs, web services, and generic presentation allow layering of data via mashup.
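The caching/precomputing meta-service idea from the thread above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name, the two callables (`fetch_version` and `compute_orthologs`), and the version strings are hypothetical stand-ins for a genome database web service and an expensive orthology computation, not the interface of STRING or any real service.

```python
from typing import Callable, Dict, List, Tuple

OrthologTable = Dict[str, List[str]]  # gene ID -> orthologous gene IDs


class OrthologyMetaService:
    """Serve precomputed orthologs; recompute only when annotation changes."""

    def __init__(self,
                 fetch_version: Callable[[str], str],
                 compute_orthologs: Callable[[str], OrthologTable]) -> None:
        # fetch_version: cheap call returning a genome's annotation version.
        # compute_orthologs: expensive whole-genome comparison (hours/days).
        self._fetch_version = fetch_version
        self._compute = compute_orthologs
        self._cache: Dict[str, Tuple[str, OrthologTable]] = {}
        self.recomputations = 0  # counts how often the expensive step ran

    def orthologs(self, genome: str, gene: str) -> List[str]:
        """Answer from cache unless the genome's annotation version moved."""
        version = self._fetch_version(genome)
        cached = self._cache.get(genome)
        if cached is None or cached[0] != version:
            # Stale or missing: redo the whole-genome precomputation once.
            self._cache[genome] = (version, self._compute(genome))
            self.recomputations += 1
        return self._cache[genome][1].get(gene, [])
```

Each request still costs one cheap version check, but the whole-genome comparison runs once per annotation release rather than once per query. A production service would more likely refresh in the background than block a request on the recomputation.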
