Tag Archives: database

Updated Cryptococcus serotype A annotation

SEM of clamp cell, yeast cells and sexual spore chains. Courtesy R. Velagapudi & J. Heitman

A new and improved annotation of Cryptococcus neoformans var. grubii strain H99 (serotype A) has been made available in GenBank and on the Broad Institute website. This update is a collaboration between several groups providing data and analyses and the genome annotation team at the Broad Institute.

Some changes noted by the Broad Institute include:

“This release of gene predictions for the serotype A isolate Cryptococcus neoformans var. grubii H99 is based on a new genomic assembly provided by Dr. Fred Dietrich at the Duke Center for Genome Technology. The new assembly consists of 14 nuclear chromosomes and a single 21 KB mitochondrial chromosome, and has resulted in a reduction of the estimated genome size from 19.5 to 18.9 Mb. Improvements in the assembly and in our annotation process have resulted in a set of 6,967 predicted protein products, 335 fewer than the previous release.”

Fungal P450s

A paper (Park et al, BMC Genomics) from the Fungal Bioinformatics Lab at Seoul National University in South Korea describes their new “Fungal P450 Database”. The database contains sequences, names, and genome links for P450s (or Cytochrome P450s) identified by similarity and phylogenetic classification from genome annotations. The group is using most of the annotated genomes in GenBank (and I think some from elsewhere) from bacteria, fungi, animals, and plants.

I find the current nomenclature for this family of genes confusing, but I am sure it has been a difficult job, wrangled in large part by David Nelson (who also has a new paper on the CYPome of Aspergillus nidulans). I have found it difficult to follow the logic for naming these members, as it didn’t seem to be particularly phylogenetic at first, although I think that has improved. However, a stable and solid reference database for naming these gene members, and for mapping new members in through straightforward analyses, is an essential resource. Park et al have made great inroads to that end and it may indeed meet those needs (I am cautious about saying it is solved without more exploration, or some sense of whether it is intended as, or will be taken up as, just that sort of reference by the P450 community). It has seemed to me that a proper phylogenetic (or really, a phylogenomic) approach is essential for naming the P450 member genes as orthologous or paralogous members across multiple species. The group has defined their classes as clusters of homologs (e.g. Mg004 is a Magnaporthe grisea gene in Cluster 9.1) and also linked these to the Nelson nomenclature (CYP68E1). By defining orthologous family members we can make more interpretations about how to transfer functional annotation in a truly phylogenomic context.

The overall family is large and diverse across many different species (they report 4538 fungal P450s grouped into 141 clusters/sub-families from 68 species). The fungi tend to have very large families in some clades (e.g. some filamentous fungi), so I think this type of systematic and searchable system with stable identities for clusters is an essential resource. I know I’m going to give it a whirl. We have a couple of cool findings about changes in the P450 families from comparisons of the basidiomycete Coprinopsis and related species that I hope we’ll be able to better interpret with this additional phylogenomic naming of gene family members.

Jongsun Park, Seungmin Lee, Jaeyoung Choi, Kyohun Ahn, Bongsoo Park, Jaejin Park, Seogchan Kang, Yong-Hwan Lee (2008). Fungal cytochrome P450 database BMC Genomics, 9 (1) DOI: 10.1186/1471-2164-9-402

A word about databases

Report concludes that a fungal genome database is of “the highest priority”.

This is the title as listed in PubMed for this article from Future Medicine about the AAM report on charting future needs and avenues of research on the fungal kingdom.

A comprehensive database for information about fungi, starting at least with systematic collections of genomic and transcript data, is highlighted as a major need. Really, any sort of new database effort should strive to be more comprehensive and include genetic and population data (alleles, strains) and information like protein-protein and protein-nucleic acid interactions (as Pedro mentioned). But on top of that, it needs to be comparative so that information from systems that serve as great models can be transferred to other fungal systems that are being studied for their role as pathogens or for their interactions in the environment.

Affordable next-gen sequencing will allow us to obtain genome and transcript sequence for basically all species or strains of interest.  Researchers with no bioinformatics support in their lab will likely be able to outsource this to a company or campus core facility.  But how can they easily map in the collective information about genes, proteins, and pathways onto this new data?  And have it be a dynamic system that can update as new information is published and curated in other systems.

I think this has to be the future beyond setting up an SGD, CGD, etc for every system. The individual databases are useful for a large enough community where there are curators (and funding), but we will have to move to a more modular system in the future (aspects of which are in GMOD) that can have both an individual focus on a specific species/clade and a more comprehensive view that is comparable across the kingdom. There are 100+ fungal genomes, but the community size for some of them is in the dozens of labs or less. How can they take advantage of the new resources without an existing infrastructure of curators? Their systems serve an important need in a research aim, but how can discoveries there make their way back into the datastream of other systems?

I see several ways one would interact with a system that provided single-genome tools as well as a framework for comparative information. At a gene level, one might be looking for all information about a specific gene, based on sequence similarity searches or starting with a cloned gene in one species. Something akin to PhyloFacts or precomputed Orthogroups for defining a gene, but with more information about function linked in from all sources. So a comparative resource, but also one tapping into curated and literature-mined data.

At a genome level, one might want to do whole-genome comparisons of gene content, either by evolutionarily defined gene families (gene family size change) or at a functional level. To start out with, each gene/protein would need a systematic functional mapping. This could be as simple as running InterProScan on every protein, expanded to find Orthogroups (or OrthoMCL orthologs) and transfer function from model systems, and finally, even more advanced, refined further with classification tools like SIFTER.
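To make the "transfer function from model systems" step a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the gene and orthogroup identifiers are made up, the `species|gene` naming scheme is hypothetical, and the term labels are placeholders rather than real annotations. It is not how Orthogroups, OrthoMCL, or SIFTER work internally; it only shows the simplest possible rule (copy terms from curated members of the same orthogroup, keeping track of where they came from) along with the per-species family size counts one would want for gene content comparisons.

```python
from collections import defaultdict

# Hypothetical orthogroup membership; IDs use a made-up "species|gene" scheme.
orthogroups = {
    "OG0001": ["model|g1", "sp1|g10", "sp2|g20"],
    "OG0002": ["model|g2", "sp2|g21", "sp2|g22"],
    "OG0003": ["sp1|g11", "sp2|g23"],  # no curated member, nothing to transfer
}

# Hypothetical curated terms, known only for genes in the model system.
curated = {
    "model|g1": {"TERM:A"},
    "model|g2": {"TERM:B", "TERM:C"},
}


def transfer_annotations(orthogroups, curated):
    """Propose terms for uncurated genes from curated members of their orthogroup.

    Each proposal records the term, the source gene, and the orthogroup so the
    inference can be audited later.
    """
    proposed = defaultdict(set)
    for og_id, members in orthogroups.items():
        sources = [g for g in members if g in curated]
        for gene in members:
            if gene in curated:
                continue  # already has curated terms
            for src in sources:
                for term in curated[src]:
                    proposed[gene].add((term, src, og_id))
    return dict(proposed)


def family_sizes(orthogroups):
    """Count members per species for each orthogroup (gene family size)."""
    sizes = {}
    for og_id, members in orthogroups.items():
        counts = defaultdict(int)
        for gene in members:
            species = gene.split("|", 1)[0]
            counts[species] += 1
        sizes[og_id] = dict(counts)
    return sizes


proposed = transfer_annotations(orthogroups, curated)
sizes = family_sizes(orthogroups)
```

Keeping the source gene and orthogroup with every proposed term is the design choice that matters here: a transferred annotation is only an inference, and a curator (or a more careful phylogenomic tool) needs to be able to see what it was based on.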

Interlinked with these orthologous and paralogous gene sets would be anchors for analyses of chromosomal synteny and even comparative assembly, including tools like Mercator. Certainly tools for all of this exist, but making them more pluggable for different sets of species would be an important additional component.

At a utility level, the gene annotation and functional mapping of all this information should be possible. I would imagine a researcher could upload the sequence assembly they received from the core facility and the system would generate multiple gene predictions, annotate the genes, and link these genes within the known orthogroups of the system (keeping these genes private if desired). Presumably this sort of thing would be easier as a standalone in-house tool for the researcher, but web services could also be the place for this.

For fungal-sized genomes this amount of data is not too extreme. Things like a genome browser, BLAST, etc should all roll out of the box based on the basic builds.

On the DIY and community annotation front, there would also need to be a layer of community-derived annotation that could be layered on all these systems. I would imagine this to be for both gene structure annotation (genome annotation) and functional annotation (protein X does Y based on experiment Z, here is the journal reference). I think aspects of this would be visible and auditable (tracked), but maybe not blessed as official until a curator could oversee these inputs. In my mind, whether this is in a Wiki per se or just a new system that allows community input is less important than having it be a) structured (not a bunch of free text), b) tracked and versionable, and c) easy for researchers to input so that the knowledge is captured, even if it has to be reorganized later on.
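As a rough illustration of what "structured, tracked and versionable" could mean in practice, here is a small Python sketch. The class names, fields, and identifiers are all hypothetical and not taken from GMOD or any existing system; the point is only that each submission is a structured record, the history is append-only, and nothing is "official" until a curator approves a specific version.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Revision:
    """One structured, attributable version of a community annotation."""
    gene_id: str      # hypothetical identifier scheme
    kind: str         # "function" or "gene_structure"
    statement: str    # protein X does Y
    evidence: str     # based on experiment Z
    reference: str    # journal reference, e.g. a DOI
    submitter: str


@dataclass
class AnnotationRecord:
    """Append-only revision history for one gene; nothing is overwritten."""
    gene_id: str
    revisions: List[Revision] = field(default_factory=list)
    approved_version: Optional[int] = None  # set only when a curator signs off

    def submit(self, revision: Revision) -> int:
        """Store a new revision and return its 1-based version number."""
        if revision.gene_id != self.gene_id:
            raise ValueError("revision is for a different gene")
        self.revisions.append(revision)
        return len(self.revisions)

    def approve(self, version: int) -> None:
        """Mark an existing version as the curator-approved one."""
        if not 1 <= version <= len(self.revisions):
            raise ValueError("no such version")
        self.approved_version = version

    def latest(self) -> Optional[Revision]:
        """Most recent community submission, approved or not."""
        return self.revisions[-1] if self.revisions else None

    def official(self) -> Optional[Revision]:
        """The curator-approved revision, if any."""
        if self.approved_version is None:
            return None
        return self.revisions[self.approved_version - 1]


# Made-up example: two labs annotate the same gene, a curator approves the first.
record = AnnotationRecord(gene_id="gene_0001")
v1 = record.submit(Revision("gene_0001", "function", "putative cellulase",
                            "sequence similarity", "doi:placeholder", "lab_a"))
v2 = record.submit(Revision("gene_0001", "function", "cellulase, activity confirmed",
                            "enzyme assay", "doi:placeholder", "lab_b"))
record.approve(v1)
```

Both the latest community view and the curator-approved view stay available, so the knowledge is captured as soon as it is entered and can be reorganized or blessed later.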

Seems like a lot of work to be done, but really many of these things already exist through what the GMOD project has built. There are many loose ends and software that doesn’t fully meet these needs, but I think the important concept is that these are all general solutions that will be of benefit to most communities, not just the fungal ones. One lingering question I always have when approaching genomic data that will be dynamic: what, if any, of this makes its way into GenBank? How is this sort of thing banked so that it can be captured, and does the improved functional or gene structure annotation ever make its way into the repository databases to correct and improve what has already been submitted there?

AAM Releases “The Fungal Kingdom” Report

The American Academy of Microbiology has released a report (PDF and archived on fungalgenomes.org) on the Fungal Kingdom outlining the importance of research in the kingdom and recommending several priority areas for future research.

One recommendation that makes the top of the list is an integrated database for fungal genomes, something we’re keenly interested in seeing happen. This sort of centralized repository of functional annotation, literature links, and genome sequences and annotation is critical given the 150+ genomes that are available or on their way. Systematic re-annotation with consistent tools, comparative analyses and gene predictions, and linking gene sequences by homology and ortholog predictions are critical components of fully utilizing the genomic data that has been produced for the fungi and other organisms.

Trichoderma reesei genome paper published

The Trichoderma reesei genome paper was recently published in Nature Biotechnology by Diego Martinez at LANL with collaborators at JGI, LBNL, and others. This fungus was chosen for sequencing because it was found on canvas tents eating the cotton material, suggesting it may be a good candidate for degrading cellulose plant material as part of cellulosic ethanol or other biofuels production. The fungus also has starring roles in industrial processes like making stonewashed jeans due to its prodigious cellulase production.

The most surprising finding from the paper is that there are so few members of some of the enzyme families even though this fungus is able to generate enzymes with so much cellulase activity. The authors found that there is not a significantly larger number of glycoside hydrolases, a collection of carbohydrate-degrading enzymes great for making simple sugars out of complex ones. In fact, several plant pathogens compared (Fusarium graminearum and Magnaporthe grisea) and the sake-fermenting Aspergillus oryzae all have more members of this family than T. reesei does. T. reesei has almost the fewest copies (36) of a carbohydrate-binding module (CBM) of any of the filamentous ascomycete fungi. They used the CAZy (Carbohydrate-Active enZymes) database, which has done a fantastic job building up profiles of different enzymes involved in carbohydrate degradation, binding, and modification.

Whether T. reesei is really the best cellulose-degrading fungus is definitely an open question. That it works well in the industrial culture in which it has been used is important, but there may be other species of fungi with improved cellulase activity that may in fact have many more copies of cellulases. So it will be good to add other fungi to the mix with quantitative information about degradation to try and glean which combinations of enzymes and activities are most important.

One technical note. The comparison of copy number differences employed in the paper is a simple enough chi-squared test. Work that I’ve done with Matt Hahn and others includes a gene family size comparison approach that also takes into account phylogenetic distances and assumes a birth-death process of gene family size change. It would be great to evaluate the copy number differences through this or other approaches that just evaluate gene trees for these domains, to see where the differences are significant and if they can be polarized to a particular branch of the tree.
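For anyone who hasn’t run one of these by hand, here is a minimal Python sketch of a chi-squared test on gene family copy number between two genomes. The counts are invented for illustration and are not from the paper, and I am assuming the simplest setup: a 2x2 table of "genes in the family" versus "all other genes" for each genome, Pearson's statistic without continuity correction. The p-value uses the fact that with one degree of freedom the chi-squared tail probability equals `erfc(sqrt(chi2 / 2))`, so only the standard library is needed. Note that this ignores phylogeny entirely, which is exactly the limitation a birth-death approach addresses.

```python
import math


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for [[a, b], [c, d]].

    Returns (chi2, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        raise ValueError("a row or column total is zero")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: family members and total predicted genes per genome.
family_a, total_a = 20, 10000
family_b, total_b = 50, 10000

chi2, p_value = chi_squared_2x2(
    family_a, total_a - family_a,
    family_b, total_b - family_b,
)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4g}")
```

With many families tested at once, the p-values would also need a multiple-testing correction, which is left out of this sketch.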

So will this genome sequence lead to cheaper, better biofuel production? Certainly it provides an important toolkit to start systematically testing individual cellulase enzymes. It’s hard to say how fast this will make an impact, but JBEI and a host of other research groups and biotech companies are going to be able to systematically test the utility of these individual enzymes.

There is also work by other groups on the evolution of these Hypocreales fungi, trying to better define when biotrophic and heterotrophic transitions occurred, in order to sample fungi with different lifestyles that might have different cellulase enzymes that have not yet been observed. Defining the relationships of these fungi, and when and how many times lifestyle transitions occurred, in order to choose the most diverse fungi may be an important part of discovering novel enzymes.

Also see

Martinez, D., Berka, R.M., Henrissat, B., Saloheimo, M., Arvas, M., Baker, S.E., Chapman, J., Chertkov, O., Coutinho, P.M., Cullen, D., Danchin, E.G., Grigoriev, I.V., Harris, P., Jackson, M., Kubicek, C.P., Han, C.S., Ho, I., Larrondo, L.F., de Leon, A.L., Magnuson, J.K., Merino, S., Misra, M., Nelson, B., Putnam, N., Robbertse, B., Salamov, A.A., Schmoll, M., Terry, A., Thayer, N., Westerholm-Parvinen, A., Schoch, C.L., Yao, J., Barbote, R., Nelson, M.A., Detter, C., Bruce, D., Kuske, C.R., Xie, G., Richardson, P., Rokhsar, D.S., Lucas, S.M., Rubin, E.M., Dunn-Coleman, N., Ward, M., Brettin, T.S. (2008). Genome sequencing and analysis of the biomass-degrading fungus Trichoderma reesei (syn. Hypocrea jecorina). Nature Biotechnology DOI: 10.1038/nbt1403

(re)Annotating GenBank

Tom Bruns, Martin Bidartondo and 250 others sent a letter to Science describing the current problems with fixing annotation in GenBank. There is an entertaining accompanying news article that interviews several people about the problem of updating annotation and species assigned to sequences in the database. A particular problem for mycologists is that many fungi found through metagenomic approaches are only identified through molecular sequences, and having the wrong species associated with a sequence can cause difficulties when studying community ecology composition. This problem is not limited to fungi by any means, but recent reports find that as many as 20% of fungal Internal Transcribed Spacer (ITS) sequences are attributed to the wrong species.

There’s a nice quote in the news article from Steven Salzberg talking about the difficulties in getting sequences, especially from big centers, updated. I’m sure he is thinking of many examples, like reclassifying some Drosophila sequence traces.


Some links


I’ve been too busy to post much these last few days, but here are a few links to some papers I found interesting in my recent browsing.

Schmitt, I., Partida-Martinez, L.P., Winkler, R., Voigt, K., Einax, E., Dölz, F., Telle, S., Wöstemeyer, J., Hertweck, C. (2008). Evolution of host resistance in a toxin-producing bacterial–fungal alliance. The ISME Journal DOI: 10.1038/ismej.2008.19

Levasseur, A. (2008). FOLy: an integrated database for the classification and functional annotation of fungal oxidoreductases potentially involved in the degradation of lignin and related aromatic compounds. Fungal Genetics and Biology DOI: 10.1016/j.fgb.2008.01.004

Shivaji, S., Bhadra, B., Rao, R.S., Pradhan, S. (2008). Rhodotorula himalayensis sp. nov., a novel psychrophilic yeast isolated from Roopkund Lake of the Himalayan mountain ranges, India. Extremophiles DOI: 10.1007/s00792-008-0144-z


Robin reviews a recent Nature paper by Ilan Wapinski et al describing the orthogroups they built from multiple fungal genomes. I’ve been remiss in reviewing the paper myself, but they’ve created an important resource in the SYNERGY tool for orthology identification and a database of orthologs of some ascomycete fungi. I am excited there is a level of interest in the properties of gene duplication and how this may be an important aspect of adaptation and evolution.

The Cornell Mushroom blog has a nice treatment of the maize pathogen and Mexican delicacy Ustilago maydis corn smut.

Chris and Tom took some more Coprinus pictures while I was away from the lab.