Tag Archives: databases

Where can I get orthologs?

There are several databases that include orthology predictions for fungi. These all have pros and cons. Some are more comprehensive and cover many more species. Some offer curated orthology and paralogy assignments, which should be pretty stable. Others are automated, so the groupings and ortholog group IDs change with each iteration.

  • A phylogenetic approach from a Saccharomyces perspective is at PhylomeDB.
  • Fungal Orthogroups is based on the Synergy algorithm from I. Wapinski, formerly of the Regev group at the Broad Institute.
  • Yeast gene order browser (YGOB) for Saccharomyces spp and CGOB for Candida spp.
  • OrthoMCL database is based on whole genomes; it does not have a ton of fungi but is a useful starting set.
  • Ensembl Genomes provides ortholog prediction as part of the Compara pipeline, though there is limited phylogenetic diversity in the current Ensembl Fungal genomes.
  • TreeFam has Saccharomyces cerevisiae and Schizosaccharomyces pombe as the two fungi included in the curated ortholog assignments and phylogenies.
  • SIMAP provides pre-computed similarities among all proteins in UniProt.
  • InParanoid provides pretty comprehensive coverage of 100 available whole genomes, including many fungal genomes which I tried to help select.
  • JGI’s Mycocosm attempts to provide a fungal-focused paralog/gene family look at clusters of genes based on whole genomes.
  • E-Fungi is also an attempt at automated clustering with some fancy webservices logic.
  • Fungal Transcription Factor database focuses just on families of transcription factors.

Some of these tools are better than others in terms of providing downloadable tables. Another problem is which identifiers are used. Many biologists use gene names or locus identifiers, not UniProt/GenPept IDs, to identify genes or proteins of interest. So tools that just cluster UniProt data aren’t as useful as those which refer to the gene or locus names. Also, providing a way to download all the data from a comparison is important for further mining and grouping of the data or cross-referencing with local datasets. Plugging in gene IDs one by one is not really an approach that respects the idea that your users want to ask sophisticated queries.
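To make the identifier problem concrete, here is a minimal Python sketch of relabeling downloaded clusters with locus names. It assumes you have a two-column, tab-delimited mapping table (accession, then locus name); the table layout, the function names, and all IDs shown are made up for illustration and do not correspond to any particular database's download format.

```python
import csv
import io


def load_id_map(handle):
    """Read a tab-delimited table of: accession <TAB> locus name."""
    id_map = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank or short rows and '#' comment/header lines.
        if len(row) < 2 or row[0].startswith("#"):
            continue
        id_map[row[0]] = row[1]
    return id_map


def relabel_clusters(clusters, id_map):
    """Swap accessions for locus names; unmapped members are kept as-is."""
    return {
        group_id: [id_map.get(member, member) for member in members]
        for group_id, members in clusters.items()
    }


# Hypothetical data standing in for a downloaded mapping file and cluster table.
mapping_text = "#accession\tlocus\nACC0001\tLOCUS_A\nACC0002\tLOCUS_B\n"
id_map = load_id_map(io.StringIO(mapping_text))

clusters = {"group1": ["ACC0001", "ACC0002", "ACC0003"]}
relabeled = relabel_clusters(clusters, id_map)
print(relabeled)
```

Keeping unmapped members under their original accession, rather than dropping them, makes it easy to see which proteins still lack a locus name. With a real download you would pass an open file handle instead of the `io.StringIO` stand-in.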

Also – beware that some approaches are very much pairwise comparison lists, whereas others find orthologous groupings.  So if you want to find the Rad59 ortholog from all fungi, it may be easier or harder depending on the source.
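If all you have is a pairwise list, one rough way to get groupings is to treat each pair as a link and collect the connected components. The Python sketch below does this with a small union-find. This is only single-linkage grouping under my own assumptions, not how any of the databases above build their groups, and the gene labels are hypothetical placeholders.

```python
from collections import defaultdict


def group_pairs(pairs):
    """Merge pairwise ortholog links into groups (connected components)."""
    parent = {}

    def find(gene):
        # Register unseen genes as their own root, then walk up to the root.
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # join the two groups

    groups = defaultdict(set)
    for gene in parent:
        groups[find(gene)].add(gene)
    # Sort members and groups so the output order is reproducible.
    return sorted((sorted(members) for members in groups.values()),
                  key=lambda members: members[0])


# Hypothetical pairwise hits written as "species|gene".
pairs = [
    ("spA|rad59", "spB|g1"),
    ("spB|g1", "spC|g7"),
    ("spA|other", "spC|g9"),
]
groups = group_pairs(pairs)
for members in groups:
    print(members)
```

Note that a single spurious pairwise hit will chain two unrelated groups together under this scheme, which is one reason the dedicated group-finding approaches can give different answers from a pairwise source.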

[I may make this a static page in the future to allow for more detailed updating since I know the available resources wax and wane]

Preparing for meeting on Fungal Genome databases

Next week a collection of international scientists with stakes in seeing fungal genome databases evolve and rise to meet the tide of genome data being produced and analyzed from fungi will be meeting in DC.  I am hopeful we’ll come up with some strategies and principles that can guide how this data can be more effectively managed and provided to researchers.  This includes web-based resources, tools, and simply adhering to standardized formats for genome annotations (like GFF3), automated methods for gene ontology associations on newly annotated genomes, and integration of what I expect to be the major amount of data in the years to come: genomic, ChIP, resequencing, and RNA-sequencing results produced by individual labs. This means that integrating (and sharing) individual labs' genomic data with the public data will be key, along with cross-species comparisons of this information.  Tools like Ensembl and UCSC-browser provide great portals for animal data and some plant data, with a few fungi sprinkled in as outgroups. (Okay, UCSC does have some data for close relatives of Saccharomyces in their “other clade” section, which provides data from the Phastcons paper, and Ensembl is now serving up a few fungi.) Tools like Phytozome are attempting to integrate some of the plant genomic data in one place as well.  However, while resources for fungal researchers are available, ranging from highly detailed, manually curated genomes to shotgun-sequenced genomes with automated annotation, the tools to search, compare, and integrate them are still insufficient for what the community needs.

I expect we will also be discussing how databases that incorporate the data from all the genomes can have some centralized aspects so comparative analyses are possible and, importantly, how these types of resources can be sustainably funded by public and private money.

Fungi are important in a wide variety of human and ecosystem processes, from pathogens of agricultural crops and agents of human disease to symbiotic partners of plants and industrial agents in food, chemical, and biofuel production. Studying them requires modern tools, including genomic resources for molecular studies of these species.  The current tools and data are quite useful and important in our current research, but given the increasing amount of new sequence and phenotype data and the need to effectively connect data from different experimental, model, and pathogen study systems, they need to be much improved.

I hope to provide some updates on the ideas we discuss about “Pan-fungal” genome resources, and I will be interested in helping engage a wider audience on how tools and resources should be built to meet our needs as researchers.