A cacophony of comparative genomics papers

A nice series of comparative genomics articles has been published in the last few weeks. The pace of genome sequencing has accelerated to the point that we have lots of sequencing projects coming from individual labs and small consortia, not necessarily from genome centers. We are seeing a preview of what next (2nd) generation sequencing will enable and can start to imagine what happens when even cheaper 3rd generation sequencing technologies are applied. I’m behind in reviewing these papers for you, dear reader, but I hope you’ll click through and take a look at some of them if you are interested in the topics.

In the following set of papers we have some nice examples of comparative genomics of closely related species and among a clade of species. The papers mentioned below include our work on the human pathogens Coccidioides and Histoplasma (Sharpton et al) studied at several evolutionary distances, a study on the Saccharomycetaceae clade of yeast species (Souciet et al), and a comparison of two species of Candida (Jackson et al): the commensal and opportunistic fungal pathogen Candida albicans and the very closely related species Candida dubliniensis. There is also a nice comparison of strains of Saccharomyces cerevisiae looking at effects of domestication and examples of horizontal transfer.

There is also a report of de novo sequencing of a filamentous fungus using several approaches: traditional Sanger sequencing, 454, and Illumina/Solexa (DiGuistini et al).

Finally, a paper from a few months ago (Ma et al) gives a fantastic look at one of the early branches in the fungal tree, the Mucorales (formerly Zygomycota), via the genome of Rhizopus oryzae. This paper is a really excellent example of what we can learn about a group of species by contrasting genomic features in the early branches of the tree with the better studied Ascomycete and Basidiomycete fungi. More genome sequences will help us build on these findings and clarify whether some of the observations are unique to the lineage or universal aspects of the earliest fungi.

I hope you enjoy!

Early branching genomes available

Genome sequencing is underway on several early branches in the Opisthokonts and some related lineages as part of the “Origins of Multicellularity” project at the Broad Institute (BI). Recently released assemblies include:

  • Allomyces macrogynus (Blastocladiomycota “Chytrid”)
  • Capsaspora owczarzaki (Ichthyosporea)

Already available data from

Still in progress (BI)

Still in progress (Other centers)

For your reading pleasure

Too much on my plate as of late, so I’m woefully behind on posting much on interesting papers or news.  Here’s a short list of links and papers that are worth a look though.

  • “Evolution of pathogenicity and sexual reproduction in eight Candida genomes” published (Nature)
  • NYT Science article sort of summarizing the good, bad, and ugly of fungi and human interactions
  • Attempts to save amphibians from chytridiomycosis “Riders of a Modern-Day Ark” (PLoS Biology)
  • Looks like Scott Baker and the JGI are in the process of resequencing several classical mutant strains of Phycomyces, Neurospora, Cochliobolus, and Cryphonectria for sequence-based mapping of mutants (i.e. here and here and here).

Yeast population genomics

I have cheered the work of the Sanger-Wellcome SGRP group to generate multiple Saccharomyces cerevisiae and S. paradoxus strain genome sequences. The group had previously submitted a version of the manuscript to Nature Precedings and it is now published in Nature AOP, showing that submitting to a preprint server doesn’t necessarily hurt your manuscript’s chances of getting published… The research groups explored the impact of domestication (as was also recently done for the sake and soy sauce worker fungus, Aspergillus oryzae) on the Saccharomyces genome by comparing S. cerevisiae strains with individuals from wild strains of S. paradoxus.

This paper addressed several challenges, including methodology for light genome sequencing for population genomics. These data represent, in a way, a pilot project for genome resequencing projects and for using draft genome sequencing with next generation sequencing tools. Of course, with the pace of sequencing technology development, any project more than a couple of months old will be using outdated technology it seems, but this work represents some important progress. Tools like MAQ were also developed and tuned as part of the project. In addition to the methods development, it also provided a new look at the evolutionary dynamics of a well-studied fungus.

Genome assembly

The authors apply several different quality controls and use a new tool called PALAS (Parallel ALignment and ASsembly) to assemble all the strains at the same time using a graph-based approach that draws on the reference genome sequence for each species. This is different from a full-blown whole-genome assembly (WGA) approach like PCAP, Phusion or Arachne because this is a deliberately low-coverage sequencing pass. The authors try to impute missing sequence via Ancestral Recombination Graphs as implemented in the Margarita system. They also use MAQ to align sequence from Illumina/Solexa sequencing to the assemblies made by PALAS.

Since this project was on two species of Saccharomyces, S. cerevisiae and S. paradoxus, they needed good reference assemblies for each of these species. The previously available S. paradoxus assembly wasn’t complete enough for this study, so they added 4.3X coverage with Sanger/ABI sequencing and 80X coverage with Illumina.
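For a rough sense of what those coverage figures mean, fold coverage is just total sequenced bases divided by genome size. Here is a minimal sketch; the 12 Mb genome size and the read lengths in the usage lines are illustrative assumptions of mine, not numbers from the paper.

```python
import math


def fold_coverage(num_reads, read_length, genome_size):
    """Average depth: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest whole number of reads that reaches the target coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


if __name__ == "__main__":
    GENOME_SIZE = 12_000_000  # assumed ~12 Mb yeast genome (not from the paper)
    # Assumed read lengths: ~700 bp for Sanger, ~36 bp for early Illumina.
    print(reads_needed(4.3, 700, GENOME_SIZE))
    print(reads_needed(80, 36, GENOME_SIZE))
```

The point of the comparison is only the scale: reaching high depth with short reads takes orders of magnitude more reads than a light Sanger pass.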

Population genomics and domestication

The sequencing data also provided a framework for population genetic investigations. Some simple findings showed that, within each species, isolates from the same geographic region were more genetically similar to each other. The main geographic regions sampled for S. paradoxus included the UK, America, and the Far East, some of which had been analyzed in a very nice study on Chromosome III. For S. cerevisiae there were individuals from around Europe, at least 10 European wine strains, Malaysian strains, sake brewing strains, and strains from West Africa and North America. From these data it was possible to discover that there are several strains with mosaic genomes, meaning that pieces of the genome match best with the sake fermentation strains and other parts with the wine/European samples.

Efforts to detect the effects of natural selection that may be linked to domestication of these strains explored two different approaches. The McDonald-Kreitman test did not identify any loci under positive selection, while Tajima’s D was negative in the S. cerevisiae global and wine strain populations, indicating an excess of singleton polymorphisms, though the authors draw few conclusions from that. The authors also observed a sharper decay of linkage disequilibrium in S. cerevisiae (half maximum of 3 kb) than in S. paradoxus (half maximum of 9 kb), suggesting that S. cerevisiae is recombining more, either due to increased opportunities or a greater frequency of recombination events when it does.
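To make the Tajima’s D point concrete, here is a minimal sketch of the standard statistic (Tajima 1989) computed from a set of aligned sequences. It is not the authors’ pipeline: it assumes equal-length sequences with no gaps or missing data, and the toy alignments are invented for illustration.

```python
import math


def segregating_sites(seqs):
    """Number of alignment columns with more than one allele."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def mean_pairwise_differences(seqs):
    """Average number of differences between all pairs of sequences (pi)."""
    n = len(seqs)
    total_pairs = n * (n - 1) / 2
    pi = 0.0
    for column in zip(*seqs):
        counts = {}
        for base in column:
            counts[base] = counts.get(base, 0) + 1
        # Pairs that carry the same allele at this site.
        same_pairs = sum(c * (c - 1) / 2 for c in counts.values())
        pi += (total_pairs - same_pairs) / total_pairs
    return pi


def tajimas_d(seqs):
    """Tajima's D; negative values indicate an excess of rare variants."""
    n = len(seqs)
    if n < 4:
        raise ValueError("need at least 4 sequences")
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    s = segregating_sites(seqs)
    if s == 0:
        raise ValueError("no segregating sites, D is undefined")
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    pi = mean_pairwise_differences(seqs)
    # Difference between two estimates of theta, scaled by its standard deviation.
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


if __name__ == "__main__":
    # Toy data: every variable site is a singleton, so D comes out negative.
    singletons = ["TAAAA", "ATAAA", "AATAA", "AAATA", "AAAAT", "AAAAA"]
    # Toy data: every variable site is split 3/3, so D comes out positive.
    balanced = ["AAAAA"] * 3 + ["TTTTT"] * 3
    print(tajimas_d(singletons), tajimas_d(balanced))
```

An excess of singletons pulls the pairwise diversity estimate below the estimate based on the number of segregating sites, which is why a negative D is read as a sign of many rare variants.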

In context of the paper title and the idea of exploring the effects of domestication on the genome, the authors observe that the standard paradigm that ‘domesticated’ species have lower diversity levels is simply not the case in these samples.  This isn’t to say there isn’t evidence of the selection for fermentation production from these strains based on the stress response conditions they were tested on, but that there is still ample evidence of maintaining diversity within the populations presumably through various amounts of outcrossing.

We are also interested in these results as we apply similar questions to population genomics of the human pathogenic fungus Coccidioides, where 14 strains have been sequenced with Sanger sequencing technology. Hopefully some of these lessons will resonate in our analyses, and this era of population genomics will see ever more extensive collections to address aspects of migration, phylogeography, and local adaptation within populations of fungi and other microbes.

Melampsora larici-populina genome sequenced

From Francis Martin

The DNA sequence of Melampsora larici-populina has been determined by the U.S. Department of Energy Joint Genome Institute (DOE JGI). Annotations of the v1.0 assembly of Melampsora larici-populina are publicly available at http://www.jgi.doe.gov/Melampsora.
Genome analyses have been carried out by an international consortium comprised of DOE JGI, France’s National Institute for Agricultural Research (F Martin et al., INRA-Nancy), Canadian Forest Service (R Hamelin et al., Laurentian Forestry Centre), and the Bioinformatics & Evolutionary Genomics Division (Rouzé et al., Gent University) in Belgium.

The poplar leaf rust fungus Melampsora is the most devastating and widespread pathogen of poplars, and has limited the use of poplars for environmental and wood production goals in many parts of the world. All known poplar cultivars are susceptible to Melampsora species, and new virulent strains are continuously developing. This disease therefore has a strong potential impact on current and future poplar plantations used for production of forest products (principally pulp and consolidated wood products), carbon sequestration, biofuels production, and bioremediation.

Lichen genome projects and the power shift prompted by next-gen sequencing

Genome Technology highlights the very cool thing about next-gen sequencing: it puts the power in the hands of the researchers to explore genome sequence and doesn’t limit them to projects only funded through sequencing centers. The Genome Technology piece highlights work at Duke to sequence the genome of Cladonia grayi, a lichenized fungus, with 454 technology at Duke’s Institute for Genome Sciences and Policy through their next-gen sequencing program. This is the way of the future, where sequencing core facilities will be able to generate sequence and researchers will only have to wait in the queue at their own university rather than in community sequencing project or sequencing center proposal queues.

This isn’t the only lichen being sequenced. Xanthoria parietina is also in the queue at JGI, but has taken a while to get going because of some logistical problems getting the DNA (and any problems are amplified because it takes a long time to get new material since lichens grow very slowly).

Putting the power in researchers’ hands to do quick exploratory whole-genome sequencing with next-gen and, eventually, to produce high quality genome sequences from next-gen sequencing is predicted to transform how this kind of science gets done. It means we’ll probably just sequence a mutant strain instead of trying to map the mutation; this is happening already in anecdotal stories in worms and in our work in mushrooms. N.B. this is done after a mutagenized strain has been cleaned up a bit to ensure we’re looking for one or only a few mutations based on some crosses, but that is part of standard genetic approaches anyway.

This fast, cheap, whole-genome sequencing is also the stuff of personal genomics, but for basic research it will also mean that a first pass exploring the gene repertoire of an organism will be a multi-week instead of multi-year project. I just hope we’re training enough people who can efficiently extract the information from all this data with solid bioinformatics, computational, data-oriented programming, and statistical skills to support all the labs that will want to take this approach. You’ll need a life-vest to swim in the big data pool for a while until more tools are developed that can be deployed by non-experts.

Dermatophyte genome sequences

The first of several dermatophyte fungal genomes, Microsporum gypseum, has been released at the Broad’s Dermatophyte site. Two Trichophyton species and another Microsporum genome should follow soon. These dermatophyte fungi are Onygenales (Ascomycota) fungi (like Coccidioides and Histoplasma), although their placement in the phylogenies shown in the whitepaper and related review paper is a bit ambiguous. I’m sure that can be improved with a few more gene sequences gleaned from the genomes.

The 23 Mb M. gypseum genome is a bit smaller than the sizes of C. immitis (28 Mb), H. capsulatum (32 Mb), or Paracoccidioides brasiliensis (29 Mb).  While no annotation is currently available for the M. gypseum genome, this genome will help in establishing what genes were ancestral in the Onygenales and comparing patterns of gene family gains and losses in fungi that specialize on animal hosts.

Some more comparison across different kinds of dermatophyte fungi that are very distantly related, like the dandruff-causing fungus Malassezia globosa (Basidiomycota), will be really interesting as well.

Thanks to Joe H and the FGI folks for passing along the announcement, and to the Broad/FGI folks for the work to make this sequence available.

A word about databases

Report concludes that a fungal genome database is of “the highest priority”.

This is the title as listed in PubMed for this article from Future Medicine about the AAM report on charting future needs and avenues of research on the fungal kingdom.

A comprehensive database for information about fungi, starting at least with systematic collections of genomic and transcript data, is highlighted as a major need. Really, any sort of new database effort should strive to be more comprehensive and include genetic and population data (alleles, strains) and information like protein-protein and protein-nucleic acid interactions (as Pedro mentioned). But on top of that, it needs to be comparative so that information from systems that serve as great models can be transferred to other fungal systems that are being studied for their role as pathogens or their interactions in the environment.

Affordable next-gen sequencing will allow us to obtain genome and transcript sequence for basically all species or strains of interest.  Researchers with no bioinformatics support in their lab will likely be able to outsource this to a company or campus core facility.  But how can they easily map in the collective information about genes, proteins, and pathways onto this new data?  And have it be a dynamic system that can update as new information is published and curated in other systems.

I think this has to be the future beyond setting up a SGD, CGD, etc for every system. The individual databases are useful for a large enough community where there are curators (and funding), but we will have to move to a more modular system in the future (aspects of which are in GMOD) that can have both an individual focus on a specific species/clade and a more comprehensive view that is comparable across the kingdom. There are 100+ fungal genomes, but the community size for some of them is in the dozens of labs or less. How can they take advantage of the new resources without an existing infrastructure of curators? Their systems serve an important need in a research aim, but how can discoveries there make their way back into the datastream of other systems?

I see several ways one would interact with a system that provided single-genome tools as well as a framework for comparative information. At a gene level, one might be looking for all information about a specific gene, based on sequence similarity searches, or starting with a cloned gene in one species. Something akin to Phylofacts or precomputed Orthogroups for defining a gene, but with more information about function linked in from all sources. So a comparative resource, but also tapping into curated and literature-mined data.

At a genome level, one might want to do whole genome comparisons of gene content from evolutionarily defined gene families (gene family size change) or at a functional level. To start out with, each gene/protein would already need a systematic functional mapping. This could be as simple as running InterProScan on every protein, expanded to find Orthogroups (or OrthoMCL orthologs) and transfer function from model systems, and finally, even more advanced, classified further with tools like SIFTER.
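As a small illustration of the gene family size comparison described above, here is a sketch that tallies family sizes per species from a table of ortholog group assignments. The input format, the function names, and the gene and family identifiers are all hypothetical; real assignments would come from something like OrthoMCL output.

```python
from collections import defaultdict


def family_sizes(assignments):
    """Count genes per family per species.

    assignments: iterable of (species, gene_id, family_id) tuples
    (a hypothetical input format chosen for this sketch).
    """
    sizes = defaultdict(lambda: defaultdict(int))
    for species, _gene_id, family_id in assignments:
        sizes[family_id][species] += 1
    return {family: dict(counts) for family, counts in sizes.items()}


def size_changes(sizes, species_a, species_b):
    """Families whose size differs, as (count in b) minus (count in a)."""
    changes = {}
    for family, counts in sizes.items():
        diff = counts.get(species_b, 0) - counts.get(species_a, 0)
        if diff != 0:
            changes[family] = diff
    return changes


if __name__ == "__main__":
    # Made-up genes and families, for illustration only.
    toy = [
        ("species_A", "geneA1", "FAM1"),
        ("species_A", "geneA2", "FAM1"),
        ("species_B", "geneB1", "FAM1"),
        ("species_A", "geneA3", "FAM2"),
        ("species_B", "geneB2", "FAM2"),
    ]
    print(size_changes(family_sizes(toy), "species_A", "species_B"))
```

A raw difference in counts is only a starting point; deciding whether a change is an expansion or a loss needs a phylogeny and more than two species.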

Interlinked with these orthologous and paralogous gene sets would be anchors for analyses of chromosomal synteny and even comparative assembly, including tools like Mercator. Certainly tools for all of this exist, but making them more pluggable for different sets of species would be an important additional component.

At a utility level, the gene annotation and functional mapping of all this information should be possible. I would imagine a researcher could upload the sequence assembly they received from the core facility and the system can generate multiple gene predictions, annotate the genes, and link these genes within the known orthogroups of the system (preserving their privacy for these genes if desired).  Presumably this sort of thing would be easier as a standalone in-house for the researcher, but web services could also be the place for this.

For fungal-sized genomes this amount of data is not too extreme. Things like Genome Browser, BLAST, etc should all be rolled out of the box based on the basic builds.

On the DIY and community annotation front, there would also need to be a layer of community derived annotation that could be layered on all these systems. I would imagine this both to be for gene structure annotation (genome annotation) and functional annotation (protein X does Y based on experiment Z, here is the journal reference). I think aspects of this would be visible, auditable (tracked), but maybe not blessed as official until a curator could oversee these inputs. In my mind, whether this is in a Wiki per se or just a new system that allows community input is less important than having it be a) structured (not a bunch of free text) b) tracked and versionable c) easy for researchers to input so that the knowledge is captured, even if it has to be reorganized later on.
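To show what “structured, tracked, and versionable” could look like in practice, here is a minimal sketch of a community annotation record with a version history and a curator approval step. The class names, fields, and example values are my own hypothetical choices, not part of GMOD or any existing system.

```python
from dataclasses import dataclass, replace
from typing import Dict, List


@dataclass(frozen=True)
class Annotation:
    """One structured claim: protein X does Y based on experiment Z."""
    gene_id: str
    function: str
    evidence: str
    reference: str
    submitter: str
    version: int = 1
    curated: bool = False  # stays False until a curator approves it


class AnnotationStore:
    """Keeps every version of every annotation so changes are auditable."""

    def __init__(self) -> None:
        self._history: Dict[str, List[Annotation]] = {}

    def submit(self, gene_id, function, evidence, reference, submitter):
        """Record a new community submission as the next version."""
        versions = self._history.setdefault(gene_id, [])
        record = Annotation(gene_id, function, evidence, reference,
                            submitter, version=len(versions) + 1)
        versions.append(record)
        return record

    def approve(self, gene_id):
        """Curator sign-off, stored as a new version of the latest record."""
        versions = self._history[gene_id]
        approved = replace(versions[-1], curated=True,
                           version=len(versions) + 1)
        versions.append(approved)
        return approved

    def current(self, gene_id):
        return self._history[gene_id][-1]

    def history(self, gene_id):
        return list(self._history[gene_id])


if __name__ == "__main__":
    store = AnnotationStore()
    # All values below are placeholders, not real annotations.
    store.submit("GENE_0001", "putative kinase", "sequence similarity",
                 "unpublished", "researcher_1")
    store.approve("GENE_0001")
    print(store.current("GENE_0001"))
```

Because records are immutable and every change appends a new version, earlier submissions stay visible even after a curator approves or someone revises the entry.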

Seems like a lot of work to be done, but really many of these things already exist through what the GMOD project has built. There are many loose ends and software that doesn’t fully meet these needs, but I think the important concept is that these are all general solutions that will be of benefit to most communities, not just the fungal ones. One lingering question I always have when approaching genomic data that will be dynamic: what, if any, of this makes its way into GenBank? How is this sort of thing banked so that it can be captured, and does the improved functional or gene structure annotation ever make its way into the repository databases to correct and improve what has already been submitted there?

AAM Releases “The Fungal Kingdom” Report

The American Academy of Microbiology has released a report (PDF and archived on fungalgenomes.org) on the Fungal Kingdom outlining the importance of research in the kingdom and recommending several priority areas for future research.

One recommendation that makes the top of the list is an integrated database for fungal genomes, something we’re keenly interested in seeing happen. This sort of centralized repository of functional annotation, literature links, and genome sequences and annotation is critical given the 150+ genomes that are available or on their way. Systematic re-annotation with consistent tools, comparative analyses and gene predictions, and linking gene sequences by homology and ortholog predictions are critical components of fully utilizing the genomic data that have been produced for the fungi and other organisms.