Tag Archives: methods

Yeast population genomics

I have been cheering on the Sanger-Wellcome SGRP group’s work to generate genome sequences for multiple Saccharomyces cerevisiae and S. paradoxus strains. The group had previously submitted a version of the manuscript to Nature Precedings, and it is now published in Nature AOP, showing that submitting to a preprint server doesn’t necessarily hurt a manuscript’s chances of being published. The research groups explored the impact of domestication on the Saccharomyces genome (as was also recently done for the sake and soy sauce worker fungus, Aspergillus oryzae) by including individuals from wild strains of S. paradoxus in the comparison.

This paper addressed several challenges, including methodology for light genome sequencing for population genomics. These data represent, in a way, a pilot for genome resequencing projects that combine draft genome sequencing with next-generation sequencing tools. Of course, with the pace of sequencing technology development, any project more than a couple of months old will seem to be using outdated technology, but this work represents some important progress. Tools like MAQ were also developed and tuned as part of the project. In addition to the methods development, it also provided a new look at the evolutionary dynamics of a well-studied fungus.

Genome assembly
The authors apply several different quality controls and use a new tool called PALAS (Parallel ALignment and ASsembly) to assemble all the strains at the same time with a graph-based approach that draws on the reference genome sequences for each species. This is different from a full-blown WGA approach like PCAP, Phusion or Arachne because this is a deliberately low-coverage sequencing pass. The authors try to impute missing sequence via Ancestral Recombination Graphs as implemented in the Margarita system. They also use MAQ to align sequence from Illumina/Solexa sequencing to the assemblies made by PALAS.

Since this project was on two species of Saccharomyces, S. cerevisiae and S. paradoxus, they needed good reference assemblies for each of these species. The previously available S. paradoxus assembly wasn’t complete enough for this study, so they added 4.3X coverage with Sanger/ABI sequencing and 80X coverage with Illumina.

Population genomics and domestication

The sequencing data also provided a framework for population genetic investigations. Some simple findings showed that, within each species, isolates from the same geographic region were more genetically similar to each other. The main geographic regions sampled for S. paradoxus were the UK, America, and the Far East, and some of these samples had been analyzed in a very nice study on Chromosome III. For S. cerevisiae there were individuals from around Europe, at least 10 European wine strains, Malaysian strains, sake brewing strains, and strains from West Africa and North America. From these data it was possible to discover that several strains have mosaic genomes, meaning that some pieces of the genome match best with the sake fermentation strains and other parts with the wine/European samples.

Efforts to detect the effects of natural selection that may be linked to domestication of these strains explored two different approaches. The McDonald-Kreitman test did not identify any loci under positive selection, while Tajima’s D was negative in the S. cerevisiae global and wine strain populations, indicating an excess of singleton polymorphisms (though the authors draw few conclusions from that). The authors also observed a sharper decay of linkage disequilibrium in S. cerevisiae (half maximum of 3kb) than in S. paradoxus (half maximum of 9kb), suggesting that S. cerevisiae is recombining more, either due to increased opportunities for outcrossing or a greater frequency of recombination events when it does outcross.
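To make the Tajima’s D result a bit more concrete, here is a minimal sketch of the statistic using Tajima’s standard 1989 formulas. It is not the authors’ pipeline: the function name `tajimas_d` and the toy alignments are my own illustrations, and the code assumes equal-length sequences with no gaps or missing data. An alignment where every variable site is a singleton gives a negative D, while one with variants at intermediate frequency gives a positive D.

```python
from collections import Counter
from math import sqrt


def tajimas_d(sequences):
    """Tajima's D for a list of aligned, equal-length sequences (no gaps)."""
    n = len(sequences)
    if n < 4:
        raise ValueError("need at least 4 sequences for a non-zero variance")
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must be aligned to the same length")

    num_pairs = n * (n - 1) / 2
    seg_sites = 0   # S: number of segregating (variable) sites
    pi = 0.0        # average number of pairwise differences
    for column in zip(*sequences):
        counts = Counter(column)
        if len(counts) < 2:
            continue
        seg_sites += 1
        identical_pairs = sum(c * (c - 1) / 2 for c in counts.values())
        pi += (num_pairs - identical_pairs) / num_pairs
    if seg_sites == 0:
        raise ValueError("no segregating sites; D is undefined")

    # Constants from Tajima (1989)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    # Difference between the two estimators of theta, scaled by its std. dev.
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)


# Toy alignments (hypothetical data, not from the paper)
singletons = ["TAAA", "ATAA", "AATA", "AAAT"]   # every variant seen once
balanced = ["AAAA", "AAAA", "TTTT", "TTTT"]     # variants at 50% frequency

print("singleton-rich D:", tajimas_d(singletons))
print("intermediate-frequency D:", tajimas_d(balanced))
```

A negative value on its own does not separate selection from demography (population expansion gives the same signal), which may be why the authors are cautious about interpreting it.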

In the context of the paper title and the idea of exploring the effects of domestication on the genome, the authors observe that the standard paradigm that ‘domesticated’ species have lower diversity levels simply does not hold in these samples. This isn’t to say there is no evidence of selection for fermentation in these strains, based on the stress response conditions they were tested under, but there is still ample evidence that diversity is maintained within the populations, presumably through various amounts of outcrossing.

We are also interested in these results as we apply similar questions to population genomics of the human pathogenic fungus Coccidioides, where 14 strains have been sequenced with Sanger sequencing technology. Hopefully some of these lessons will resonate in our analyses, and this era of population genomics will see ever more extensive collections to address aspects of migration, phylogeography, and local adaptation within populations of fungi and other microbes.

Gianni Liti, David M. Carter, Alan M. Moses, Jonas Warringer, Leopold Parts, Stephen A. James, Robert P. Davey, Ian N. Roberts, Austin Burt, Vassiliki Koufopanou, Isheng J. Tsai, Casey M. Bergman, Douda Bensasson, Michael J. T. O’Kelly, Alexander van Oudenaarden, David B. H. Barton, Elizabeth Bailes, Alex N. Nguyen, Matthew Jones, Michael A. Quail, Ian Goodhead, Sarah Sims, Frances Smith, Anders Blomberg, Richard Durbin, Edward J. Louis (2009). Population genomics of domestic and wild yeasts. Nature. DOI: 10.1038/nature07743

Fun with estimating divergence times


Estimating divergence times is notoriously difficult, and the field can be downright rancorous, with some being accused of reading tea leaves and chicken entrails. It makes for interesting reading for the personalities as much as for the different scientific approaches. There are several different approaches to estimating a divergence time among species, using calibration points usually anchored by fossil data. Molecular clock methods sometimes produce extremely old dates that are quite hotly debated. In fungi we have very few fossils (and their placement on the phylogeny is debated).

There are quite a few methods available for reconstructing divergence times, including r8s and multidivtime, which start with various types of trees and use calibration points that are typically informed by fossil dates. The simplest approaches assume a molecular clock (rates are the same across the tree), so one only needs to calibrate the substitution rate against time to determine how branch lengths on the phylogenetic tree map to time. The BEAST package also does phylogenetic inference and divergence time estimation across a sample of trees (and provided the necessary analysis for the exoneration of the Tripoli Six). BEAST (and MrBayes) use MCMC to sample parameter space and tree space to identify phylogenies and evolutionary parameters that explain the data (an alignment of sequences).
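The strict-clock calibration described above is simple enough to sketch in a few lines. This is only an illustration of the arithmetic, not how r8s, multidivtime or BEAST work internally. The function name `strict_clock_ages`, the node names and all of the numbers are made up, and I assume an ultrametric tree where each node’s depth is its distance to the tips in substitutions per site.

```python
def strict_clock_ages(node_depths, calibration_node, calibration_age):
    """Convert node depths (substitutions/site) to ages under a strict clock.

    One calibrated node fixes the rate; every other node is dated
    by dividing its depth by that single rate.
    """
    depth = node_depths[calibration_node]
    if depth <= 0 or calibration_age <= 0:
        raise ValueError("calibration depth and age must be positive")
    rate = depth / calibration_age  # substitutions/site per unit time
    ages = {node: d / rate for node, d in node_depths.items()}
    return rate, ages


# Hypothetical node depths; the calibrated split is assumed to be
# 40 million years old based on an (imaginary) fossil.
depths = {"root": 0.30, "calibrated_split": 0.12, "recent_split": 0.03}
rate, ages = strict_clock_ages(depths, "calibrated_split", 40.0)

print(f"rate: {rate:.4f} substitutions/site/My")
for node, age in ages.items():
    print(f"{node}: {age:.1f} My")
```

Everything hinges on that one rate, which is exactly why relaxed-clock methods and the placement of calibration fossils generate so much argument.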

A paper from Akerborg and colleagues introduces a new approach that starts from the same Bayesian framework but applies a few twists: it uses a birth-death model that doesn’t assume a molecular clock, and it employs a hill-climbing algorithm instead of MCMC to find parameter optima. They use a maximum a posteriori (MAP) framework, which is more computationally efficient than MCMC. They couple the MAP approach with a dynamic-programming approach that separates the estimation of rates (branch lengths) from the estimation of times (which often requires the assumption of a molecular clock). While I can’t speak with much authority on the MAP approach, or yet on how well it compares on different datasets, it suggests a different way to tackle these problems. The authors point out that one drawback of their approach is that it only allows for the derivation of point estimates, so statistical confidences like bootstrap support are not easily calculated. Their software, called PRIME, is available, and I will be curious to see how it performs in other people’s hands.

Akerborg, O., Sennblad, B., Lagergren, J. (2008). Birth-death prior on phylogeny and speed dating. BMC Evolutionary Biology, 8(1), 77. DOI: 10.1186/1471-2148-8-77

B. dendrobatidis strain JAM81 released

The following is an announcement to the B. dendrobatidis and fungal community at large from Alan Kuo at JGI. This is the JAM81 strain (Jess Morgan collected it from a frog in the California Sierra Nevada). The genome sequence and annotation of the JEL423 strain (Joyce Longcore, collected in Panama) are available from the Broad Institute.

Please do contact me if you would like to contribute to assigning functions to the annotation. We’re in the last round of analyses for some of the genome work, but if there are particular questions you want to contribute to, we’re open to collaborators and can outline the basis of our work to see how other work can complement it.

From Alan Kuo at JGI:

The JGI Batrachochytrium annotation portal is now on the public JGI website. As it is public, no password is required.

For those of you who have not yet registered to be an annotator, go to this new link to register. As before, please choose a username that is personal, so that other annotators may be able to recognize it as yours. A derivative of your personal name would be best.

Those of you who are already registered, you do not need to do anything. Your old pre-release username and password are valid on the new public portal too.

As always, please direct all questions and problems to me. Use email or phone: Cheers, Alan.

Some information about the assembly and annotation:

This is the first annotation of the 127 scaffolds and 24 Mbp of JGI’s 8.74X assembly of the Batrachochytrium dendrobatidis JAM81 genome. We predict 8732 genes, with the following average properties:

Gene length 1825.16 nt
Transcript length 1407.29 nt
Protein length 450.56 aa
Exon frequency 4.29 exons/gene
Exon length 328.37 nt
Intron length 129.18 nt
Gene density 359.1 genes/Mbp scaffold

The genes were found by the following methods:
Total models 8732 (100%)
Jason’s models 3214 (37%)
cDNAs and ESTs 518 (6%)
Similarity to nr 1928 (22%)
ab initio 3072 (35%)

The genes were validated by the following evidence:
start+stop codons 7990 (92%)
EST support 2488 (28%)
nr hit 6787 (78%)
Pfam hit 4329 (50%)
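
As a quick sanity check on the numbers in the announcement, the percentages can be recomputed from the counts. This is just arithmetic on the figures quoted above; the dictionary keys and variable names are my own, and the "implied assembly size" is simply gene count divided by the quoted gene density.

```python
total_models = 8732

# Counts copied from the announcement above
model_sources = {
    "Jason's models": 3214,
    "cDNAs and ESTs": 518,
    "Similarity to nr": 1928,
    "ab initio": 3072,
}
validation = {
    "start+stop codons": 7990,
    "EST support": 2488,
    "nr hit": 6787,
    "Pfam hit": 4329,
}


def percent(count, total):
    """Percentage of total, rounded to the nearest whole number."""
    return round(100 * count / total)


# The four prediction methods should account for every gene model.
print("sources sum to total:", sum(model_sources.values()) == total_models)

for label, count in {**model_sources, **validation}.items():
    print(f"{label}: {count} ({percent(count, total_models)}%)")

# Gene density is quoted as 359.1 genes/Mbp, which implies this assembly size:
implied_mbp = total_models / 359.1
print(f"implied assembly size: {implied_mbp:.1f} Mbp")
```

The source categories are mutually exclusive (they sum to the total), while the validation categories overlap, which is why those percentages add up to well over 100%.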

Orthology detection software

A paper in PLoS One, Assessing Performance of Orthology Detection Strategies Applied to Eukaryotic Genomes, reports a new approach to assess the performance of automated orthology detection. These authors also wrote OrthoMCL (2006 DB paper, 2003 algorithm paper), which uses MCL to build orthologous gene families. The authors discuss the trade-offs between highly sensitive and specific tree-based methods and faster but less sensitive approaches such as Best-Reciprocal-Hits from BLAST or FASTA, or some of the hybrid approaches. The authors employ Latent Class Analysis (LCA) to aid in “evaluation and optimization of a comprehensive set of orthology detection methods, providing a guide for selecting methods and appropriate parameters”. LCA is also the statistical basis for feature choice when combining gene predictions into a single set of gene calls in GLEAN, written by many of the same authors, including Aaron Mackey.
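
For readers who haven’t run into the Best-Reciprocal-Hits idea, here is a minimal sketch of it. This is not OrthoMCL or any of the tools evaluated in the paper; the function names, gene IDs and scores are hypothetical, and I assume the similarity search results have already been parsed into a dictionary of (query, subject) pairs with a score where higher is better.

```python
def best_hits(scores):
    """Map each query to its single highest-scoring subject.

    scores: dict of (query, subject) -> similarity score (higher is better).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores for genome A searched against genome B, and vice versa
a_vs_b = {("a1", "b1"): 250, ("a1", "b2"): 90,
          ("a2", "b2"): 180, ("a3", "b2"): 200}
b_vs_a = {("b1", "a1"): 245, ("b2", "a3"): 198, ("b2", "a2"): 175}

print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

Note that a2 is left without an ortholog even though it has a decent hit to b2. That is the lost sensitivity with paralogs and gene families that the tree-based and clustering methods try to recover.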

I’ve been reading a lot of orthology and gene tree-species tree reconciliation papers lately; some are listed by Ian Holmes’s group, and some of the software is listed on the BioPerl site. This also follows on from our Phyloinformatics hackathon work, which we are trying to formalize in more documentation for phyloinformatics pipelines to support some of the described use cases. I’m also applying some of this to a tutorial I’m teaching at ISMB2007 this summer.

Fungal Genetics 2007 details

I’m including a recap of as many of the talks as I remember. There were 6 concurrent sessions each afternoon, so you have to miss a lot of talks. The conference was bursting at the seams as it was: at least 140 people had to be turned away beyond the 750 who attended.

If there was any theme in the conference it was “Hey we are all using these genome sequences we’ve been talking about getting”. I found the overview talks that solely describe a genome a little dry compared to those more focused on particular questions. I guess my genome palate is becoming refined.


Deeper and Deeper, Down the Transcriptome-hole We Fall

Your eye contains the same genetic content as your fingernail, but these two tissues look nothing alike. One significant cause of this difference is the tissue specific regulation of the genes in the genome. In some tissues in your body, a gene may be expressed (transcribed) while that same gene may be silent in another tissue type. A great deal of modern biological research explores the regulation of expression of all the genes in a genome, collectively known as the transcriptome. Such studies are, for example, aimed at understanding which genetic regulation events account for the differences between an eye and a fingernail.

However, the effectiveness of this research is predicated upon actually knowing which parts of the genome are capable of being expressed and, subsequently, regulated. Conventionally, researchers extract RNA from an organism grown in various conditions (or, as in the case of our example, various tissues from an organism) and clone and sequence the RNA to identify at least a subset of genes that are expressed (Ebbole 2004*). Such Expressed Sequence Tags (ESTs) have proven vital to our understanding of gene and gene structure annotation as they frequently provide evidence of intron splice sites. While this method has facilitated a robust understanding of gene regulation, it is expensive, time consuming, and provides a relatively low coverage of the transcriptome. If our goal is to understand everything that is expressed, then we need a superior tool.

Enter SAGE (serial analysis of gene expression) and MPSS (massively parallel signature sequencing) [Irie 2003*, Harbers 2005*]. Both methods sequence short tags of a transcript’s 3′ end. SAGE uses conventional sequencing technology while MPSS uses Solexa, Inc.’s novel bead-based hybridization technology. One of the massive advantages of these technologies is the number of sequences they provide: large EST databases are on the order of several tens of thousands, while SAGE generally provides 100,000 to 200,000 tags and MPSS can provide over a million signatures. That being said, there are still questions about how sensitively and how deeply these methods cover the transcriptome. It may well be that despite a lower total sequence count, ESTs provide more information about what parts of the genome are expressed.

Fortunately, Gowda et al. put all three methods to work, as well as an RNA microarray (which doesn’t provide sequence, but enables its inference through hybridization), in their recent study of the Magnaporthe grisea transcriptome [Gowda 2006]. M. grisea is the causative agent of rice blast, a devastating disease that results in tremendous crop yield loss. The researchers evaluated two tissue types: the non-pathogenic mycelium and the invasive, plant-penetrating appressorium.

Interestingly, 40% of the MPSS tags and 55% of the SAGE tags identified represent novel genes, as they had no matches in the existing M. grisea JGI EST collection. Additionally, the authors found that no one method could identify the majority of the transcripts, but that a two-way combination of array data, MPSS or SAGE could provide over 80% of the total unique transcripts identified by all of the methods. One additional surprise was that roughly a quarter of the genes identified also produced an antisense RNA, possibly for siRNA regulation of the gene.
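
The "two-way combination" comparison boils down to set unions. Here is a toy sketch of that bookkeeping. The transcript IDs are invented and are not data from Gowda et al., and the function name `pairwise_coverage` is my own. Given the set of transcripts each method detects, it reports what fraction of all detected transcripts each pair of methods recovers.

```python
from itertools import combinations


def pairwise_coverage(detected):
    """Fraction of all detected transcripts recovered by each pair of methods.

    detected: dict of method name -> set of transcript IDs it identified.
    """
    all_transcripts = set().union(*detected.values())
    coverage = {}
    for first, second in combinations(sorted(detected), 2):
        combined = detected[first] | detected[second]
        coverage[(first, second)] = len(combined) / len(all_transcripts)
    return coverage


# Hypothetical detections for ten made-up transcripts t1..t10
detected = {
    "array": {"t1", "t2", "t3", "t4"},
    "EST": {"t1", "t5"},
    "MPSS": {"t2", "t3", "t6", "t7", "t8"},
    "SAGE": {"t3", "t4", "t8", "t9", "t10"},
}

for pair, fraction in pairwise_coverage(detected).items():
    print(pair, f"{fraction:.0%}")
```

In this toy example no single method reaches a majority on its own, but some pairs do much better than others, which is the same kind of pattern the authors describe.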

The long story short appears to be that there is, as of yet, no magic bullet of a method. To adequately cover the transcriptome, multiple techniques are required.

*These references are, unfortunately, not located in an open access journal.