Category Archives: methods

Still time to sign up for EMBO Comparative Genomics meeting

[via Teun Boekhout]

This year looks like another great lineup of speakers for the EMBO Comparative Genomics of Microorganisms: ‘Understanding the Complexity of Diversity’ 15-20 Oct 2011 Sant Feliu de Guixols, Spain.

Andrew Allen J. Craig Venter Institute US
Anders Blomberg Göteborg University SE
Chris Bowler École Normale Supérieure FR
Gertraud Burger University of Montreal CA
Bernard Dujon Institut Pasteur FR
Toni Gabaldón CRG, Barcelona ES
Ursula Goodenough Washington University US
Michael Gray Dalhousie University CA
Joseph Heitman Duke University US
Christiane Hertz-Fowler University of Liverpool UK
Regine Kahmann Max Planck Institute DE
Patrick Keeling University of British Columbia CA
Nicole King UC, Berkeley US
Edda Klipp Humboldt University DE
Veronique Leh Louis University of Strasbourg FR
Jan Pawlowski University of Geneva CH
Jure Piskur Lund University SE
Tom Richards University of Exeter UK
Andrew J. Roger Dalhousie University CA
David Roos University of Pennsylvania US
Iñaki Ruiz-Trillo University of Barcelona ES
Joseph Schacherer University of Strasbourg FR
Artur Scherf Institut Pasteur FR
Joey Spatafora Oregon State University US
Nicholas Talbot University of Exeter UK
Kevin Verstrepen University of Leuven BE
Eric Westhof University of Strasbourg FR
Patrick Wincker Genoscope FR
Ken Wolfe Smurfit Institute of Genetics IE
Alexandra Z. Worden University of California US

Some comments from former participants:

Comments from 2009 meeting

Overall rating

Based on responses from 80% of participants:

Excellent 50%; Very Good 44%; Good 6%.



It is hard to improve the meeting. It’s a good mixture of conference and workshop with a lot of input from expert of adjacent field.

I strongly support the idea the meeting is organized in the future at a regular basis.

Very high quality, open minded with presentations ranging from pure genomics to implementation in the field of ecology; plenty of novelties. Plenty of time to discuss and to establish potential collaborations

I hope to have the possibility to go in the future to this meeting. We learn a lot, and also the size is well, the students have the possibility to talk of discuss with senior

Great work!

Thanks to the organizers for an extremely interesting and productive meeting.

Great meeting. This is a unique meeting because it brings together a group of scientists that dont normally interact with each other. Thus, great opportunities for cross-interactions. This meeting has the potential to fill a very unique niche. I enjoyed meeting new people from diverse fields. I plan to attend again and encourage my colleagues to do so.

This meeting was a great match to my interests but also challenged me to think outside of my normal sphere.  I applaud the organizers and the participants in making this a useful meeting.

The meeting was very well organized and at a very good location. I enjoyed it very much.

I hope this meeting continues as it was a valuable forum for the field of comparative genomics.

This meeting is unique in its broad organism focus. Please keep supporting it.

Fungal barcoding progress

The fungal barcoding conference met in Amsterdam in April and finalized the proposed selection of a barcode sequence for fungal identification. The Barcode of Life and the fungal barcoding working group have already produced searchable databases, and are working on a paper describing their efforts and, importantly, protocols that can be standardized for the analysis and identification of fungi.

Neurospora annotation update (v5)

Here is a message from the Broad Institute about a gene annotation update that was made recently in response to an issue that was revealed in the June 2010 release.  This new version is called V5 and should be on its way to GenBank.

Dear Neurospora scientists,

Recently we discovered an issue with the way locus tags were assigned
to our most recent Neurospora gene set, released publicly on the Broad
website in June of 2010. Many genes in this gene set have mismatched
locus numbers compared to the same genes released in February 2010.
Adding to the confusion, both releases were labeled version 4.

To remedy this we have recalled the June locus numbers and released a
new, version 5 gene set. Genes in this set have been numbered to
preserve historical locus numbers (back to the original genbank
release) as much as possible.

Folks who call their favorite genes by their v1, v2 or v3 numbers can
search for them on our web page, which will map them to v5
automatically and accurately. The same will work for most v4 numbers.
Unfortunately, 863 genes have different locus tags in the two v4
releases. If you search for one of them, you will get two hits - the
v5 gene that the February edition mapped to, and the v5 gene that the
June edition mapped to.

Two examples to clarify:

A. Suppose you search for NCU11713.4 on our web page. This query will
retrieve two genes, NCU11688.5 and NCU11713.5. The gene which in the
February release was called NCU11713.4 is the same as NCU11688.5,
while the gene labeled NCU11713.4 in June is the same as NCU11713.5.

B. Searching for NCU11324.4 yields but one hit because that gene, like
most genes, was consistently numbered between the two releases labeled
version 4.

If you are not sure when you downloaded your genes, the following may
help. If you see any of these locus numbers in your gene set:

NCU00129.4, NCU00457.4, NCU00499.4, NCU00556.4, NCU00627.4,
NCU00685.4, NCU00768.4, NCU00856.4, NCU00986.4, NCU01064.4,
NCU01065.4, NCU01282.4, NCU01299.4, NCU01300.4, NCU01483.4,
NCU01559.4, NCU01560.4, NCU01610.4, NCU01611.4, NCU01664.4,
NCU01665.4, NCU01871.4, NCU01903.4, NCU02200.4, NCU02259.4,
NCU02666.4, NCU02758.4, NCU02837.4, NCU02998.4, NCU03047.4,
NCU03206.4, NCU03773.4, NCU04239.4, NCU04240.4, NCU04518.4,
NCU04519.4, NCU04710.4, NCU04711.4, NCU05275.4, NCU05512.4,
NCU05776.4, NCU06013.4, NCU06370.4, NCU06732.4, NCU07107.4,
NCU07259.4, NCU07260.4, NCU07301.4, NCU07405.4, NCU07856.4,
NCU07857.4, NCU08090.4, NCU08182.4, NCU08323.4, NCU08332.4,
NCU09085.4, NCU09256.4, NCU09257.4, NCU09998.4, NCU10166.4,
NCU10574.4, NCU11040.4, NCU11240.4, NCU11253.4, NCU11376.4,
NCU11390.4, NCU11393.4

then your genes are from the February 2010 gene set. However, if you see

NCU00082.4, NCU00083.4, NCU00084.4, NCU00085.4, NCU00516.4,
NCU01819.4, NCU04299.4, NCU04300.4, NCU04301.4, NCU04302.4,
NCU04303.4, NCU04304.4, NCU04305.4, NCU05000.4, NCU05111.4,
NCU05112.4, NCU05113.4, NCU05114.4, NCU05115.4, NCU05116.4,
NCU05448.4, NCU05452.4, NCU06667.4, NCU07323.4, NCU09066.4,
NCU10179.4, NCU10301.4, NCU10379.4, NCU10383.4, NCU10753.4,
NCU10866.4, NCU10914.4, NCU11068.4, NCU11182.4, NCU12157.4,
NCU12158.4, NCU12159.4, NCU12160.4, NCU12161.4, NCU12162.4,
NCU12163.4, NCU12164.4, NCU12165.4, NCU12166.4, NCU12167.4,
NCU12168.4, NCU12169.4, NCU12170.4, NCU12171.4, NCU12172.4,
NCU12173.4, NCU12174.4, NCU12175.4, NCU12176.4, NCU12177.4,
NCU12178.4, NCU12179.4, NCU12180.4, NCU12181.4, NCU12182.4,
NCU12183.4, NCU12184.4, NCU12185.4, NCU12186.4, NCU12187.4, NCU12188.4

then your genes are from the June 2010 release.

Attached please find five mapping tables which can be used to migrate
locus numbers from any of the previous releases to the latest version
5 locus tags (linked below).

We apologize for any confusion this may cause.
The Broad Institute

I’ve also uploaded the locus update files, which map between versions of the annotation.
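As a rough illustration of how such a mapping table might be applied, here is a minimal Python sketch. It assumes a tab-separated table with the old locus tag in the first column and the v5 tag in the second; that layout and the function names are assumptions for illustration, not the actual format of the Broad files. Because some v4 tags are ambiguous, the sketch keeps every v5 target for an old tag rather than silently picking one.

```python
import csv
import io
from collections import defaultdict


def load_mapping(handle):
    """Read old_tag<TAB>v5_tag rows into a dict of old tag -> sorted list of v5 tags."""
    mapping = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue  # skip blank, short, or comment lines
        mapping[row[0].strip()].add(row[1].strip())
    return {old: sorted(new) for old, new in mapping.items()}


def translate(tags, mapping):
    """Split tags into unambiguous, ambiguous, and unmapped groups."""
    unique, ambiguous, missing = {}, {}, []
    for tag in tags:
        hits = mapping.get(tag, [])
        if len(hits) == 1:
            unique[tag] = hits[0]
        elif hits:
            ambiguous[tag] = hits
        else:
            missing.append(tag)
    return unique, ambiguous, missing


# Toy table: the NCU11713.4 rows follow example A in the message above;
# NCU99999.4 is a made-up tag standing in for a consistently numbered gene.
demo = io.StringIO(
    "NCU11713.4\tNCU11688.5\n"  # February v4 numbering
    "NCU11713.4\tNCU11713.5\n"  # June v4 numbering
    "NCU99999.4\tNCU99999.5\n"  # hypothetical unambiguous tag
)
mapping = load_mapping(demo)
unique, ambiguous, missing = translate(
    ["NCU11713.4", "NCU99999.4", "NCU00000.4"], mapping
)
print(unique, ambiguous, missing)
```

Anything that lands in the ambiguous group needs a manual decision based on which v4 release the gene list came from.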

Microsporidia genomes on the way

New genomes from Microsporidia are on the way from the Broad Institute and other groups, and will be a boon to those working on these fascinating creatures. Microsporidia are obligate intracellular parasites of eukaryotic cells and many can cause serious disease in humans. Some parasitize worms and insects too. The evolutionary placement of these species within the fungi is still debated, with recent evidence placing them as derived members of the Mucormycotina based on shared synteny (conserved gene order), in particular around the mating type locus. Their highly derived characteristics and long branches still make them hard to place. The synteny-based evidence offered another route to a phylogenetic placement, but it would be helpful to have further support in the form of additional shared derived characteristics that group Mucormycotina and Microsporidia. There is hope that an increased number of genome sequences and phylogenomic approaches can help resolve the placement and further our understanding of the evolution of the group.

For data analysis, a new genome database for comparing these genomes is online called MicrosporidiaDB. This project has begun incorporating the available genomes and providing a data mining interface that extends from the EuPathDB project.

Presents for the holidays – Plant pathogen genomes

Though a bit cliché, I think the metaphor of “presents under the tree” for some new plant pathogen genomes summarized in 4 recent publications is still too good to resist. There are 4 papers in this week’s Science that will certainly make a collection of plant pathogen biologists very happy. There are also treats for general-purpose genome biologists, with descriptions of next generation/2nd generation sequencing technologies, assembly methods, and comparative genomics. There is much more inside these papers than I am summarizing, so I urge you to take a look if you have access to these pay-per-view articles, or contact the authors for reprints.


These include the genome of the biotrophic oomycete and Arabidopsis pathogen Hyaloperonospora arabidopsidis (Baxter et al). While preserving the health of Arabidopsis is not a major concern of most researchers, this is an excellent model system for studying plant-microbe interaction. The genome sequence of Hpa provides a look at specialization as a biotroph. The authors found a reduction (relative to other oomycete species) in host-targeted degrading enzymes and also in necrosis factors, suggesting specialization to a biotrophic lifestyle from a necrotrophic ancestor. Hpa also does not make zoospores with flagella like its relatives, and sequence searches for 90 flagella-related genes turned up no identifiable homologs.

While the technical aspects of sequencing are less glamorous now, the authors used Sanger and Illumina sequencing to complete this genome at 45X sequencing coverage, with an estimated genome size of 80 Mb. They produced two assemblies: Velvet on the paired-end Illumina data gave a 56 Mb assembly, and PCAP on the Sanger reads (8X coverage) gave a 70 Mb assembly. The two were merged with an ad hoc procedure that relied on BLAT to scaffold and link contigs across the two assembled datasets. They used CEGMA and several in-house pipelines to annotate the genes in this assembly. Synteny analysis was completed with PHRINGE. A relatively large percentage (17%) of the genome fell into unclassified ‘Unknown repetitive sequence’, more than in P. sojae (12%), so there remain a lot of mystery elements of unknown function in these genomes. If you jump ahead to the Blumeria genome article you’ll see this is still peanuts compared to Blumeria’s genome (64%). The largest known transposable element family in Hpa was the LTR/Gypsy element. Of interest to some following the oomycete literature is the relative abundance of RXLR-containing proteins, which are typically effectors: there were still quite a few (~150 instead of the ~500 seen in some Phytophthora genomes).



A second paper covers the genome of the barley powdery mildew Blumeria graminis f.sp. hordei and two close relatives: Erysiphe pisi, a pea pathogen, and Golovinomyces orontii, an Arabidopsis thaliana pathogen (Spanu et al). These are Ascomycetes in the class Leotiomycetes, where there are only a handful of genomes. Overall this paper tells a story about how obligate biotrophy has shaped the genome. What I found most striking is depicted in Figure 1. It shows that the typical genome size for (so far sampled) Pezizomycotina Ascomycetes is in the ~40-50 Mb range, whereas these powdery mildews have significantly larger genomes in the ~120-160 Mb range. These large genomes are primarily composed of Transposable Elements (TE), with ~65% of the genome containing TE. However the protein coding gene content is still only on the order of ~6000 genes, which is actually quite low for a filamentous Ascomycete, suggesting that despite genome expansion the functional potential shows signs of reduction. The obligate lifestyle of the powdery mildews suggested that the species had lost some autotrophic genes, and the authors further cataloged a set of ~100 genes which are missing in the mildews but are found in the core ascomycete genomes. They also document other genome cataloging results, such as only a few secondary metabolite genes, although these are typically in much higher copy numbers in other filamentous ascomycetes (e.g. Aspergillus). I still don’t have a clear picture of how this gene content differs from that of their closest sequenced neighbors, the other Leotiomycetes Botrytis cinerea and Sclerotinia sclerotiorum, which are on the order of 12-14k genes. Since the E. pisi and G. orontii data are not yet available in GenBank or at the MPI site it is hard to figure this out just yet; I presume they will be available soon.

More techie details: the authors used Sanger and second generation technologies and utilized the Celera assembler to build the assemblies from 120X coverage sequence from a hybrid of sequencing technologies. Interestingly, for the E. pisi and G. orontii assemblies the MPI site lists genome sizes closer to 65 Mb in the first drafts of the assembly with 454 data, so I guess you can see what happens when the Newbler assembler overcollapses repeats. They also used a customized automated annotation with some ab initio gene finders (not sure if there was custom training or not for the various gene finders) and estimated the coverage with the CEGMA genes. I do think a fungal-specific set of core conserved genes would be in order here as a better comparison set. Some nice data like this already exist in a few databases, but it would be interesting to see whether CEGMA represents a broad enough core set to estimate genome coverage compared with a fungal-derived CEGMA-like set.
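To make the idea of estimating coverage from a core gene set concrete, here is a minimal Python sketch of the bookkeeping only. It is not how CEGMA itself works (CEGMA searches the assembly with profile models and distinguishes complete from partial hits); the function name and the gene identifiers are made up for illustration. Swapping in a fungal-specific core list would only change the first argument.

```python
def completeness(core_genes, found_genes):
    """Return (fraction of core genes recovered, sorted list of missing core genes)."""
    core = set(core_genes)
    if not core:
        raise ValueError("core gene set is empty")
    found = core & set(found_genes)  # ignore hits that are not in the core set
    return len(found) / len(core), sorted(core - found)


# Hypothetical identifiers: four core genes, three of them recovered in an assembly.
core = ["CORE0001", "CORE0002", "CORE0003", "CORE0004"]
hits = ["CORE0001", "CORE0003", "CORE0004", "not_a_core_gene"]
fraction, missing = completeness(core, hits)
print(f"{fraction:.0%} of core genes found; missing: {missing}")
```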


A third paper in this issue covers genome evolution in the massively successful pathogen Phytophthora infestans through resequencing of six genomes of related species to track the recent evolutionary history of the pathogen (Raffaele et al). The authors used high throughput Illumina sequencing to sequence the genomes of closely related species. They found a variety of differences among genes in the pathogen; among the findings, “genes in repeat-rich regions show[ed] higher rates of structural polymorphisms and positive selection”. They found that 14% of the genes experienced positive selection, and these included many (300 out of ~800) of the annotated effector genes. P. infestans also showed high rates of change in the repeat-rich regions, which is also where a lot of the disease-implicated genes are located, supporting the hypothesis of repeat-driven expansion of the genome (as described in the 2009 genome paper). The paper generates a lot of very nice data for followup by helping to prioritize the genes with fast rates of evolution, or with profiles that suggest they have been shaped by recent adaptive evolutionary forces, as candidates for the mechanisms of pathogenicity in this devastating plant pathogen.


A fourth paper describes the genome sequencing of Sporisorium reilianum, a biotrophic pathogen that is closely related to the corn smut Ustilago maydis (Schirawski et al). Both species infect maize, but while U. maydis induces tumors in the ears, leaves, and tassels of corn, S. reilianum infection is limited to the tassels and ears. The authors used comparative biology and genome sequencing to try to tease out what genetic components may be responsible for the phenotypic differences. The comparison revealed largely syntenic genomes but also found 43 regions in U. maydis that represent highly divergent sequence between the species. These regions contained a disproportionate number of secreted proteins, indicating that these secreted proteins have been evolving at a much faster rate and that they may be important for the distinct differences in the biology. The chromosome ends of U. maydis were also found to contain up to 20 additional genes in the sub-telomeric regions that were unique to U. maydis. Another fantastic finding that this sequencing and comparison revealed concerns the history of the lack of RNAi genes in U. maydis. It was a striking feature of the 2006 genome sequence that the genome lacked a functioning copy of Dicer. However, knocking out this gene in S. reilianum failed to show a developmental or virulence phenotype, suggesting it is dispensable for those functions, so I think there will be some followups to explore (for example, do either of these species make small RNAs, and do they produce any that are translocated to the host?). The rest of the analyses covered in the manuscript identify the specific loci that differ between the two species. Interestingly, a lot of the identified loci were the same ones found as islands of secreted proteins in the first genome analysis paper, so the comparative approach was another way to get to the genes which may be important for virulence if the two organisms have different phenotypes.
This is certainly the approach that has also been taken in other plant pathogens (e.g. Mycosphaerella, Fusarium) and animal pathogens (Candida, Cryptococcus, Coccidioides), but it requires sampling species at an appropriate distance, so that the number of changes hasn’t saturated our ability to reconstruct the history at either the gene order/content or the codon level.

Without the comparison of an outgroup species it is impossible to determine whether U. maydis gained functions that relate to the phenotypes observed here, through these speculated evolutionary changes involving new genes and newly evolved functions, or whether S. reilianum lost functionality that was present in their common ancestor. However, this paper is an example of how a comparative approach can identify testable hypotheses for the origins of pathogenicity genes.


Hope everyone has a chance to enjoy the holidays and to unwrap and spend some time looking at these and other science gems over the coming weeks.


Baxter, L., Tripathy, S., Ishaque, N., Boot, N., Cabral, A., Kemen, E., Thines, M., Ah-Fong, A., Anderson, R., Badejoko, W., Bittner-Eddy, P., Boore, J., Chibucos, M., Coates, M., Dehal, P., Delehaunty, K., Dong, S., Downton, P., Dumas, B., Fabro, G., Fronick, C., Fuerstenberg, S., Fulton, L., Gaulin, E., Govers, F., Hughes, L., Humphray, S., Jiang, R., Judelson, H., Kamoun, S., Kyung, K., Meijer, H., Minx, P., Morris, P., Nelson, J., Phuntumart, V., Qutob, D., Rehmany, A., Rougon-Cardoso, A., Ryden, P., Torto-Alalibo, T., Studholme, D., Wang, Y., Win, J., Wood, J., Clifton, S., Rogers, J., Van den Ackerveken, G., Jones, J., McDowell, J., Beynon, J., & Tyler, B. (2010). Signatures of Adaptation to Obligate Biotrophy in the Hyaloperonospora arabidopsidis Genome Science, 330 (6010), 1549-1551 DOI: 10.1126/science.1195203

Spanu, P., Abbott, J., Amselem, J., Burgis, T., Soanes, D., Stuber, K., Loren van Themaat, E., Brown, J., Butcher, S., Gurr, S., Lebrun, M., Ridout, C., Schulze-Lefert, P., Talbot, N., Ahmadinejad, N., Ametz, C., Barton, G., Benjdia, M., Bidzinski, P., Bindschedler, L., Both, M., Brewer, M., Cadle-Davidson, L., Cadle-Davidson, M., Collemare, J., Cramer, R., Frenkel, O., Godfrey, D., Harriman, J., Hoede, C., King, B., Klages, S., Kleemann, J., Knoll, D., Koti, P., Kreplak, J., Lopez-Ruiz, F., Lu, X., Maekawa, T., Mahanil, S., Micali, C., Milgroom, M., Montana, G., Noir, S., O’Connell, R., Oberhaensli, S., Parlange, F., Pedersen, C., Quesneville, H., Reinhardt, R., Rott, M., Sacristan, S., Schmidt, S., Schon, M., Skamnioti, P., Sommer, H., Stephens, A., Takahara, H., Thordal-Christensen, H., Vigouroux, M., Wessling, R., Wicker, T., & Panstruga, R. (2010). Genome Expansion and Gene Loss in Powdery Mildew Fungi Reveal Tradeoffs in Extreme Parasitism Science, 330 (6010), 1543-1546 DOI: 10.1126/science.1194573

Raffaele, S., Farrer, R., Cano, L., Studholme, D., MacLean, D., Thines, M., Jiang, R., Zody, M., Kunjeti, S., Donofrio, N., Meyers, B., Nusbaum, C., & Kamoun, S. (2010). Genome Evolution Following Host Jumps in the Irish Potato Famine Pathogen Lineage Science, 330 (6010), 1540-1543 DOI: 10.1126/science.1193070

Schirawski, J., Mannhaupt, G., Munch, K., Brefort, T., Schipper, K., Doehlemann, G., Di Stasio, M., Rossel, N., Mendoza-Mendoza, A., Pester, D., Muller, O., Winterberg, B., Meyer, E., Ghareeb, H., Wollenberg, T., Munsterkotter, M., Wong, P., Walter, M., Stukenbrock, E., Guldener, U., & Kahmann, R. (2010). Pathogenicity Determinants in Smut Fungi Revealed by Genome Comparison Science, 330 (6010), 1546-1548 DOI: 10.1126/science.1195330

Distribution of fungal ITS sequences in GenBank

As part of background in preparing a grant I ended up writing a few scripts to see the distribution of fungal species with ITS data in GenBank.  The whole spreadsheet of the data is public and available here and I walk you through the data generation and summary below.

ITS (Internal Transcribed Spacer) is the typically used barcode for identifying fungi at the species level as it works for most (but not all) groups of Fungi. It falls between the highly conserved nuclear rDNA genes (18S, 5.8S, 28S) but tends to be hypervariable, making it a reasonable locus for identification of species, since it tends to be unique between species but fairly unchanged among individuals from the same species. You can see a map of the amplified region at Tom Bruns’s site or info at Rytas Vilgalys’s site, among others.

The script to extract these and dump the numbers from GenBank uses Perl and BioPerl, and the results are plotted in a Google Docs table. I queried for all ITS sequences with a pretty simple query. Some people use a better, more thorough query to get the list of GIs, so I separated the GI query from the statistics about taxonomy.

The GI query code uses BioPerl and queries GenBank over the web to dump out a file of GI numbers. The code is in this Perl script.
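For readers who want to see the shape of such a query without BioPerl, here is a minimal Python sketch against NCBI’s ESearch web service. It is an analogue and not the Perl script mentioned above: the function names, the page size, and the example search term are all assumptions. ESearch returns record UIDs, which for the nucleotide database have historically been GI numbers. The network function is defined but not called, and the parser is tried on a canned response with made-up IDs.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, retstart=0, retmax=500, db="nuccore"):
    """Build an ESearch URL for one page of results."""
    params = {"db": db, "term": term, "retstart": retstart, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


def parse_ids(xml_text):
    """Return (total hit count, list of IDs on this page) from ESearch XML."""
    root = ET.fromstring(xml_text)
    total = int(root.findtext("Count"))
    ids = [node.text for node in root.findall("./IdList/Id")]
    return total, ids


def fetch_all_ids(term, page_size=500):
    """Page through the results over the network; not run in this example."""
    ids, start, total = [], 0, None
    while total is None or start < total:
        with urlopen(esearch_url(term, start, page_size)) as response:
            total, page = parse_ids(response.read())
        if not page:
            break
        ids.extend(page)
        start += page_size
    return ids


# Canned response in the ESearch layout, so the parser can be tried offline.
sample = (
    "<eSearchResult><Count>2</Count><RetMax>2</RetMax><RetStart>0</RetStart>"
    "<IdList><Id>111</Id><Id>222</Id></IdList></eSearchResult>"
)
total, ids = parse_ids(sample)
print(total, ids)
```

A call such as `fetch_all_ids("Fungi[Organism] AND internal transcribed spacer[Title]")` would be one simple starting query; that term is only an illustration, and a more thorough query would catch records this one misses.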

This generates a file with GI (GenBank identifier) numbers for nucleotide records. This is not cleaned up to remove problematic seqs, but since we’re interested in overall statistics, I don’t think it is that important if there are some records with problems. You might want to do some cleanup of these data and expand the query before using it as a reference ITS database for your BLAST queries. See tools built by Henrik Nilsson and others, like Emerencia, for some of the cleanup and detection of problems with an ITS dataset like this.

But given a list of GIs from any query (in our case, of ITS sequences), what is the distribution of taxa (based on what is specified by the submitter, which is not always correct!)? Of course some aren’t specified to the species level or even to the genus level, so the code has to be smart enough to put those in a different category. But of those specified to a particular taxonomic level, what are they? This script tallies the information about the phyla and genera and dumps them out. It takes a while to run the first time because it must build a database of all the GI to taxon record links (the gi_taxid_nucl.dmp file from NCBI taxonomy), so be prepared to wait a while and dedicate several dozen gigabytes to get this all working the first time.
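The tallying step can be sketched in a few lines of Python. This is an illustration and not the script described above: it assumes the GI-to-taxon links and the lineage of each taxon have already been loaded into two dictionaries, and the names `tally_taxa` and `UNASSIGNED`, the taxon IDs, and the toy records are all made up. In practice the GI-to-taxon table is far too large for a plain dictionary and would sit in an on-disk database.

```python
from collections import Counter

UNASSIGNED = "unassigned"  # bucket for records lacking a phylum or genus


def tally_taxa(gis, gi_to_taxid, taxid_to_lineage):
    """Count records per phylum and per genus.

    taxid_to_lineage maps a taxon ID to a (phylum, genus) tuple in which
    either element may be None when the record is not resolved that far.
    """
    phyla, genera = Counter(), Counter()
    for gi in gis:
        taxid = gi_to_taxid.get(gi)
        phylum, genus = taxid_to_lineage.get(taxid, (None, None))
        phyla[phylum or UNASSIGNED] += 1
        genera[genus or UNASSIGNED] += 1
    return phyla, genera


# Toy data with invented GI and taxon IDs.
gi_to_taxid = {"1": 10, "2": 10, "3": 20, "4": 30}
taxid_to_lineage = {
    10: ("Ascomycota", "Fusarium"),
    20: ("Basidiomycota", "Amanita"),
    30: (None, None),  # e.g. an environmental sample with no genus assigned
}
phyla, genera = tally_taxa(["1", "2", "3", "4", "5"], gi_to_taxid, taxid_to_lineage)
for genus, count in genera.most_common():
    print(genus, count)
```

Keeping the unassigned records in their own bucket, rather than dropping them, makes it easy to see how much of the dataset cannot be placed at a given rank.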

So what is the most abundant deposited genus? Well, according to this analysis it is Fusarium, members of which are found everywhere, especially in soil. This distribution may have much more to do with the types of places being sampled and the types of questions researchers are working on than with relative abundance worldwide, so take it as an interesting observation of what is in the databases! Only in particular environments with dedicated studies of fungal species (for example, the indoor environment, a particular area of a forest, fungi associated with trees in urban and rural environments, or one of many other studies not mentioned) can we really say something. It is also important to note that massively parallel sequencing studies using 454 are coming online and are not necessarily being deposited directly into this particular database at GenBank. These numbers mainly represent the Sanger clone-sequenced data from years past, but it will be a whole new ball game in the next few years as studies start using 454 sequencing as the primary means to identify community structure.










So who is generating all that data? Well, I wrote another version of the script which dumps out the authors for records from a particular taxon by querying the author field of all the GenBank records that came from that taxon.
The data are in this spreadsheet.

So, with a few bits of code using queries of GenBank and BioPerl to link things together, I hope you get some sense of what is out there and can maybe think of interesting variations on this theme to address other data mining questions.

White nose syndrome genome released

The Broad Institute released their sequence of the genome of Geomyces destructans, implicated in the White Nose Syndrome that is causing massive die-offs of bats. The researchers sequenced a strain isolated in North America, which is part of an epidemic spreading across the Northeastern United States. This is just the assembly of the genome, not the annotation, which will be forthcoming.

Genome sequence of mushroom Schizophyllum commune

I am excited to announce the publication of another mushroom genome this week. The mushroom Schizophyllum commune is an important model system for mushroom biology and development. Its genome was sequenced as part of efforts at the Joint Genome Institute with a collection of international researchers. The data and analyses from these efforts are presented in a publication appearing in Nature Biotechnology today.

Studies in mushrooms can have an important impact on other research areas. They can be useful in biotechnology as protein biosynthesis factories for producing compounds or even as an edible delivery mechanism for new drugs. What we found in the analysis of this genome includes clues, through analysis of enzyme families, to the mechanisms by which white rotting fungi degrade lignin. We also saw evidence for extensive antisense transcription during different developmental stages, suggesting some important clues as to how gene regulation could impact or control developmental progression. Through gene expression comparison (by MPSS) a large number of transcription factors were shown to be differentially regulated during sexual development. Knocking out two of these (fst3 and fst4) resulted in changes in the ability to form mushrooms (fst4) or in smaller mushrooms (fst3).

There are several more interesting findings in this work that I hope to add back to this post when there is a little more time.

Ohm, R., de Jong, J., Lugones, L., Aerts, A., Kothe, E., Stajich, J., de Vries, R., Record, E., Levasseur, A., Baker, S., Bartholomew, K., Coutinho, P., Erdmann, S., Fowler, T., Gathman, A., Lombard, V., Henrissat, B., Knabe, N., Kües, U., Lilly, W., Lindquist, E., Lucas, S., Magnuson, J., Piumi, F., Raudaskoski, M., Salamov, A., Schmutz, J., Schwarze, F., vanKuyk, P., Horton, J., Grigoriev, I., & Wösten, H. (2010). Genome sequence of the model mushroom Schizophyllum commune Nature Biotechnology DOI: 10.1038/nbt.1643

A mushroom on the cover

I’ll indulge a bit here and happily point to the cover of this week’s PNAS, with an image of fruiting Coprinopsis cinerea mushrooms referring to our article on the genome sequence of this important model fungus. You should also enjoy the commentary article from John Taylor and Chris Ellison, which provides a summary of some of the high points in the paper.


Stajich, J., Wilke, S., Ahren, D., Au, C., Birren, B., Borodovsky, M., Burns, C., Canback, B., Casselton, L., Cheng, C., Deng, J., Dietrich, F., Fargo, D., Farman, M., Gathman, A., Goldberg, J., Guigo, R., Hoegger, P., Hooker, J., Huggins, A., James, T., Kamada, T., Kilaru, S., Kodira, C., Kues, U., Kupfer, D., Kwan, H., Lomsadze, A., Li, W., Lilly, W., Ma, L., Mackey, A., Manning, G., Martin, F., Muraguchi, H., Natvig, D., Palmerini, H., Ramesh, M., Rehmeyer, C., Roe, B., Shenoy, N., Stanke, M., Ter-Hovhannisyan, V., Tunlid, A., Velagapudi, R., Vision, T., Zeng, Q., Zolan, M., & Pukkila, P. (2010). Insights into evolution of multicellular fungi from the assembled chromosomes of the mushroom Coprinopsis cinerea (Coprinus cinereus) Proceedings of the National Academy of Sciences, 107 (26), 11889-11894 DOI: 10.1073/pnas.1003391107

Where can I get orthologs?

There are several databases that include orthology prediction for fungi. These all have pros and cons. Some are more comprehensive and have many more species. Some offer curated orthology and paralogy assignments, which should be pretty stable. Some are automated, so groupings and ortholog group IDs change at each iteration.

  • A phylogenetic approach from a Saccharomyces perspective is at PhylomeDB.
  • Fungal Orthogroups is based on the Synergy algorithm from I. Wapinski, formerly of the Regev group at the Broad Institute.
  • Yeast gene order browser (YGOB) for Saccharomyces spp and CGOB for Candida spp.
  • OrthoMCL database based on whole genomes, not a ton of fungi but useful starting set.
  • Ensembl Genomes provides ortholog prediction as part of the Compara pipeline though there is a limited phylogenetic diversity in the current Ensembl Fungal genomes.
  • TreeFam has Saccharomyces cerevisiae and Schizosaccharomyces pombe as the two fungi included in the curated ortholog assignments and phylogenies.
  • SIMAP provides pre-computed similarities among all proteins in UniProt.
  • InParanoid provides a pretty comprehensive comparison of 100 available whole genomes, including many fungal genomes which I tried to help select.
  • JGI’s Mycocosm attempts to provide a fungal focused paralog/gene family look at clusters of genes based on whole genomes
  • E-Fungi is also an attempt at automated clustering with some fancy webservices logic.
  • Fungal Transcription Factor database focused just on families of transcription factors.

Some of these tools are better than others in terms of providing downloadable tables. Another problem is which identifiers are used. Many biologists use gene names or locus identifiers, not UniProt/GenPept IDs, to identify genes or proteins of interest. So tools that just cluster UniProt data aren’t as useful as those which refer to the gene or locus names. Also, providing a way to download all the data from a comparison is important for further mining and grouping of the data or cross-referencing local datasets. Plugging in gene IDs one by one does not respect the idea that your user wants to ask sophisticated queries.

Also, beware that some approaches produce lists of pairwise comparisons whereas others find orthologous groupings. So finding the Rad59 ortholog from all fungi may be easier or harder depending on the source.
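To show the difference in practice, here is a minimal Python sketch that turns a pairwise ortholog list into groups by joining any genes linked through a chain of pairs (a union-find over the pairs). This is plain single-linkage grouping, chosen only because it is short; it is not the method used by any of the databases listed above, and it will happily merge families that a proper orthology method would keep apart. The gene names are invented.

```python
def group_orthologs(pairs):
    """Group genes connected by any chain of pairwise ortholog calls."""
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving keeps trees shallow
            gene = parent[gene]
        return gene

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for gene in list(parent):
        groups.setdefault(find(gene), set()).add(gene)
    return sorted((sorted(members) for members in groups.values()), key=lambda g: g[0])


# Hypothetical pairwise calls between three species.
pairs = [
    ("Scer_A", "Spom_A"),
    ("Spom_A", "Ncra_A"),
    ("Scer_B", "Ncra_B"),
]
for group in group_orthologs(pairs):
    print(group)
```

With a pairwise-only resource you would need something like this (or many separate lookups) before you could ask for one gene’s orthologs across all fungi, whereas a grouping resource hands you the cluster directly.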

[I may make this a static page in the future to allow for more detailed updating since I know the available resources wax and wane]