Microsporidia genomes on the way

New genomes from Microsporidia are on the way from the Broad Institute and other groups, and will be a boon to those working on these fascinating creatures. Microsporidia are obligate intracellular parasites of eukaryotic cells, and many can cause serious disease in humans. Some parasitize worms and insects too. The evolutionary placement of these species within the fungi is still debated, with recent evidence placing them as derived members of the Mucormycotina based on shared synteny (conserved gene order), in particular around the mating type locus. Their highly derived characteristics and long branches make them hard to place in the fungal kingdom. The synteny-based evidence offered another route to a phylogenetic placement, but it would be helpful to have additional support in the form of further shared derived characters that group Mucormycotina and Microsporidia. There is hope that an increased number of genome sequences and phylogenomic approaches can help resolve the placement and further our understanding of the evolution of the group.

For data analysis, a new genome database for comparing these genomes, MicrosporidiaDB, is now online. This project has begun incorporating the available genomes and provides a data mining interface that extends from the EuPathDB project.

Distribution of fungal ITS sequences in GenBank

As background for a grant I was preparing, I ended up writing a few scripts to see the distribution of fungal species with ITS data in GenBank. The whole spreadsheet of the data is public and available here, and I walk you through the data generation and summary below.

ITS (Internal Transcribed Spacer) is the barcode typically used for identifying fungi at the species level, as it works for most (but not all) groups of Fungi. It falls between highly conserved nuclear rDNA genes (18S, 5.8S, 28S) but tends to be hypervariable, which makes it a reasonable locus for species identification: it tends to differ between species but stay fairly unchanged among individuals of the same species. You can see a map of the amplified region on Tom Bruns’s site, or more information on Rytas Vilgalys’s site, among others.

The scripts that extract these sequences and dump the numbers from GenBank use Perl and BioPerl, and the results are plotted in a Google Docs table. I queried for all ITS sequences with a pretty simple query. Some people use a better, more thorough query to get the list of GIs, so I separated the GI query from the statistics about taxonomy.

The GI query code uses BioPerl and queries GenBank over the web to dump out a file of GI numbers. The code is in this Perl script.
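For readers who want to try the same kind of query without BioPerl, here is a minimal Python sketch against the NCBI E-utilities `esearch` endpoint. It is not the Perl script mentioned above. The search term shown in the comment is only an illustrative assumption, and the helper names `build_esearch_url` and `parse_id_list` are hypothetical. The sketch builds the request URL and parses the identifier list out of an `esearch` XML response; fetching the URL (and paging with `retstart`) is left to the caller.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retstart=0, retmax=500):
    """Build an esearch URL for the nucleotide database (hypothetical helper)."""
    params = {
        "db": "nuccore",
        "term": term,
        "retstart": retstart,  # offset of the first identifier to return
        "retmax": retmax,      # number of identifiers per page
    }
    return ESEARCH + "?" + urlencode(params)


def parse_id_list(xml_text):
    """Return the identifiers found in the IdList of an esearch XML response."""
    root = ET.fromstring(xml_text)
    return [elem.text for elem in root.findall("./IdList/Id")]


# Example term (an assumption, not the query used for the spreadsheet):
#   'Fungi[Organism] AND "internal transcribed spacer"[Title]'
```

Keeping the URL building separate from the parsing means the query can be swapped for a more thorough one without touching the rest of the pipeline, which is the same split described above.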

This generates a file of GI (GenBank identifier) numbers for nucleotide records. The list is not cleaned up to remove problematic sequences, but since we’re interested in overall statistics, I don’t think it is that important if some records have problems. You might want to do some cleanup of these data and expand the query before using it as a reference ITS database for your BLAST queries. See tools built by Henrik Nilsson and others, such as Emerencia, for some of the cleanup and detection of problems in an ITS dataset like this.

But given a list of GIs from any query (in our case, of ITS sequences), what is the distribution of taxa, based on what the submitter specified (which is not always correct)? Of course some records aren’t identified to the species level, or even to the genus level, so the code has to be smart enough to put those in a separate category. But of those identified to a particular taxonomic level, what are they? This script tallies the phylum and genus information and dumps it out. It takes a while to run the first time because it must build a database of all the GI-to-taxon links (the gi_taxid_nucl.dmp file from NCBI Taxonomy), so be prepared to wait and to dedicate several dozen gigabytes to get this all working the first time.
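The tallying step can be sketched in a few lines of Python. This is not the Perl script itself, just a minimal illustration under some assumptions: the GI-to-taxon dump is taken to be two whitespace-separated columns (GI, then taxon ID), each record's lineage is assumed to have been resolved already into a dict of rank to name, and the names `gi_to_taxid`, `tally`, and `UNRESOLVED` are hypothetical.

```python
from collections import Counter

# Bucket for records not identified at the requested rank.
UNRESOLVED = "unclassified"


def gi_to_taxid(lines, wanted):
    """Stream a GI/taxon-ID dump and keep only the GIs in `wanted`.

    Filtering while streaming avoids loading the whole dump into memory.
    """
    mapping = {}
    for line in lines:
        gi, taxid = line.split()
        if gi in wanted:
            mapping[gi] = taxid
    return mapping


def tally(lineages, rank):
    """Count records per taxon at `rank`, with a bucket for unresolved ones."""
    return Counter(lineage.get(rank) or UNRESOLVED for lineage in lineages)


# Hypothetical, already-resolved lineages for four records.
records = [
    {"phylum": "Ascomycota", "genus": "Fusarium"},
    {"phylum": "Ascomycota", "genus": "Fusarium"},
    {"phylum": "Basidiomycota"},  # identified to phylum only
    {},                           # no usable taxonomy
]
genus_counts = tally(records, "genus")
# genus_counts["Fusarium"] == 2 and genus_counts["unclassified"] == 2
```

Turning a taxon ID into a lineage (for example from the NCBI Taxonomy node and name tables) is deliberately left out here; that lookup is the part that needs the large local database.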

So what is the most abundant deposited genus? According to this analysis it is Fusarium, whose members are found everywhere, especially in soil. This distribution may have much more to do with the types of places being sampled and the types of questions researchers are working on than with relative abundance worldwide, so take it as an interesting observation of what is in the databases! Only in particular environments with dedicated studies of fungal species (for example, the indoor environment, a particular area of a forest, fungi associated with trees in urban and rural environments, or one of many other studies not mentioned) can we really say something. It is also important to note that massively parallel sequencing studies using 454 are coming online, and their data are not necessarily being deposited directly into this particular database at GenBank. These numbers mainly represent the Sanger clone-sequenced data of years past, but it will be a whole new ball game in the next few years as studies start using 454 sequencing as the primary means of identifying community structure.

Click on the image to see this in the Google Docs spreadsheet.

So who is generating all that data? I wrote another version of the script that dumps out the authors for a particular taxon by reading the author field of every GenBank record that came from that taxon.
The data are in this spreadsheet.

So there you have a few bits of code that use GenBank queries and BioPerl to link things together. I hope you get some sense of what is out there and can think of interesting variations on this theme to address other data mining questions.

A mushroom on the cover

I’ll indulge a bit here and happily point to the cover of this week’s PNAS, which shows an image of Coprinopsis cinerea mushrooms fruiting and refers to our article on the genome sequence of this important model fungus. You should also enjoy the commentary article from John Taylor and Chris Ellison, which summarizes some of the high points in the paper.

Coprinopsis cover

Stajich, J., Wilke, S., Ahren, D., Au, C., Birren, B., Borodovsky, M., Burns, C., Canback, B., Casselton, L., Cheng, C., Deng, J., Dietrich, F., Fargo, D., Farman, M., Gathman, A., Goldberg, J., Guigo, R., Hoegger, P., Hooker, J., Huggins, A., James, T., Kamada, T., Kilaru, S., Kodira, C., Kues, U., Kupfer, D., Kwan, H., Lomsadze, A., Li, W., Lilly, W., Ma, L., Mackey, A., Manning, G., Martin, F., Muraguchi, H., Natvig, D., Palmerini, H., Ramesh, M., Rehmeyer, C., Roe, B., Shenoy, N., Stanke, M., Ter-Hovhannisyan, V., Tunlid, A., Velagapudi, R., Vision, T., Zeng, Q., Zolan, M., & Pukkila, P. (2010). Insights into evolution of multicellular fungi from the assembled chromosomes of the mushroom Coprinopsis cinerea (Coprinus cinereus). Proceedings of the National Academy of Sciences, 107 (26), 11889-11894. DOI: 10.1073/pnas.1003391107

Aspergillus has a posse

Shepard Fairey has gotten a lot of notice lately for his Obama art, which has been replicated pretty much everywhere. I mocked up an homage to his earlier street art; here we’ll discuss the growing Aspergillus genome posse.

Work mainly from the JCVI, Broad Institute, JGI, NITE, and Sanger centre has generated an excellent collection of genome sequences for the Eurotiales clade (feel free to get a login for the wiki and add others that are missing). The Aspergillus community now has an Aspergillus Genome Database (AGD) project that includes a curator of genome annotation (they are hiring) and presumably of literature, in the SGD and CGD model of curation.

I think a lot of other projects have a posse too (or maybe just a loosely organized band): a community of people working on related species and willing to coordinate with each other. As these sorts of “clade” databases start to develop, we will have better clusters of information that can be mapped among multiple species.

Eventually I hope this will spur efforts toward more coordinated genome databases for comparative genomics and for the transfer of known gene and functional information between experimental systems. These efforts really require coordination or centralization of the data so that gene models can be updated along with orthologs and phylogenomic inferences of function.

First release of N. tetrasperma and N. discreta

The JGI, in collaboration with our lab at Berkeley, has released the Neurospora tetrasperma (mat A) and N. discreta (mat A) genome sequences and annotation after about two years of work. These two species are closely related to the well-studied laboratory workhorse Neurospora crassa.

The N. tetrasperma assembly (8X) has an N50 of 976 kb and is highly colinear with the N. crassa genome. With the JGI, we’ve also done some additional 454 sequencing, which will give an improved assembly at 23X coverage in the next release. We also did some comparative scaffolding and can basically double that N50, and most of the result looks good when compared to the improved V2 assembly.

The N. discreta assembly (8X) is also quite good, with an N50 of 2.3 Mb. For comparison, V7 of N. crassa has an N50 of 664 kb, although with genetic map information the 250+ contigs can be scaffolded into 7 chromosomes with 146 unmapped contigs.
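Since N50 carries the comparison here, a short Python sketch of the usual definition may help: sort the contig (or scaffold) lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. The function name `n50` and the toy lengths are illustrative only, not taken from these assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half the
        # assembly size defines the N50.
        if running >= half_total:
            return length
    raise ValueError("n50() needs at least one length")


print(n50([2, 3, 4, 5, 6]))  # prints 5: 6 + 5 = 11 reaches half of 20
```

A larger N50 means that half of the assembled bases sit in longer pieces, which is why scaffolding with a genetic map or a related genome can raise it without adding any new sequence.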

Both the N. discreta and N. tetrasperma genomes contain about 10k predicted genes, similar to counts in other related species like N. crassa and Podospora anserina.

We’re finalizing several analyses to present at the Asilomar meeting to describe these Neurospora genomes and comparisons with other Sordariomycete species.

A few tool updates

I’m working to make more data available in the genome browsers for fungi. One addition is the primer information from the Neurospora KO project, which is going into the Neurospora browser to show the position and primer sequences for all the gene knockouts that are being (or have already been) constructed. At least 60% of the genes have been knocked out, and the strains are available from the FGSC.

We’re also integrating SNP data using the HapMap glyphs; you can see one way to view this information in the Genome Browser for Coccidioides. We are working on other data on our development server, including PhastCons conservation profiles, and hope to make it public soon.

Coprinopsis cinerea genome annotation updated

The Broad Institute, in collaboration with many of the Coprinopsis cinerea (Coprinus cinereus) community of researchers, has updated the genome annotation for C. cinerea with additional gene calls based on ESTs and improved gene callers. The annotation was made on the 13-chromosome assembly produced by the SEMO fungal biology group and collaborators across the globe, including a BAC map from H. Muraguchi. Thanks to Jonathan Goldberg and colleagues at the Broad Institute for getting this updated annotation out the door.

This updated annotation joins and splits several sets of genes, and the gene count sits at just under 14k genes in this 36 Mb genome. There are a couple of hiccups in the GTF and genome contig/supercontig file naming that I am told will be fixed by early next week. Additional work by the Broad team to annotate the “kinome” provides some promising new insight into this genome annotation as well.

We’re using this updated genome assembly to address questions about the evolution of genome structure by studying syntenic conservation and aspects of crossing-over points during meiosis. The C. cinerea system has long been used as a model for fungal development and morphogenesis of mushrooms, as it is straightforward to induce mushroom fruiting in the laboratory. It is also a model for studying meiosis because meiosis is synchronized in the cells in the cap of the mushroom.

Happy genome shrooming.

Lest you think annotation is easy

Ewan Birney and Ensembl (the other, or original, genome browser, depending on whether you are a UCSC junkie) have started blogging a bit more about what is going on under the proverbial hood over there in Hinxton. There are some great nuggets about some of the current problems. These bite-sized comments should give a great glimpse into what is going on without drowning you in the deluge that is ensembl-dev.

A recent post covers the challenges of coordinating “manual” and “automated” annotation of gene structure between groups at the same institution.

Scale that up across multiple genomes, genome centers, and varying quality of prediction programs and assemblies, and you can see why the fungal genome comparisons could use a little bit more help. It is great to hear what the animal genome annotation groups are doing to solve informatics challenges and issues of data management and coordination. I’m a big fan of more informatics+science in the open where it is feasible.

(re)Annotating GenBank

Tom Bruns, Martin Bidartondo and 250 others sent a letter to Science describing the current problems with fixing annotation in GenBank. There is an entertaining accompanying news article that interviews several people about the problem of updating the annotation and species assigned to sequences in the database. The particular problem for mycologists is that many fungi found through metagenomic approaches are identified only by their molecular sequences, so having the wrong species associated with a sequence causes difficulties when studying community composition. This problem is not limited to fungi by any means, but recent reports find that as many as 20% of fungal Internal Transcribed Spacer (ITS) sequences are attributed to the wrong species.

There’s a nice quote in the news article from Steven Salzberg talking about the difficulties in getting sequences, especially from big centers, updated. I’m sure he is thinking of many examples, like reclassifying some Drosophila sequence traces.

Some links

I’ve been too busy to post much these last few days, but here are a few links to some papers I found interesting in my recent browsing.

Schmitt, I., Partida-Martinez, L.P., Winkler, R., Voigt, K., Einax, E., Dölz, F., Telle, S., Wöstemeyer, J., Hertweck, C. (2008). Evolution of host resistance in a toxin-producing bacterial–fungal alliance. The ISME Journal DOI: 10.1038/ismej.2008.19

Levasseur, A. (2008). FOLy: an integrated database for the classification and functional annotation of fungal oxidoreductases potentially involved in the degradation of lignin and related aromatic compounds. Fungal Genetics and Biology DOI: 10.1016/j.fgb.2008.01.004

Shivaji, S., Bhadra, B., Rao, R.S., Pradhan, S. (2008). Rhodotorula himalayensis sp. nov., a novel psychrophilic yeast isolated from Roopkund Lake of the Himalayan mountain ranges, India. Extremophiles DOI: 10.1007/s00792-008-0144-z