The RDP database has released its LSU classifier, trained on fungal 28S sequences. It uses a naive Bayesian classifier, so it can help assign sequences to clades even when taxon-specific matches can’t be made. It would be great to hear how this is working for different groups!
As part of the background work for a grant, I ended up writing a few scripts to look at the distribution of fungal species with ITS data in GenBank. The whole spreadsheet of the data is public and available here, and I walk through the data generation and summary below.
ITS (Internal Transcribed Spacer) is the barcode typically used for identifying fungi at the species level, as it works for most (but not all) groups of Fungi. It falls between the highly conserved nuclear rDNA genes (18S, 5.8S, 28S) but tends to be hypervariable, which makes it a reasonable locus for species identification: it tends to differ between species but stay fairly unchanged among individuals of the same species. You can see a map of the amplified region at Tom Bruns’s site, or more info at Rytas Vilgalys’s site, among others.
The scripts that extract these records and dump the numbers from GenBank use Perl and BioPerl, and the results are plotted in a Google Docs table. I queried for all ITS sequences with a pretty simple query. Some people use a better, more thorough query to get the list of GIs, so I separated the GI query from the statistics about taxonomy.
The GI query code uses BioPerl and queries GenBank over the web to dump out a file of GI numbers. The code is in this Perl script.
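For readers who don’t have the Perl script handy, here is a rough Python sketch of the same idea using only the standard library and NCBI’s E-utilities `esearch` endpoint. It is not the script used for the spreadsheet: the search term is an illustrative assumption, and the function names (`build_esearch_url`, `fetch_ids`, `write_ids`) are made up for this example. For the nucleotide database the returned IDs are GI numbers.

```python
import json
import urllib.parse
import urllib.request

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Illustrative search term only (an assumption, not the original query).
DEFAULT_TERM = 'Fungi[Organism] AND "internal transcribed spacer"[All Fields]'


def build_esearch_url(term=DEFAULT_TERM, retstart=0, retmax=500):
    """Build an esearch URL that returns one page of nucleotide IDs as JSON."""
    params = {
        "db": "nuccore",
        "term": term,
        "retstart": retstart,
        "retmax": retmax,
        "retmode": "json",
    }
    return ESEARCH_URL + "?" + urllib.parse.urlencode(params)


def fetch_ids(term=DEFAULT_TERM, retmax=500, limit=2000):
    """Page through the search results and return up to `limit` IDs.

    Needs network access; `limit` keeps a test run small.
    """
    ids = []
    start = 0
    while start < limit:
        with urllib.request.urlopen(build_esearch_url(term, start, retmax)) as resp:
            result = json.load(resp)["esearchresult"]
        page = result["idlist"]
        ids.extend(page)
        start += retmax
        # Stop when a page comes back empty or we have walked past the total.
        if not page or start >= int(result["count"]):
            break
    return ids[:limit]


def write_ids(ids, path):
    """Write one ID per line, the same shape as a GI list file."""
    with open(path, "w") as out:
        for uid in ids:
            out.write(f"{uid}\n")
```

Keeping the query in one small function makes it easy to swap in a more thorough search term later without touching the taxonomy statistics.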
This generates a file of GI (GenBank identifier) numbers for nucleotide records. It is not cleaned up to remove problematic sequences, but since we’re interested in overall statistics, I don’t think it matters much if a few records have problems. You might want to do some cleanup of these data and expand the query before using them as a reference ITS database for your BLAST queries. See tools built by Henrik Nilsson and others, like Emerencia, for some of the cleanup and detection of problems in an ITS dataset like this.
But given a list of GIs from any query (in our case, ITS sequences), what is the distribution of taxa, based on what the submitter specified (which is not always correct!)? Of course some records aren’t specified to the species level, or even to the genus level, so the code has to be smart enough to put those in a separate category. But of those specified to a particular taxonomic level, what are they? This script tallies the phylum and genus information and dumps it out. It takes a while to run the first time because it must build a database of all the GI-to-taxon links (the gi_taxid_nucl.dmp file from NCBI Taxonomy), so be prepared to wait a while and dedicate several dozen gigabytes to get this all working the first time.
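The tallying logic can be sketched in a few lines of Python. This is not the Perl script itself, just an illustration of the idea. It assumes the NCBI Taxonomy dump layout (pipe-delimited `nodes.dmp` and `names.dmp`, plus a two-column GI-to-taxid file), and all function names and the toy records below are invented for the example.

```python
from collections import Counter


def load_gi_to_taxid(lines, wanted):
    """Read 'gi<TAB>taxid' lines, keeping only the GIs we asked about."""
    mapping = {}
    for line in lines:
        gi, taxid = line.split()
        if int(gi) in wanted:
            mapping[int(gi)] = int(taxid)
    return mapping


def parse_nodes(lines):
    """Parse nodes.dmp-style lines into {taxid: (parent_taxid, rank)}."""
    nodes = {}
    for line in lines:
        fields = [f.strip() for f in line.split("|")]
        nodes[int(fields[0])] = (int(fields[1]), fields[2])
    return nodes


def parse_names(lines):
    """Parse names.dmp-style lines into {taxid: scientific name}."""
    names = {}
    for line in lines:
        fields = [f.strip() for f in line.split("|")]
        if fields[3] == "scientific name":
            names[int(fields[0])] = fields[1]
    return names


def name_at_rank(taxid, rank, nodes, names):
    """Walk up the tree from taxid; return the ancestor name at `rank`, or None."""
    seen = set()
    while taxid in nodes and taxid not in seen:
        seen.add(taxid)  # the root is its own parent, so guard against looping
        parent, node_rank = nodes[taxid]
        if node_rank == rank:
            return names.get(taxid)
        taxid = parent
    return None


def tally(gi_list, gi_to_taxid, nodes, names, rank):
    """Count records per taxon at `rank`; anything unresolved is 'unassigned'."""
    counts = Counter()
    for gi in gi_list:
        taxid = gi_to_taxid.get(gi)
        label = name_at_rank(taxid, rank, nodes, names) if taxid is not None else None
        counts[label or "unassigned"] += 1
    return counts


# Toy data with made-up GI numbers and taxids, for illustration only.
TOY_GIS = [101, 102, 103, 104, 105]
TOY_GI_LINES = ["101\t30", "102\t30", "103\t40", "104\t999", "200\t30"]
TOY_NODES = [
    "1\t|\t1\t|\tno rank\t|",
    "10\t|\t1\t|\tphylum\t|",
    "20\t|\t10\t|\tgenus\t|",
    "30\t|\t20\t|\tspecies\t|",
    "40\t|\t10\t|\tno rank\t|",
]
TOY_NAMES = [
    "10\t|\tAscomycota\t|\t\t|\tscientific name\t|",
    "20\t|\tFusarium\t|\t\t|\tscientific name\t|",
    "30\t|\tFusarium oxysporum\t|\t\t|\tscientific name\t|",
    "40\t|\tuncultured Ascomycota\t|\t\t|\tscientific name\t|",
]
```

Filtering the GI-to-taxid file down to the GIs of interest while reading it is one way to avoid indexing the whole file, at the cost of rereading it for each new GI list. Whether that beats building the full database once depends on how often you rerun the analysis.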
So what is the most abundant deposited genus? According to this analysis it is Fusarium, whose species are found everywhere, especially in soil. This distribution may have much more to do with the kinds of places being sampled and the kinds of questions researchers are working on than with relative abundance worldwide, so take it as an interesting observation of what is in the databases! Only for particular environments with dedicated studies of fungal species (for example, the indoor environment, a particular area of a forest, fungi associated with trees in urban and rural environments, or one of many other studies not mentioned) can we really say something. It is also important to note that massively parallel sequencing studies using 454 are coming online, and their data are not necessarily being deposited directly into this particular database at GenBank. These numbers mainly represent the Sanger clone-sequenced data from years past, but it will be a whole new ball game in the next few years as studies start using 454 sequencing as the primary means to identify community structure.
So who is generating all that data? I wrote another version of the script that dumps out the authors for records from a particular taxon by querying the author field of every GenBank record that came from that taxon.
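As a rough illustration of the author tally (again in Python, not the BioPerl version), the sketch below pulls the `AUTHORS` lines out of GenBank flat-file text and counts each name. It assumes the standard flat-file layout, where continuation lines are indented 12 spaces, and the function names and the toy record are made up for the example.

```python
import re
from collections import Counter


def author_blocks(text):
    """Return each AUTHORS entry as one string, joining continuation lines."""
    blocks = []
    current = None
    for line in text.splitlines():
        if line.startswith("  AUTHORS"):
            current = [line[12:].strip()]
            blocks.append(current)
        elif current is not None and line.startswith(" " * 12):
            current.append(line.strip())
        else:
            current = None  # any other keyword ends the AUTHORS entry
    return [" ".join(parts) for parts in blocks]


def count_authors(text):
    """Count how many references each author appears in."""
    counts = Counter()
    for block in author_blocks(text):
        # Names look like 'Smith,J.B.' and are separated by ', ' or ' and '.
        for name in re.split(r",\s+|\s+and\s+", block):
            if name:
                counts[name] += 1
    return counts


# Toy record for illustration only; not a real GenBank entry.
SAMPLE = """\
LOCUS       TOY1
REFERENCE   1  (bases 1 to 500)
  AUTHORS   Smith,J.B., Jones,A. and
            Lee,C.
  TITLE     A toy record
REFERENCE   2  (bases 1 to 500)
  AUTHORS   Smith,J.B. and Lee,C.
  TITLE     Direct Submission
"""
```

Note that this counts an author once per reference, so a record with both a paper and a direct submission counts the same person twice. Counting once per record instead would need a set of names per record.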
The data are in this spreadsheet.
So that’s a few bits of code using GenBank queries and BioPerl to link things together. I hope you get some sense of what is out there, and maybe you can think of interesting variations on this theme to address other data mining questions.
Science has a section dedicated to Microbial Ecology, including a review of microbial biogeography that describes studying communities on the basis of trait rather than taxonomic diversity. This certainly links well with metagenomic approaches, something I’ve been thinking about more after visiting some of the folks at the Montana State Thermal Biology Institute, and with the increasingly massive datasets like those CAMERA provides.
Tom Bruns, Martin Bidartondo and 250 others sent a letter to Science describing the current problems with fixing annotation in GenBank. There is an entertaining accompanying news article that interviews several people about the problem of updating the annotation and species assigned to sequences in the database. In particular, the problem for mycologists is that many fungi found through metagenomic approaches are identified only by their molecular sequences, so having the wrong species associated with a sequence makes studying community composition difficult. This problem is not limited to fungi by any means, but recent reports find that as many as 20% of fungal Internal Transcribed Spacer (ITS) sequences are attributed to the wrong species.
There’s a nice quote in the news article from Steven Salzberg talking about the difficulties in getting sequences, especially from big centers, updated. I’m sure he is thinking of many examples, like reclassifying some Drosophila sequence traces.
An early-access article in Science, A Metagenomic Survey of Microbes in Honey Bee Colony Collapse Disorder (direct link since the DOI is not updated yet), uses the current favorite buzzword, metagenomics, of course, and describes some early work to try to discover what is killing the honeybees. It is early access and non-free, and ScienceExpress is not part of our subscription here, so I’ve not actually had a chance to read it yet, but the gist of the reporting suggests that a virus is to blame. This is in line with what Joe DeRisi and collaborators found using their Virus chip, according to some news reports earlier this year, but there is no scientific article yet to follow that up.