As background for a grant I was preparing, I ended up writing a few scripts to look at the distribution of fungal species with ITS data in GenBank. The whole spreadsheet of the data is public and available here, and I walk you through the data generation and summary below.
ITS (Internal Transcribed Spacer) is the barcode typically used for identifying fungi at the species level, as it works for most (but not all) groups of Fungi. It falls between the highly conserved nuclear rDNA genes (18S, 5.8S, 28S) but tends to be hypervariable, which makes it a reasonable locus for species identification: it tends to differ between species but stays fairly unchanged among individuals of the same species. You can see a map of the amplified region at Tom Bruns's site or info at Rytas Vilgalys's site, among others.
The scripts that extract these records and dump the numbers from GenBank use Perl and BioPerl, and the results are plotted in a Google Docs table. I queried for all ITS sequences with a pretty simple query. Some people use a better, more thorough query to get the list of GIs, so I separated the GI query from the statistics about taxonomy.
The GI query code uses BioPerl and queries GenBank over the web to dump out a file of GI numbers. The code is in this Perl script.
This generates a file of GI (GenBank identifier) numbers for nucleotide records. It is not cleaned up to remove problematic sequences, but since we're interested in overall statistics, I don't think it matters much if a few records have problems. You might want to do some cleanup of these data and expand the query before using it as a reference ITS database for your BLAST queries. See tools built by Henrik Nilsson and others, like Emerencia, for some of the cleanup and detection of problems in an ITS dataset like this.
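The GI query step can be sketched roughly as follows. This is not the Perl/BioPerl script mentioned above; it is a minimal Python version that talks to NCBI's ESearch E-utility directly, using only the standard library. The search term in ITS_QUERY is my own guess at a "pretty simple query", and the function names are made up for illustration.

```python
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# ASSUMPTION: an illustrative search term, not the query used for the post.
ITS_QUERY = '"internal transcribed spacer"[All Fields] AND Fungi[Organism]'


def build_esearch_url(term, retstart=0, retmax=500):
    """Build an ESearch URL for one page of nucleotide record IDs."""
    params = {"db": "nuccore", "term": term,
              "retstart": retstart, "retmax": retmax}
    return ESEARCH_URL + "?" + urllib.parse.urlencode(params)


def parse_esearch_xml(xml_text):
    """Return (total_hit_count, [ids on this page]) from an ESearch XML reply."""
    root = ET.fromstring(xml_text)
    total = int(root.findtext("Count"))
    ids = [el.text for el in root.findall("./IdList/Id")]
    return total, ids


def fetch_all_ids(term, batch=500, pause=0.34):
    """Page through the results and collect every ID (needs network access)."""
    ids, start, total = [], 0, None
    while total is None or start < total:
        with urllib.request.urlopen(build_esearch_url(term, start, batch)) as resp:
            total, page = parse_esearch_xml(resp.read())
        if not page:
            break  # stop if the server returns an empty page
        ids.extend(page)
        start += batch
        time.sleep(pause)  # stay under NCBI's unauthenticated request rate
    return ids
```

Calling fetch_all_ids(ITS_QUERY) and writing one ID per line would give a file comparable to the one described above. Keeping the URL building and XML parsing in separate functions means the query can be swapped for a more thorough one without touching the rest.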
But given a list of GIs from any query (in our case, ITS sequences), what is the distribution of taxa, based on what the submitter specified, which is not always correct? Of course some records aren't identified to the species level or even to the genus level, so the code has to be smart enough to put those in a different category. But of those identified to a particular taxonomic level, what are they? This script tallies the phylum and genus information and dumps it out. It takes a while to run the first time because it must build a database of all the GI-to-taxon record links (the gi_taxid_nucl.dmp file from NCBI taxonomy), so be prepared to wait a while and dedicate several dozen gigabytes to get this all working the first time.
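To make the tallying step concrete, here is a small Python sketch of the idea. It is not the Perl script described above. It assumes the GI-to-taxon dump is a two-column, whitespace-separated file of GI and taxon ID, and that the lineage comes from the nodes.dmp and names.dmp files in the NCBI taxonomy dump. Instead of building an on-disk database it keeps only the GIs of interest in memory, which is a different trade-off from the one the script makes. All function names here are hypothetical.

```python
from collections import Counter


def parse_dmp_line(line):
    """Split one taxonomy .dmp line; fields are delimited by tab-pipe-tab."""
    return [f.strip() for f in line.rstrip("\n").rstrip("|").split("|")]


def load_nodes(lines):
    """nodes.dmp: taxid | parent taxid | rank | ... -> two dicts keyed by taxid."""
    parent, rank = {}, {}
    for line in lines:
        f = parse_dmp_line(line)
        parent[f[0]], rank[f[0]] = f[1], f[2]
    return parent, rank


def load_scientific_names(lines):
    """names.dmp: taxid | name | unique name | name class -> {taxid: name}."""
    names = {}
    for line in lines:
        f = parse_dmp_line(line)
        if f[3] == "scientific name":
            names[f[0]] = f[1]
    return names


def load_gi_taxid(lines, wanted):
    """Keep only the GI -> taxid pairs for the GIs we actually queried."""
    wanted = set(wanted)
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2 and parts[0] in wanted:
            mapping[parts[0]] = parts[1]
    return mapping


def lineage_ranks(taxid, parent, rank, names):
    """Walk from taxid up to the root, returning {rank: name} for ranked nodes."""
    found, seen = {}, set()
    while taxid in parent and taxid not in seen:
        seen.add(taxid)  # the root is its own parent, so guard against looping
        if rank[taxid] != "no rank":
            found[rank[taxid]] = names.get(taxid, taxid)
        taxid = parent[taxid]
    return found


def tally(gis, gi_to_taxid, parent, rank, names, levels=("phylum", "genus")):
    """Count records per name at each level; unresolved ones go to 'unassigned'."""
    counts = {lvl: Counter() for lvl in levels}
    for gi in gis:
        taxid = gi_to_taxid.get(gi)
        ranks = lineage_ranks(taxid, parent, rank, names) if taxid else {}
        for lvl in levels:
            counts[lvl][ranks.get(lvl, "unassigned")] += 1
    return counts
```

Records with no genus or phylum in their lineage (an environmental sample identified only as a fungus, say) land in the "unassigned" bucket, which is the "different category" mentioned above.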
So what is the most abundant deposited genus? Well, according to this analysis it is Fusarium, whose members are found everywhere, especially in soil. This distribution may have much more to do with the types of places being sampled and the types of questions researchers are working on than with relative abundance worldwide, so take it as an interesting observation of what is in the databases! Only in particular environments with studies dedicated to fungal species (for example, the indoor environment, a particular area of a forest, fungi associated with trees in urban and rural environments, or one of many other studies not mentioned) can we really say something. It is also important to note that massively parallel sequencing studies using 454 are coming online and are not necessarily being deposited directly into this particular database at GenBank. These numbers mainly represent the Sanger clone-sequenced data from years past, but it will be a whole new ball game in the next few years as studies start using 454 sequencing as the primary means to identify community structure.
So who is generating all that data? Well, I wrote another version of the script which dumps out the authors for records from a particular taxon by querying the author field of every GenBank record that came from that taxon.
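As a rough illustration of the author tally, here is a Python sketch that works on GenBank flat-file text that has already been downloaded for the records of one taxon. This is my own simplification, not the BioPerl script: it reads the AUTHORS lines directly from the flat file, and it counts each author once per record, since the same names often appear in both the paper reference and the direct submission. The function names are hypothetical.

```python
import re
from collections import Counter

# Authors in a flat file look like "Smith,J.A., Jones,B. and Lee,C.":
# separated by comma-plus-space, with "and" before the last name.
AUTHOR_SPLIT = re.compile(r",\s+|\s+and\s+")


def authors_per_record(flatfile_text):
    """Yield the set of author names found in each flat-file record."""
    for record in flatfile_text.split("\n//"):  # '//' ends each record
        blocks, current = [], None
        for line in record.splitlines():
            if line.startswith("  AUTHORS"):
                current = [line[12:].strip()]  # field text starts at column 13
                blocks.append(current)
            elif current is not None and line.startswith(" " * 12):
                current.append(line.strip())  # continuation of the author list
            else:
                current = None  # any other line ends the AUTHORS block
        names = set()
        for block in blocks:
            names.update(n for n in AUTHOR_SPLIT.split(" ".join(block)) if n)
        if names:
            yield names


def tally_authors(flatfile_text):
    """Count how many records each author appears on."""
    counts = Counter()
    for names in authors_per_record(flatfile_text):
        counts.update(names)
    return counts
```

Calling tally_authors(text).most_common(20) would then list the most frequent depositors for that taxon. Name variants (initials given differently across submissions) are not merged here, so the counts are only approximate.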
The data are in this spreadsheet.
So that's a few bits of code using GenBank queries and BioPerl to link things together. I hope it gives you some sense of what is out there, and maybe you can think of interesting variations on this theme to address other data mining questions.