
Annotating genes in FungiDB

As part of our newly launched FungiDB, I wanted to demonstrate how to attach additional annotation to gene records.  This important feature allows the community to contribute to the knowledge base about an organism’s genes and gene functions, and it is open to anyone who has registered an account.  Once entered, the information is also searchable, so added gene names immediately become part of the searchable text across the entire database.

We encourage you to register for an account (which allows you to also save your strategies and queries) and also to provide feedback about the look and feel (interface), the datasets that are provided, the need for additional species, and the capabilities and tools in the system.

Linking to a gene page
Here is a record for the C. neoformans var. grubii gene SXI1.  If you look at that URL you can see I am using one of the built-in shortcut URLs for EuPathDB/FungiDB, which is http://fungidb.org/gene/GENENAME.  Since names need to be unique across the whole database, we have chosen to prefix names with the species and strain prefix, separated by two underscores.  So the link is http://fungidb.org/gene/CneoH99__CNAG_06814, which gets expanded into a full record URL that is much longer, as you’ll see when you click on it.  This redirection lets the full URL indicate that we’re talking to the beta release version of the database at this time, and it leaves room for separate beta and production servers, or even frozen release servers, in the future.
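The shortcut pattern can be sketched as a tiny shell helper. This is only an illustration: the function name gene_url is made up here, and the prefix and locus ID are the ones from the SXI1 example.

```shell
# Hypothetical helper (not part of FungiDB): build a shortcut URL from a
# species/strain prefix and a locus ID, joined by two underscores.
gene_url() {
  printf 'http://fungidb.org/gene/%s__%s\n' "$1" "$2"
}

# Example using the SXI1 record from this post
gene_url CneoH99 CNAG_06814
```

Other genes should follow the same pattern with their own strain prefix and locus ID.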

[Figure: Screenshot of the SXI1 page on FungiDB]

Comment pages
Comment pages provide a way for the community to add information to a gene page. This can be as simple as the assigned gene name, a link to a publication, the GenBank accession of the sequence, or descriptive text about a mutant phenotype. Comments can also note corrections or additions to the gene model. Currently the EuPathDB system does not allow direct manipulation of the gene model from the page, as these types of updates need vetting and technical work to re-analyze the updated protein sequence in the downstream analyses (e.g. Pfam, InterPro, homology assignments, etc). For SXI1 there is a comment which links the gene to several publications and submitted accession numbers.  Since multiple strains have been studied for this gene, but not all are part of FungiDB, direct links to the sequence variants may be useful for someone wanting to do follow-up work on the gene and its key residues.

[Figure: How a comment looks on a gene page when it is entered]
[Figure: A comment page showing the linked PubMed records and GenBank accessions]

A HyphalTip: Get a bunch of SRA data

[Trying to post some simple tips from time-to-time, they’ll be in this same category]

For those wanting to get a large dataset from the NCBI SRA trace archive, clicking on each link and waiting for the download can be annoying. For example, if you read a recent paper on Population Genomics in Neurospora crassa and saw the 48 RNA-Seq datasets under accession number SRP004848, you might want to download all of these data for your own re-analysis, as a great dataset for teaching, or even to see how the splicing and expression look for your favorite gene.

I put together some scripts in this folder to make it easy to see how to download, rename, and extract fastq from these data.  In a future HyphalTip, I’ll detail how you can map these to the genome to get expression values and improve gene annotation.

For command-line users, after installing the Aspera plugin, you can do this to download the SRA lite data.  Aspera will give you a 10-100x speedup: I was downloading at 300 Mb/s (the cap set by -l 300M in the command) vs 5 Mb/s with wget or ncftp (FTP download). See Morgan’s info about downloading as well. You can run this command or see it in this script.

hyphaltip $ ~/.aspera/connect/bin/ascp -k 1 -l 300M -QTr -i ~/.aspera/connect/etc/asperaweb_id_dsa.putty \
anonftp@ftp-private.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByStudy/litesra/SRP/SRP004/SRP004848 ./
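If you want to fetch a different study, the remote directory can be derived from the study accession. The sketch below assumes the layout seen in the one path above (accession type, then the first six characters, then the full accession). That layout is inferred from this single example, and the function name sra_study_path is made up for illustration.

```shell
# Hypothetical helper: derive the remote litesra directory for a study
# accession, assuming the TYPE/first-six-characters/ACCESSION layout
# seen in the SRP004848 path above.
sra_study_path() {
  acc="$1"
  type=$(printf '%s' "$acc" | cut -c1-3)     # e.g. SRP
  prefix=$(printf '%s' "$acc" | cut -c1-6)   # e.g. SRP004
  printf '/sra/sra-instant/reads/ByStudy/litesra/%s/%s/%s\n' "$type" "$prefix" "$acc"
}

# Example with the study used in this post
sra_study_path SRP004848
```

The printed path is what follows the colon in the ascp source argument.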

Now the IDs you will have for all these files will be the SRA accessions, not the original strain numbers, which may be more informative/meaningful. This can be confusing: for example, the accession for one particular experiment, SRR080688, corresponds to the strain D110/FGSC 8870. You can get the list of the trace accessions and the strain IDs in this link. Sadly there isn’t one place to get the mapping from SRR accession to strain name, but if you download the brief listing and then the FTP listing (as text) you can map from each ID to the other and make this file with a little bit of Perl-Fu. You can then run this renaming script, which will rename the files using the remapping file.
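To make the renaming step concrete, here is a minimal shell sketch. It is not the renaming script linked above. It assumes a hypothetical mapping file with one tab-separated "SRR accession, strain name" pair per line, and the function name rename_by_map is invented for this example. Spaces and slashes in strain names are replaced with underscores so that the result is a valid file name.

```shell
# Hypothetical sketch: rename <SRR>.lite.sra files to <strain>.lite.sra
# using a tab-separated mapping file (column 1: SRR accession,
# column 2: strain name). Not the author's script.
rename_by_map() {
  map="$1"
  dir="$2"
  while IFS="$(printf '\t')" read -r srr strain; do
    [ -n "$srr" ] || continue                      # skip blank lines
    src="$dir/$srr.lite.sra"
    # make the strain name file-system safe: "D110/FGSC 8870" -> "D110_FGSC_8870"
    safe=$(printf '%s' "$strain" | tr ' /' '__')
    if [ -f "$src" ]; then
      mv "$src" "$dir/$safe.lite.sra"
    fi
  done < "$map"
}

# Usage (file and directory names are placeholders):
#   rename_by_map sra_to_strain.tsv SRP004848
```

Accessions in the mapping file with no matching download are skipped rather than treated as errors.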

Finally, to convert these to fastq you can run this dump_fastq.sh script, which will dump out fastq files suitable for use in future steps. These files are all single-end RNA-Seq, but if they had been paired-end, a _1.fastq and a _2.fastq file would have been made for every SRA file (plus a .fastq one for reads that failed in some way).

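cd SRP004848
# remove leftover per-run directories (rmdir only deletes empty ones)
rmdir SRR*
# dump each SRA lite file to fastq, using the file name (minus extension) as the accession
for file in *.lite.sra
do
  base=`basename "$file" .lite.sra`
  fastq-dump -alt 1 -A "$base" "$file"
done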

Together this is a pretty automated way to get the data you want, once you’ve figured out which accessions you need and how you want to rename them.