
Job: Fungal RefSeq Curator @NCBI

Fungi RefSeq curator position at NCBI

Original job post here.

Computercraft seeks a highly motivated individual who will use his or her biological expertise to support RefSeq sequence standards and to contribute functional annotation of both the sequence record and the companion resource, NCBI’s Gene database. The NCBI Reference Sequence (RefSeq) project provides reference sequence standards that are used internationally for genome annotation. RefSeqs provide a stable reference for gene characterization, mutation analysis, expression studies, and polymorphism discovery.

This is an exciting opportunity to contribute to the RefSeq project while using state-of-the-art computational tools and databases. Curators work on-site at the National Institutes of Health, National Center for Biotechnology Information (NCBI) in Bethesda, Maryland.


  • Evaluate and analyze sequence data from Fungi to provide the most complete and accurate reference sequences to define coding and non-coding transcripts, protein products, and genomic regions
  • Analyze phylogenetic trees supporting functional annotation and verifying species identification of genome data
  • Communicate with other scientists to ensure the highest quality data content for RefSeq records and the Gene database
  • Coordinate with model organism databases and other organism-specific interest groups to ensure timely processing of genomic sequence data and accurate display of annotation in NCBI resources
  • Collaborate with other scientists to expand the content for RefSeq’s fungal genome ITS and rRNA records
  • Contribute toward NCBI initiatives to improve Fungal genome resources


  • Ph.D. in molecular biology and/or genomics of Fungi, or a related field
  • Postdoctoral experience
  • Extensive experience with functional genome annotation of Fungi
  • Extensive experience with evaluating structural annotation of Fungi genomes
  • Experience in phylogenetic analysis of fungal sequences
  • Strong logic, problem-solving, and organizational skills
  • Excellent verbal and written communication skills
  • Ability to work both independently and as part of a team
  • Ability to adhere to established procedures
  • A detail-oriented perspective
  • A strong desire to support public scientific databases such as RefSeq and Gene

This is an intellectually challenging, detail-oriented position which will provide an excellent opportunity to use your biology expertise in a non-laboratory position. For more information about the RefSeq project and Gene, please see:

RefSeq: http://www.ncbi.nlm.nih.gov/RefSeq/

A HyphalTip: Get a bunch of SRA data

[Trying to post some simple tips from time to time; they’ll be in this same category]

If you want a large dataset from the NCBI SRA trace archive, clicking on each link and waiting for the download gets annoying. For example, if you read a recent paper on population genomics in Neurospora crassa and saw the 48 RNA-Seq datasets under accession number SRP004848, you might want to download all of these data for your own re-analysis, as a great dataset for teaching, or even to see how the splicing and expression look for your favorite gene.

I put together some scripts in this folder to make it easy to see how to download, rename, and extract fastq from these data.  In a future HyphalTip, I’ll detail how you can map these to the genome to get expression values and improve gene annotation.

For command-line users: after installing the Aspera plugin, you can run the command below to download the SRA lite data. Aspera gives a 10-100x speedup: I was downloading at 300 Mb/s, versus 5 Mb/s with wget or ncftp (FTP download). See Morgan’s info about downloading as well. You can run this command or see it in this script.

hyphaltip $ ~/.aspera/connect/bin/ascp -k 1 -l 300M -QTr -i ~/.aspera/connect/etc/asperaweb_id_dsa.putty \
anonftp@ftp-private.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByStudy/litesra/SRP/SRP004/SRP004848 ./
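
If you want other studies, the remote directory in the command above looks like it can be derived from the accession itself. Here is a minimal sketch of that; the function name sra_study_path is hypothetical, and the layout (accession type, then the first six characters, then the full accession) is an assumption inferred only from the single SRP004848 path shown here.

```shell
#!/usr/bin/env bash
# Build the remote directory for an SRA study accession.
# Assumption: the layout is <type>/<first 6 characters>/<full accession>,
# inferred only from the SRP004848 example path; check it before relying on it.
sra_study_path() {
  local acc="$1"
  local root="/sra/sra-instant/reads/ByStudy/litesra"
  # ${acc:0:3} is the accession type (SRP), ${acc:0:6} the bucket (SRP004)
  printf '%s/%s/%s/%s\n' "$root" "${acc:0:3}" "${acc:0:6}" "$acc"
}

sra_study_path SRP004848
```

The printed path can then be pasted after anonftp@ftp-private.ncbi.nlm.nih.gov: in the ascp command.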

The downloaded files are named by SRA accession, not by the original strain numbers, which may be more meaningful. This can be confusing: for example, the experiment with accession SRR080688 is strain D110/FGSC 8870. You can get the list of trace accessions and strain IDs at this link. Sadly, there isn’t one place that maps SRR accessions to strain names, but if you download the brief listing and the FTP listing (as text), you can map each ID to the other and make this file with a little bit of Perl-Fu. You can then run this renaming script, which renames the files using the remapping file.
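
To give a feel for the renaming step, here is a rough sketch in bash. It is not the renaming script linked above: the function name rename_by_map, the two-column tab-separated mapping layout (accession, then strain), and the file name in the usage comment are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Rename ACCESSION.lite.sra files to STRAIN.lite.sra using a two-column,
# tab-separated mapping file (accession<TAB>strain).
# The mapping layout and the function name are illustrative assumptions.
rename_by_map() {
  local map="$1" dir="$2"
  local acc strain safe
  while IFS=$'\t' read -r acc strain || [ -n "$acc" ]; do
    [ -n "$acc" ] || continue                 # skip blank lines
    [ -f "$dir/$acc.lite.sra" ] || continue   # skip accessions not downloaded
    # Replace characters that are awkward in file names,
    # e.g. "D110/FGSC 8870" becomes "D110_FGSC_8870"
    safe=${strain//[^A-Za-z0-9._-]/_}
    mv -n "$dir/$acc.lite.sra" "$dir/$safe.lite.sra"   # -n: never overwrite
  done < "$map"
}

# Hypothetical usage:
#   rename_by_map srr2strain.tsv SRP004848
```

Slashes and spaces in strain names are replaced with underscores so that the result is a single valid file name.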

Finally, to convert these to fastq you can run this dump_fastq.sh script, which will dump out fastq files suitable for use in future steps. These files are all single-end RNA-Seq, but had they been paired-end, a _1.fastq and a _2.fastq file would have been made for every SRA file (plus a .fastq one for reads that failed in some way).

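cd SRP004848
# Remove the per-run SRR directories; rmdir only removes empty ones, so this
# assumes the .lite.sra files already sit directly in SRP004848
rmdir SRR*
for file in *.lite.sra; do
  base=$(basename "$file" .lite.sra)
  fastq-dump -alt 1 -A "$base" "$file"
done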
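
As a quick sanity check after dumping, you could count the reads in each fastq file. This is a small sketch, not part of dump_fastq.sh; the function name count_fastq_reads is hypothetical, and it assumes the usual four lines per record with no wrapped sequence or quality lines.

```shell
#!/usr/bin/env bash
# Count reads in a FASTQ file.
# Assumption: four lines per record (header, sequence, "+", qualities),
# with no line-wrapped sequences.
count_fastq_reads() {
  local lines
  lines=$(wc -l < "$1")
  echo $(( lines / 4 ))
}

# Hypothetical usage, run inside the directory holding the fastq files:
#   for fq in *.fastq; do
#     printf '%s\t%s\n' "$fq" "$(count_fastq_reads "$fq")"
#   done
```

A file that reports zero reads, or far fewer than its neighbors, is worth dumping again before moving on.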

Together this is a pretty automated way to get the data you want, once you’ve figured out which accessions you need and how you want to rename them.

Schizosaccharomyces genomes

The Broad Institute has made available the Schizosaccharomyces octosporus genome sequence, making S. pombe another model system with several related species available for comparative genomics. I believe the S. octosporus genome was sequenced entirely with 454 technology. The other genome sequences in the Taphrina clade include the S. japonicus genome. S. octosporus is pretty interesting, as it grows filamentously and is 8-spored, unlike S. pombe. Understanding the origin of this filamentous growth would be quite important for learning how reversions to simpler fission yeast forms arise, and whether they involve the loss of whole gene families or the remodeling of gene networks.

There is also some preliminary (old) sequence from Pneumocystis, although it is hard to track down: a paper from 2006 says there is a draft sequence, but none shows up in GenBank.
