A HyphalTip: Get a bunch of SRA data

[I'm trying to post some simple tips from time to time; they'll all be in this same category.]

If you want a large dataset from the NCBI Sequence Read Archive (SRA), clicking on each link and waiting for every download gets tedious. For example, if you read a recent paper on population genomics in Neurospora crassa and saw the 48 RNA-Seq datasets under accession number SRP004848, you might want to download all of these data for your own re-analysis, as a great dataset for teaching, or even just to see how splicing and expression look for your favorite gene.

I put together some scripts in this folder to make it easy to see how to download, rename, and extract fastq from these data.  In a future HyphalTip, I’ll detail how you can map these to the genome to get expression values and improve gene annotation.

For command-line users: after installing the Aspera plugin, you can run the following to download the SRA lite data. Aspera will give you a 10-100x speedup; I was downloading at 300 Mb/s vs 5 Mb/s with wget or ncftp (FTP download). See Morgan’s info about downloading as well. You can run this command or see it in this script.

hyphaltip $ ~/.aspera/connect/bin/ascp -k 1 -l 300M -QTr -i ~/.aspera/connect/etc/asperaweb_id_dsa.putty \
anonftp@ftp-private.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByStudy/litesra/SRP/SRP004/SRP004848 ./
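If you want to fetch a different study, the remote directory can be derived from the accession. This is only a sketch: the helper name `sra_study_path` is hypothetical, and the rule it uses (accession type, then the first six characters, then the full accession) is inferred from the single path in the command above, so check it against the FTP listing before relying on it.

```shell
# Hypothetical helper: build the remote directory for a study accession,
# assuming the layout seen above: .../litesra/SRP/SRP004/SRP004848
sra_study_path() (
  acc="$1"
  prefix=$(printf '%s' "$acc" | cut -c1-3)   # e.g. SRP
  bucket=$(printf '%s' "$acc" | cut -c1-6)   # e.g. SRP004
  printf '/sra/sra-instant/reads/ByStudy/litesra/%s/%s/%s\n' \
    "$prefix" "$bucket" "$acc"
)

# Usage, with the same ascp options as the command above:
#   ~/.aspera/connect/bin/ascp -k 1 -l 300M -QTr \
#     -i ~/.aspera/connect/etc/asperaweb_id_dsa.putty \
#     "anonftp@ftp-private.ncbi.nlm.nih.gov:$(sra_study_path SRP004848)" ./
```

The function body runs in a subshell, so its variables do not leak into your session.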

The files you now have are named by their SRA accessions, not by the original strain numbers, which may be more informative and meaningful. This can be confusing: the run accession SRR080688, for example, corresponds to strain D110/FGSC 8870. You can get the list of the trace accessions and the strain IDs in this link. Sadly there isn’t one place that maps each SRR accession to a strain name, but if you download the brief listing and then the FTP listing (as text), you can map from each ID to the other and make this file with a little bit of Perl-Fu. You can then run this renaming script, which will rename the files using the remapping file.
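To give a feel for the renaming step, here is a minimal shell sketch. It is not the renaming script linked above. It assumes a hypothetical tab-separated mapping file (SRR accession in column 1, strain name in column 2) and assumes the `.lite.sra` files already sit together in one directory. The function name `rename_by_map` is made up for illustration.

```shell
# Hypothetical helper: rename_by_map MAPFILE [DIR]
# MAPFILE is assumed to be tab-separated: SRR accession <TAB> strain name.
rename_by_map() (
  map="$1"
  dir="${2:-.}"
  tab=$(printf '\t')
  while IFS="$tab" read -r acc strain || [ -n "$acc" ]; do
    # skip blank or incomplete lines
    [ -n "$acc" ] && [ -n "$strain" ] || continue
    # make the strain name filename-safe: "/" and spaces become "_"
    safe=$(printf '%s' "$strain" | tr '/ ' '__')
    src="$dir/$acc.lite.sra"
    dst="$dir/$safe.lite.sra"
    # rename only if the source exists and nothing would be overwritten
    if [ -e "$src" ] && [ ! -e "$dst" ]; then
      mv "$src" "$dst"
    fi
  done < "$map"
)
```

With a line such as `SRR080688<TAB>D110/FGSC 8870` in the mapping file, `rename_by_map map.tsv .` would rename `SRR080688.lite.sra` to `D110_FGSC_8870.lite.sra`.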

Finally, to convert these to fastq you can run this dump_fastq.sh script, which will dump out fastq files suitable for use in future steps. These files are all single-end RNA-Seq, but if they had been paired-end, a _1.fastq and a _2.fastq file would have been made for every SRA file (plus a .fastq one for reads that failed in some way).

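cd SRP004848
# remove the per-run directories (rmdir only deletes them if they are empty)
rmdir SRR*
# dump fastq from each lite SRA file, named after the file's base name
for file in *.lite.sra
do
  base=$(basename "$file" .lite.sra)
  fastq-dump -alt 1 -A "$base" "$file"
done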
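
Once the dump has finished, the naming convention described above gives a quick sanity check on each sample. The sketch below uses a hypothetical helper, `fastq_layout`, that only looks at which output files exist. It does not inspect their contents, and it assumes the fastq files were written into the directory you point it at.

```shell
# Hypothetical helper: fastq_layout BASE [DIR]
# Prints "paired" if BASE_1.fastq and BASE_2.fastq both exist,
# "single" if only BASE.fastq exists, and "missing" otherwise.
fastq_layout() (
  base="$1"
  dir="${2:-.}"
  if [ -e "$dir/${base}_1.fastq" ] && [ -e "$dir/${base}_2.fastq" ]; then
    echo paired
  elif [ -e "$dir/$base.fastq" ]; then
    echo single
  else
    echo missing
  fi
)
```

For the single-end data in this study, every sample should report `single`. A `missing` result suggests that the dump for that file failed or wrote its output somewhere else.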

Together this is a pretty automated way to get the data you want, once you’ve figured out which accessions you need and how you want to rename them.