Fungal genome assembly from short-read sequences

This is a research blog, so I thought I'd post some quick numbers we are seeing for de novo assembly of the Neurospora crassa genome using Velvet. The N. crassa genome is about 40Mb, and we sequenced several flow cells' worth of Solexa/Illumina data to see what kind of de novo reconstruction we'd get. I knew this was probably insufficient for a very good assembly given what has been reported in the literature, but sometimes it is helpful to try it on local data; this was mostly a SNP-discovery project from the outset. Velvet was run with a hash size of 21 on an early (2 flow cell, "2FC") and a later (4 flow cell, "4FC") dataset. I settled on 21 based on some calculations and on running Velvet with a range of hash sizes to see which gave the best N50. Summary contig-size numbers come from the following command, using cndtools from Colin Dewey:

  faLen < contigs.fa | stats
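
For the curious, the hash-size scan amounted to something like the loop below. This is a minimal sketch: asm_k$k and reads.fastq are placeholder names, and the actual runs may have differed in detail.

  # Assemble at several odd hash sizes and compare contig stats;
  # velveth builds the k-mer index, velvetg does the assembly.
  for k in 19 21 23 25; do
      velveth asm_k$k $k -fastq -short reads.fastq
      velvetg asm_k$k
      echo "k = $k"
      faLen < asm_k$k/contigs.fa | stats | grep N50
  done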

2 flow cells (~10M reads at 36bp/read, or about 10X coverage of the 40Mb genome)

            N = 199562
          SUM = 25463251
          MIN = 49
 1ST-QUARTILE = 87
       MEDIAN = 107.0
 3RD-QUARTILE = 146
          MAX = 5371
         MEAN = 127.59568956
          N50 = 130

4 flow cells (~20M reads at 36bp/read, or about 20X coverage of the 40Mb genome)

            N = 102437
          SUM = 38352075
          MIN = 41
 1ST-QUARTILE = 77.0
       MEDIAN = 153
 3RD-QUARTILE = 467
          MAX = 7189
         MEAN = 374.396702363
          N50 = 837

So that's an N50 of 837bp. For those used to seeing N50s on the order of 1.5Mb this is not great, but it came from just 4 FC worth of sequencing, which was pretty cheap. This is a reasonably repeat-limited genome, so we should get pretty good recovery if the sequence coverage is high enough. Using Maq we can both scaffold the reads against the reference and recover a sufficient number of high-quality SNPs for the mapping part of the project.
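
For the record, the reference-based Maq workflow amounts to a handful of commands, roughly as sketched below (file names are placeholders):

  # Sketch of a Maq reference-mapping and SNP-calling workflow.
  # ref.fasta and reads.fastq are placeholder file names.
  maq fasta2bfa ref.fasta ref.bfa        # convert the reference to binary format
  maq fastq2bfq reads.fastq reads.bfq    # convert the reads to binary format
  maq map aln.map ref.bfa reads.bfq      # map reads against the reference
  maq assemble cns.cns ref.bfa aln.map   # call the consensus
  maq cns2snp cns.cns > snps.txt         # extract candidate SNPs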

To get a better assembly one would need much deeper coverage, as Daniel and Ewan explain in their Velvet paper and show in Figure 4 (sorry, not open-access for 6 months). Full credit: the sequence came from unpaired reads generated by Illumina/Solexa genomic sequencing at the UCB/QB3 facility, on libraries prepared by Charles Hall in the Glass lab.

8 thoughts on “Fungal genome assembly from short-read sequences”

  1. Any reason you're doing a whole genome for a de novo study? Is there a reference genome, and you're just playing around with de novo assembly? Or is de novo your only option? If there is no reference genome, why not sequence a cDNA library rather than the entire genome? cDNA libraries would be much easier to assemble, I imagine.

  2. There is definitely a reference genome. These data were generated mostly for finding SNPs. cDNA sequencing would be the next approach if we just wanted genes, though I imagine there can be difficulties getting lowly expressed genes to assemble into contigs well enough. For phylogenetic studies this will be pretty useful, I imagine; it's something we'll be tackling empirically in the near future, and I'm waiting for sequencing runs to finish. All that said, at some point I'd just like to be able to tackle whole-genome de novo assembly for clades that don't have references. I would imagine a mix of 454 and Solexa works best for genomic data, but as I said, I wanted to try with what we had. It is also helpful to have hard numbers to point to when someone asks what they can get out of 2 or 4 flow cells' worth of genomic sequencing on a small genome.

  3. Jason, did you try euler-sr and edena as well? 20X is simply too low for Solexa, but it would still be interesting to see those comparative results. If you write to Bastien, you can get the Solexa version of his assembler (MIRA) to play with as well.

  4. I was scanning the web for the latest on short-read assembly, and this sparked my interest. I worked a little bit on de novo assembly algorithms (I wrote VCAKE, which was obsolete about a week after it was published), and I like to check up on the field. Did you try SSAKE 2.0 on this data? My impression is that Velvet, despite its more advanced algorithms, may not actually have much-improved results. In particular it may do poorly in situations of lower coverage, which is where SSAKE 2.0 is designed to shine.
    Good luck playing around with it.

  5. VCAKE just took way too long and too much memory to finish on the 4-flow-cell run. Our experience on the 2FC dataset was that it did give slightly longer contigs (I don't have the numbers in front of me), but the resource requirements and performance were just not up to par. I haven't given SSAKE 2.0 a try yet, but I'm sure people doing assembler bake-offs will have, so we'll see what's cooking.

    Basically, this was a proof of principle that 20X is not enough coverage to do much with on the de novo front, and I wanted numbers to point to.

  6. You should definitely give SSAKE a shot (v3.2.1), just quality-trim your reads first. I wrote a utility (TQS.py) to do this on a per-tile basis, and it is included with the SSAKE release.
    I think you should be able to get 5-10kbp N50s with the 20X depth; it really depends on the number and size of the repeats your genome has (what % of your genome is repeats?). Also, do you have paired-end reads? SSAKE 3.0+ has PET support for building scaffolds.
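
    In rough outline it is just trim-then-assemble, something like the sketch below. Treat the exact invocations as assumptions: the file names are placeholders, and TQS.py's arguments are spelled out in the release docs rather than here.

      # Rough outline only; check the SSAKE README for the real options.
      python TQS.py s_1_sequence.txt > trimmed_reads.fa   # per-tile quality trim (arguments assumed)
      SSAKE -f trimmed_reads.fa                           # assemble the trimmed reads (-f = input fasta)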

    Will: VCAKE wasn't obsolete one week later; a group used it and published in Nature Methods shortly after:

    Cloonan N, Forrest ARR, Kolle G, Gardiner BBA, et al. Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nature Methods, 2008.

  7. I came across this forum while googling. I'm not so surprised about the inferior results of EULER with 20X coverage; some work is being done behind the scenes now to deal with this. Some of the problems should go away for both EULER and Velvet when 50+ base reads are used, since many gaps are due to the not-quite-uniform sampling that occurs.
