This is a research blog, so I thought I’d post some quick numbers we are seeing for de novo assembly of the Neurospora crassa genome using Velvet. The N. crassa genome is about 40Mb, and we sequenced several flow cells using Solexa/Illumina technology to see what kind of de novo reconstruction we’d get. I knew this was probably insufficient for a very good assembly given what has been reported in the literature, but sometimes it is helpful to give it a try on local data. From the outset this has mostly been a project about SNP discovery. I ran Velvet on an early (2FC) and a later (4FC) dataset with a hash size of 21, chosen based on some calculations and on running it with different hash sizes to find the best N50. The summary contig size numbers come from the following command, using cndtools from Colin Dewey.
faLen < contigs.fa | stats
2 flow cells (~10M reads @36bp/read; or about 10X coverage of a 40Mb genome)
N = 199562 SUM = 25463251 MIN = 49 1ST-QUARTILE = 87 MEDIAN = 107.0 3RD-QUARTILE = 146 MAX = 5371 MEAN = 127.59568956 N50 = 130
4 flow cells (~20M reads @36bp/read; or about 20X coverage of a 40Mb genome)
N = 102437 SUM = 38352075 MIN = 41 1ST-QUARTILE = 77.0 MEDIAN = 153 3RD-QUARTILE = 467 MAX = 7189 MEAN = 374.396702363 N50 = 837
So that’s an N50 of 837bp. For those used to seeing N50 on the order of 1.5Mb this is not great, but it comes from 4 FC worth of sequencing, which was pretty cheap. This is a reasonably repeat-limited genome, so we should get pretty good recovery if the sequence coverage is high enough. Using Maq we can both scaffold the reads and recover a sufficient number of high-quality SNPs for the mapping part of the project.
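For anyone without cndtools installed, here is a rough Python sketch of how the contig summary numbers could be reproduced. It assumes the common definition of N50 (the length of the contig at which the running total of lengths, sorted longest first, reaches half of the assembly size); I have not checked that the cndtools `stats` program uses exactly this definition. The function names are made up for illustration.

```python
import io


def fasta_lengths(handle):
    """Yield the length of each sequence in a FASTA-formatted file handle."""
    length = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def n50(lengths):
    """Return N50 under the assumed definition described in the text."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for n in ordered:
        running += n
        if running >= half:
            return n
    raise ValueError("no contig lengths given")


def summarize(lengths):
    """Collect a few of the summary values for a set of contig lengths."""
    lengths = list(lengths)
    total = sum(lengths)
    return {
        "N": len(lengths),
        "SUM": total,
        "MIN": min(lengths),
        "MAX": max(lengths),
        "MEAN": total / len(lengths),
        "N50": n50(lengths),
    }


if __name__ == "__main__":
    # Toy input standing in for contigs.fa; not real assembly data.
    toy = io.StringIO(">c1\nACGTAC\n>c2\nGG\n>c3\nACGT\n")
    print(summarize(fasta_lengths(toy)))
```

To run it on a real assembly, replace the toy handle with `open("contigs.fa")`.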
To get a better assembly one would need much deeper coverage, as Daniel and Ewan explain in their Velvet paper and show in Figure 4 (sorry, not open-access for 6 mo). Full credit: these are unpaired reads from Illumina/Solexa genomic sequencing done at the UCB/QB3 facility on libraries prepared by Charles Hall in the Glass lab.
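As a back-of-the-envelope check on the coverage numbers, here is a small sketch of the arithmetic. It uses the rounded read counts quoted above (~10M and ~20M reads of 36bp against a 40Mb genome), so the results are approximate. The k-mer coverage relation, Ck = C * (L - k + 1) / L, is the one given in the Velvet manual; the function names are just for illustration.

```python
def nucleotide_coverage(n_reads, read_len, genome_size):
    """Expected fold coverage: total sequenced bases over genome size."""
    return n_reads * read_len / genome_size


def kmer_coverage(coverage, read_len, k):
    """Convert nucleotide coverage C to k-mer coverage Ck = C * (L - k + 1) / L."""
    if k > read_len:
        raise ValueError("k cannot be longer than the read length")
    return coverage * (read_len - k + 1) / read_len


if __name__ == "__main__":
    genome = 40_000_000  # ~40Mb genome
    read_len = 36
    k = 21  # hash size used for these assemblies
    for label, n_reads in (("2FC", 10_000_000), ("4FC", 20_000_000)):
        c = nucleotide_coverage(n_reads, read_len, genome)
        ck = kmer_coverage(c, read_len, k)
        print(f"{label}: ~{c:.0f}X nucleotide coverage, ~{ck:.0f}X k-mer coverage")
```

With 36bp reads and a hash size of 21, each read contributes only 16 k-mers, so the k-mer coverage that Velvet actually works with is well under half of the nucleotide coverage. That is part of why much deeper sequencing would be needed for a better assembly.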