Example results

To show the advantages of properly cleaned and trimmed sequencing reads, we performed several assemblies and compare them here with previously published results. Unfortunately, the summary numbers alone do not reveal that the 3rd-party assemblies contain assembly chimeras. There are, however, two types of issue with such assemblies:

Unremoved adaptor/MID/artefact sequences cause false contig joins, typically when the minimum overlap requirement was set too low and the contaminating sequence was longer than it. In this case the unremoved contaminants promote contig joining, and the bad assembly appears to have fewer contigs than a stricter one. A naive user is happy with such an assembly because it seems to have the lowest contig count.
Unremoved contaminants prevent contig joining. If the assembly is strict enough, it yields many contigs that do not assemble together because the adaptor/MID/artefact sequences are much shorter than the minimal overlap required for a join. The unremoved sequences flank a contig on either side and prevent it from being joined to its neighbours.
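
The two failure modes above can be illustrated with a toy model. The sketch below is not how newbler or any real assembler decides on joins (real assemblers use alignments with identity thresholds). It assumes a simplified rule, exact suffix/prefix matching, and uses randomly generated stand-in sequences. The artefact lengths (44 nt and 20 nt), the read lengths and all function names are hypothetical choices made for illustration only.

```python
import random

BASES = "ACGT"


def random_seq(rng: random.Random, length: int) -> str:
    """Return a random DNA string; a stand-in for a real locus or artefact."""
    return "".join(rng.choice(BASES) for _ in range(length))


def suffix_prefix_overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def would_join(a: str, b: str, min_overlap: int) -> bool:
    """Toy join rule (assumption): join when the exact overlap reaches min_overlap."""
    return suffix_prefix_overlap(a, b) >= min_overlap


rng = random.Random(0)  # fixed seed so the toy data are reproducible

# Case 1: the same untrimmed 44 nt artefact ends one read and starts another,
# although the two reads come from unrelated loci.
long_artefact = random_seq(rng, 44)
read_a = random_seq(rng, 200) + long_artefact
read_b = long_artefact + random_seq(rng, 200)
false_join_lenient = would_join(read_a, read_b, min_overlap=40)  # artefact >= threshold
false_join_strict = would_join(read_a, read_b, min_overlap=80)   # artefact < threshold

# Case 2: two reads that truly overlap by 100 nt, but a short untrimmed
# artefact (20 nt) remains at the 3' end of the first read.
genome = random_seq(rng, 200)
short_artefact = random_seq(rng, 20)
read_c_raw = genome[:150] + short_artefact
read_c_trimmed = genome[:150]
read_d = genome[50:]
blocked = not would_join(read_c_raw, read_d, min_overlap=40)
restored = would_join(read_c_trimmed, read_d, min_overlap=80)

print("false join at 40 nt overlap:", false_join_lenient)
print("false join at 80 nt overlap:", false_join_strict)
print("true join blocked by untrimmed artefact:", blocked)
print("true join restored after trimming:", restored)
```

In this toy setting the long artefact produces a join only under the lenient threshold, while the short artefact hides a genuine 100 nt overlap until it is trimmed away, after which the join succeeds even under the stricter threshold.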

In real-world assemblies, both phenomena compete with each other in every assembly project. Interestingly, even a stricter assembly, when combined with proper trimming, gives better results. The improvements are sometimes not dramatic, but that only underscores that the simple statistics usually presented (number of contigs, scaffolds, their average lengths, etc.) do not describe the key properties of assemblies, and definitely not their quality. The positive message is that once the contaminating sub-sequences (causing false joins or preventing any joins, see above) are removed, the assembler can still find new or different paths for merging the reads. Overall, cleaned datasets assemble better and faster and take less memory, even under stricter criteria.
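
To make the point about simple statistics concrete, here is a minimal sketch of how contig count, average length and N50 are commonly computed from a list of contig lengths. The contig lengths are made-up toy numbers, not data from any assembly on this page, and the function names are hypothetical. N50 is taken here in its usual sense: the length of the contig at which the running sum of lengths, sorted from longest to shortest, first reaches half of the total.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def summarize(lengths):
    """Length-based summary statistics of the kind usually reported."""
    return {
        "count": len(lengths),
        "mean": sum(lengths) / len(lengths),
        "n50": n50(lengths),
        "largest": max(lengths),
    }


# Hypothetical toy data: four correct contigs of 100 nt each ...
correct = [100, 100, 100, 100]
# ... and the same assembly after one FALSE join of two of them.
chimeric = [200, 100, 100]

stats_correct = summarize(correct)
stats_chimeric = summarize(chimeric)
print(stats_correct)
print(stats_chimeric)
```

All of these statistics depend only on lengths, so in this toy case the chimeric assembly has fewer contigs, a higher average and a higher N50 than the correct one. Nothing in the numbers flags the false join.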

Microbacterium (5.0Mbp genome size, FLX+ whole genome shotgun and FLX+ 3.3kbp paired-end reads)

The Microbacterium project shows that one is asking for misassemblies when using the default assembly settings (newbler 2.8) with an overlap requirement of only 40 nt: the usual artefacts are close in length to this threshold or even a bit longer, so they are easy to overlap and merge. With an increased overlap length threshold one obtains an even slightly better assembly (still from the uncleaned dataset). Surely fewer contigs were falsely joined, but the result is still suboptimal. If one keeps the stricter assembly settings and takes a cleaned version of the dataset, the results are better in some ways (one scaffold fewer) and worse in others (the longest scaffold contig broke into two). It is important to realize that the numbers hardly describe the quality of the assembly. Finally, an automatically tuned cleaned dataset with adjusted trim points results in a fairly nice assembly with a single scaffold and a considerably lower count of large contigs; compared to the merely cleaned assembly it has a much higher largest scaffold contig size, average scaffold contig size and average large contig size (the last column).

	Mis-assembled (defaults)	Uncleaned (stricter assembly)	Cleaned & autotuned
all contigs (>500b)	208	204	97
Large contigs (>1kb)	107	105	59
Large avgContigSize	35231	35875	63764
scaffold contigs	88	85	16
avgScaffoldContigSize	42617	44071	237784
Scaffolds (>2kb)	4	3	1
largest scaffold contig	341414	340802	956203
largest scaffold size	2266091	3118929	3810055
scaffold contig bases	3750306	3746084	3804556
numWithPairedRead	81454	81454	106192
numberWithBothMapped	77702	77904	102053
numAlignedReads	935427	936174	968529

Acinetobacter sp. SH024 genome, ASM16363v1

SRP000509 Acinetobacter sp. SH024	ASM16363v1 http://www.ncbi.nlm.nih.gov/assembly/GCA_000163635.1#/st	Cleaned
all contigs (>500b)	89	83
Large contigs (>1kb)		69
Large avgContigSize		56734
scaffolds	26	13
avgScaffoldsize		302821
N50ScaffoldSize	870269	2314520
largestScaffoldSize		2314520
largestContigSize		289152
N50ContigSize	86799	109286
Total assembled sequence	3970841	3918191

Oncorhynchus mykiss (a rainbow trout, fish) transcriptome, SRP005674

	author’s assembly	Cleaned	Cleaned + Optimized
all contigs (>500b)	55793	33034	32692
Large contigs (>1kb)		9321	9452
Large avgContigSize		778	777
Isogroups		21317	21104
Isotigs		24634	24506
avgIsotigSize		585	632
largestContigSize		27021	27021

Corvus corone transcriptome, SRP000770

	Newbler 2.0.00.20 (40nt overlap 90identity)	Newbler 3.0 (80nt overlap 96identity, nourt, cleaned)

numberOfIsogroups		6026
numberOfIsotigs		6142
avgIsotigSize		652
N50IsotigSize		756
largestIsotigSize		4625
LargeContigs		2948
LargeAvgContigSize		939
LargeN50ContigSize		990
numberOfAllContigs	19552	6257
inputFileNumReads		387025
numAlignedReads		227874
numAlignedBases		48814890
numberAssembled		207172
numberPartial		20699
numberSingleton		111471
numberRepeat		16
numberOutlier		1620
numberTooShort		10713

Vulpes vulpes transcriptome (SRP005414, both tame + aggressive fox samples assembled together)

	Newbler 2.3 (40nt_overlap_90identity)		Newbler 2.3 (40nt_overlap_90identity, cleaned)		Newbler 3.0 (40nt_overlap_90identity_nourt, cleaned)		Newbler 3.0 (80nt_overlap_90identity_nourt, cleaned)
	Kukekova et al., 2011
numberOfIsogroups	59731		53079		40246		38574
numberOfIsotigs	87400		88529		75591		73693
avgIsotigSize	1820		2018		2201		2230
N50IsotigSize	3293		3563		3654		3628
largestIsotigSize			17286		17378		15988
LargeContigs			58596		47819		47200
LargeAvgContigSize			1185		1358		1352
LargeN50ContigSize			1419		1642		1624
allContigMetrics			98296		88124		85527
inputFileNumReads	5945235		5937702	85.41%	5945235		5945235
numAlignedReads			5071311	85.80%	5093099	85.83%	4859742	81.90%
numAlignedBases			1845165508		1840500903	85.83%	1827008477	85.20%
numberAssembled	4551142	76.60%	4407089		4497111	75.79%	4325940	72.90%
numberPartial	571726	9.60%	639262		533416	8.99%	533317	8.99%
numberSingleton	562591	9.50%	582266		604808	10.19%	874432	14.74%
numberRepeat	14192	0.20%	28261		87355	1.47%	14207	0.24%
numberOutlier	181469	3.10%	210630		144330	2.43%	119124	2.01%
numberTooShort	63630	1.10%	70194		66655	1.12%	66655	1.12%

Ziziphus celata (a plant) transcriptome

	author’s assembly	our assembly
total reads	655337	655337
aligned reads		443006
assembled reads	474025	388927
unassembled reads	181312
contigs	84645
avg contig length	408
all contigs (>500nt)		24685
large contigs (>1000nt)		10072
avg large contig length		745
largest contig		5256
isotigs		22551
avg isotig length		615
largest isotig		14001
isogroups		19945

Phalaenopsis aphrodite (a plant) transcriptome

	SRP005898, Plant & Cell Physiology
	author’s assembly	our assembly
total reads	3302528	3302765
aligned reads		2944414
assembled reads		2676907
singletons	85144	284568
Contigs (>200nt, >500nt, respectively)	34563	39867
avg contig length (>200nt)	1194
all contigs (>500nt)		39867
large contigs (>1000nt)		20946
avg large contig length		1239
largest contig		11757
isotigs		30861
avg isotig length		1483
largest isotig		11757
isogroups		20578
