To show the advantages of properly cleaned and trimmed sequencing reads, we performed several assemblies and compare them here with previously published results. Unfortunately, the numbers alone do not reveal that the third-party assemblies contain assembly chimeras. Unremoved contaminants cause two types of problems in assemblies:
- Unremoved adaptor/MID/artefact sequences cause false contig joins, typically when the required overlap is set too low and the contaminating sequence is longer than that overlap. The unremoved contaminants then promote contig joining, and the bad assembly appears to have fewer contigs than a stricter one. A naive user is happy with such an assembly because it seems to have the lowest contig count.
- Unremoved contaminants prevent contig joining. If the assembly is strict enough, it yields many contigs that do not assemble together because the adaptor/MID/artefact sequences are much shorter than the minimal overlap required for a join. The unremoved sequences flank a contig on either side and prevent it from being joined to its neighbours.
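Both failure modes can be reproduced with a toy model. The sketch below is not how Newbler works internally (real assemblers align reads with an identity threshold rather than demanding exact matches). It only assumes an exact suffix-prefix overlap test, and every sequence in it, including the 44-nt and 20-nt "artefacts", is invented for illustration.

```python
MIN_OVERLAP = 40  # the permissive overlap threshold discussed in the text


def suffix_prefix_overlap(left: str, right: str, min_overlap: int = MIN_OVERLAP) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`.

    Returns 0 when no exact overlap of at least `min_overlap` bases exists.
    """
    longest_possible = min(len(left), len(right))
    for length in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


# Invented sequences (assumptions, not real adaptors or genomes).
LONG_ARTEFACT = "ACGTTGCAAGC" "TCCGATAGGTC" "AATGCCGTACG" "GTTCAGCATTG"  # 44 nt
SHORT_ARTEFACT = "CATGTCAGTACCTAGCGGGG"  # 20 nt
FRAG_A = "ATGGCGTACCTGAAGCTTCAGGACATCGCTAGCCTGAAGGCTCATGCGTACGATCCAGTG"
FRAG_B = "TTTTCAGGCATCGACCGTAAGCTAGGCATCCGATTACGGCATGCAAGTCCGTACGCTGAA"
REGION = ("GCTAGCATCGATCCGTAGCTTGACCATGCAAGTCCGATAC"
          "GTTAGCCTAGCATTCGAGCATGCCTAAGTCGATCGTACCATGCAGTTCGA")

# 1) False join: two unrelated reads share only the untrimmed 44-nt artefact.
false_join = suffix_prefix_overlap(FRAG_A + LONG_ARTEFACT, LONG_ARTEFACT + FRAG_B)
after_trim = suffix_prefix_overlap(FRAG_A, FRAG_B)

# 2) Blocked join: two reads from one region truly overlap by 45 nt, but a
#    20-nt artefact left on the end of the first read hides that overlap.
read_c, read_d = REGION[:60], REGION[15:]
blocked = suffix_prefix_overlap(read_c + SHORT_ARTEFACT, read_d)
rescued = suffix_prefix_overlap(read_c, read_d)

print(false_join, after_trim)  # 44 0
print(blocked, rescued)        # 0 45
```

With a 40-nt threshold the 44-nt artefact alone is enough to join unrelated reads, while the 20-nt artefact can never create a join by itself yet still masks the genuine 45-nt overlap until it is trimmed.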
In real-world assemblies, both phenomena compete with each other in every project. Interestingly, even a stricter assembly gives better results when the reads are properly trimmed. The improvements are sometimes not dramatic, but that only underscores that the simple statistics usually presented (number of contigs, number of scaffolds, their average lengths, etc.) do not describe the key properties of an assembly, and definitely not its quality. The positive message is that once the contaminating sub-sequences are removed (those causing false joins or preventing joins altogether, see above), the assembler can still find new or different paths for merging the reads. Overall, cleaned datasets assemble better and faster and take less memory, even under stricter criteria.
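To see why such summary statistics can reward a wrong assembly, consider a minimal sketch that computes contig count, mean length, largest contig and N50. The function name `assembly_stats`, the 500-base cutoff default and the contig lengths are all illustrative assumptions, not values from any of the projects discussed here.

```python
from typing import Dict, Iterable


def assembly_stats(lengths: Iterable[int], min_len: int = 500) -> Dict[str, int]:
    """Summarise contigs longer than `min_len` bases."""
    kept = sorted((n for n in lengths if n > min_len), reverse=True)
    if not kept:
        return {"count": 0, "total": 0, "mean": 0, "largest": 0, "n50": 0}
    total = sum(kept)
    running = 0
    n50 = 0
    for n in kept:
        # N50: contig length at which the running sum reaches half the total.
        running += n
        if 2 * running >= total:
            n50 = n
            break
    return {"count": len(kept), "total": total, "mean": total // len(kept),
            "largest": kept[0], "n50": n50}


correct = [4000, 3000, 2000, 1000]  # invented contig lengths
chimeric = [7000, 2000, 1000]       # same bases, but 4000 and 3000 falsely joined

print(assembly_stats(correct))
print(assembly_stats(chimeric))
```

In this made-up case the chimeric assembly has fewer contigs, a higher mean length and a higher N50 than the correct one, although it contains a false join.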
Microbacterium (5.0Mbp genome size, FLX+ whole genome shotgun and FLX+ 3.3kbp paired-end reads)
The Microbacterium project shows that using the default assembly settings (Newbler 2.8), with an overlap requirement of only 40 nt, is asking for misassemblies: the usual artefacts are close to this threshold in length, or even slightly longer, so they overlap and merge easily. Raising the overlap length threshold yields a slightly better assembly even with the dataset still uncleaned. Fewer false contig joins were certainly made, but the result is still suboptimal. Keeping the stricter assembly settings and switching to a cleaned version of the dataset gives results that are better in some ways (one scaffold fewer) and worse in others (the longest scaffold contig broke into two). The important point is that the numbers hardly describe the quality of the assembly. Finally, a cleaned dataset with automatically tuned trim points gives a fairly nice assembly, with a single scaffold and a considerably lower count of large contigs. Compared with the merely cleaned assembly, it also has a much larger largest scaffold contig, average scaffold contig size and average large contig size (the last column).
Metric | Mis-assembled (defaults) | Uncleaned (stricter assembly) | Cleaned & autotuned |
all contigs (>500b) | 208 | 204 | 97 |
Large contigs (>1kb) | 107 | 105 | 59 |
Large avgContigSize | 35231 | 35875 | 63764 |
scaffold contigs | 88 | 85 | 16 |
avgScaffoldContigSize | 42617 | 44071 | 237784 |
Scaffolds (>2kb) | 4 | 3 | 1 |
largest scaffold contig | 341414 | 340802 | 956203 |
largest scaffold size | 2266091 | 3118929 | 3810055 |
scaffold contig bases | 3750306 | 3746084 | 3804556 |
numWithPairedRead | 81454 | 81454 | 106192 |
numberWithBothMapped | 77702 | 77904 | 102053 |
numAlignedReads | 935427 | 936174 | 968529 |
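As a quick consistency check on the table above, the reported avgScaffoldContigSize appears to equal the scaffold contig bases divided by the number of scaffold contigs, rounded down. That relation is inferred from the numbers rather than stated by the source, and the sketch assumes the three value columns follow the header order. It also derives the fraction of paired reads with both mates mapped, which the table does not list.

```python
# Values copied from the Microbacterium table; the dictionary keys are
# illustrative names chosen for this sketch.
table = {
    "Mis-assembled (defaults)": {
        "scaffold_contigs": 88, "scaffold_contig_bases": 3750306,
        "avg_scaffold_contig_size": 42617,
        "with_paired_read": 81454, "both_mapped": 77702,
    },
    "Uncleaned (stricter assembly)": {
        "scaffold_contigs": 85, "scaffold_contig_bases": 3746084,
        "avg_scaffold_contig_size": 44071,
        "with_paired_read": 81454, "both_mapped": 77904,
    },
    "Cleaned & autotuned": {
        "scaffold_contigs": 16, "scaffold_contig_bases": 3804556,
        "avg_scaffold_contig_size": 237784,
        "with_paired_read": 106192, "both_mapped": 102053,
    },
}

derived_avg = {}
both_mapped_fraction = {}
for name, row in table.items():
    # Integer division reproduces the reported averages (assumed rounding down).
    derived_avg[name] = row["scaffold_contig_bases"] // row["scaffold_contigs"]
    both_mapped_fraction[name] = row["both_mapped"] / row["with_paired_read"]
    print(f"{name}: avg={derived_avg[name]} "
          f"(reported {row['avg_scaffold_contig_size']}), "
          f"both mates mapped={both_mapped_fraction[name]:.1%}")
```

The scaffold contig bases differ by less than 2% between the three assemblies, so the large jump in average scaffold contig size comes almost entirely from the drop in the number of scaffold contigs.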
Acinetobacter sp. SH024 genome, ASM16363v1
SRP000509 Acinetobacter sp. SH024 | ASM16363v1 http://www.ncbi.nlm.nih.gov/assembly/GCA_000163635.1#/st | Cleaned |
all contigs (>500b) | 89 | 83 |
Large contigs (>1kb) | 69 | |
Large avgContigSize | 56734 | |
scaffolds | 26 | 13 |
avgScaffoldsize | 302821 | |
N50ScaffoldSize | 870269 | 2314520 |
largestScaffoldSize | 2314520 | |
largestContigSize | 289152 | |
N50ContigSize | 86799 | 109286 |
Total assembled sequence | 3970841 | 3918191 |
Oncorhynchus mykiss (a rainbow trout, fish) transcriptome, SRP005674
Metric | author’s assembly | Cleaned | Cleaned + Optimized |
all contigs (>500b) | 55793 | 33034 | 32692 | |
Large contigs (>1kb) | 9321 | 9452 | ||
Large avgContigSize | 778 | 777 | ||
Isogroups | 21317 | 21104 | ||
Isotigs | 24634 | 24506 | ||
avgIsotigSize | 585 | 632 | ||
largestContigSize | 27021 | 27021 |
Corvus corone transcriptome, SRP000770
Metric | Newbler 2.0.00.20 (40nt overlap 90identity) | Newbler 3.0 (80nt overlap 96identity, nourt, cleaned) |
numberOfIsogroups | 6026 | |
numberOfIsotigs | 6142 | |
avgIsotigSize | 652 | |
N50IsotigSize | 756 | |
largestIsotigSize | 4625 | |
LargeContigs | 2948 | |
LargeAvgContigSize | 939 | |
LargeN50ContigSize | 990 | |
numberOfAllContigs | 19552 | 6257 |
inputFileNumReads | 387025 | |
numAlignedReads | 227874 | |
numAlignedBases | 48814890 | |
numberAssembled | 207172 | |
numberPartial | 20699 | |
numberSingleton | 111471 | |
numberRepeat | 16 | |
numberOutlier | 1620 | |
numberTooShort | 10713 |
Vulpes vulpes transcriptome (SRP005414, both tame + aggressive fox samples assembled together)
Metric | Newbler 2.3 (40nt_overlap_90identity) | Newbler 2.3 (40nt_overlap_90identity, cleaned) | Newbler 3.0 (40nt_overlap_90identity_nourt, cleaned) | Newbler 3.0 (80nt_overlap_90identity_nourt, cleaned) |
Kukekova et al., 2011 |
numberOfIsogroups | 59731 | 53079 | 40246 | 38574 |
numberOfIsotigs | 87400 | 88529 | 75591 | 73693 |
avgIsotigSize | 1820 | 2018 | 2201 | 2230 |
N50IsotigSize | 3293 | 3563 | 3654 | 3628 |
largestIsotigSize | 17286 | 17378 | 15988 |
LargeContigs | 58596 | 47819 | 47200 |
LargeAvgContigSize | 1185 | 1358 | 1352 |
LargeN50ContigSize | 1419 | 1642 | 1624 |
allContigMetrics | 98296 | 88124 | 85527 |
inputFileNumReads | 5945235 | 5937702 | 85.41% | 5945235 | 5945235 |
numAlignedReads | 5071311 | 85.80% | 5093099 | 85.83% | 4859742 | 81.90% |
numAlignedBases | 1845165508 | 1840500903 | 85.83% | 1827008477 | 85.20% |
numberAssembled | 4551142 | 76.60% | 4407089 | 4497111 | 75.79% | 4325940 | 72.90% |
numberPartial | 571726 | 9.60% | 639262 | 533416 | 8.99% | 533317 | 8.99% |
numberSingleton | 562591 | 9.50% | 582266 | 604808 | 10.19% | 874432 | 14.74% |
numberRepeat | 14192 | 0.20% | 28261 | 87355 | 1.47% | 14207 | 0.24% |
numberOutlier | 181469 | 3.10% | 210630 | 144330 | 2.43% | 119124 | 2.01% |
numberTooShort | 63630 | 1.10% | 70194 | 66655 | 1.12% | 66655 | 1.12% |
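The percentages in the read-status rows can be reproduced from the counts. The sketch below takes the third read count given in each of those rows (4497111 assembled, 533416 partial, and so on) and assumes they all belong to the same assembly. Under that assumption the percentages turn out to be fractions of the sum of the six categories, which is slightly lower than the inputFileNumReads value of 5945235. This is an inference from the arithmetic, not something the source states.

```python
# Third read count from each read-status row of the Vulpes vulpes table.
counts = {
    "numberAssembled": 4497111,
    "numberPartial": 533416,
    "numberSingleton": 604808,
    "numberRepeat": 87355,
    "numberOutlier": 144330,
    "numberTooShort": 66655,
}

# Percentages printed next to those counts in the table.
reported = {
    "numberAssembled": 75.79,
    "numberPartial": 8.99,
    "numberSingleton": 10.19,
    "numberRepeat": 1.47,
    "numberOutlier": 2.43,
    "numberTooShort": 1.12,
}

total = sum(counts.values())
percent = {name: round(100 * n / total, 2) for name, n in counts.items()}

print(total)
for name in counts:
    print(f"{name}: computed {percent[name]:.2f}%, reported {reported[name]:.2f}%")
```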
Ziziphus celata (a plant) transcriptome
Metric | author’s assembly | our assembly |
total reads | 655337 | 655337 |
aligned reads | 443006 | |
assembled reads | 474025 | 388927 |
unassembled reads | 181312 | |
contigs | 84645 | |
avg contig length | 408 | |
all contigs (>500nt) | 24685 | |
large contigs (>1000nt) | 10072 | |
avg large contig length | 745 | |
largest contig | 5256 | |
isotigs | 22551 | |
avg isotig length | 615 | |
largest isotig | 14001 | |
isogroups | 19945 |
Phalaenopsis aphrodite (a plant) transcriptome
SRP005898, Plant & Cell Physiology |
Metric | author’s assembly | our assembly |
total reads | 3302528 | 3302765 |
aligned reads | 2944414 | |
assembled reads | 2676907 | |
singletons | 85144 | 284568 |
Contigs (>200nt, >500nt, respectively) | 34563 | 39867 |
avg contig length (>200nt) | 1194 | |
all contigs (>500nt) | 39867 | |
large contigs (>1000nt) | 20946 | |
avg large contig length | 1239 | |
largest contig | 11757 | |
isotigs | 30861 | |
avg isotig length | 1483 | |
largest isotig | 11757 | |
isogroups | 20578 |