Ongoing analysis of more than 1800 Roche/454-based sequencing datasets has produced a collection of 12045 sequences representing the wide range of adapters, primers, and MIDs used to date, along with their artifacts. The sequences are grouped by the laboratory protocols that give rise to them. It turned out that at least 35 different laboratory protocol scenarios, and about as many sample preparation approaches, have been used in conjunction with 454 technology. Thorough analysis of the data, combined with a deep understanding of the molecular biology techniques involved in sample preparation, led us to conclude that most likely all current large-scale next-generation sequencing approaches are at times highly error- and artifact-prone. Users often refer to “failed sequencing runs”, and these failed setups are quite interesting to analyze.
In the end, it was surprising that one could arrive at a hopefully definitive list of query sequences to filter out all of these contaminating sequences, and that the computations finish within a reasonable time. The software pipeline creates the final query sequences specifically for each laboratory protocol, making the searches as sensitive as possible while keeping the speed reasonable. The dynamic construction of queries is a unique feature of our software pipeline (unpublished). We offer data cleanup and quality control (QC) services for any 454-based data. In some cases, IonTorrent/IonProton and Illumina datasets can also be processed (notably transcriptomic datasets based on Clontech/Evrogen protocols, or even just IonTorrent cDNA preps).
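The pipeline itself is unpublished, so the following is only a minimal sketch of what protocol-specific query construction could look like, not the actual implementation. It assumes that a protocol is described by a set of adapters and a set of MIDs, and that a query is an adapter followed by a MID, searched in both orientations. The function names `reverse_complement` and `build_queries`, the protocol layout, and all sequences shown are hypothetical placeholders.

```python
from itertools import product

# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_queries(adapters, mids):
    """Combine every adapter with every MID into named query sequences.

    Assumption for this sketch: a query is adapter + MID, and each query
    is also emitted as its reverse complement so both strands are covered.
    """
    queries = {}
    for (a_name, a_seq), (m_name, m_seq) in product(adapters.items(), mids.items()):
        forward = a_seq + m_seq
        queries[f"{a_name}+{m_name}"] = forward
        queries[f"{a_name}+{m_name}_rc"] = reverse_complement(forward)
    return queries


# Hypothetical protocol definition; these are placeholder sequences,
# not real adapter or MID sequences.
protocol = {
    "adapters": {"adapterA": "ACGTAC"},
    "mids": {"MID1": "GGA", "MID2": "TTC"},
}

queries = build_queries(protocol["adapters"], protocol["mids"])
for name, seq in queries.items():
    print(name, seq)
```

Generating queries only from the components a given protocol actually uses keeps the query set small, which is one plausible way to balance sensitivity against run time.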
The current research focus is to extend the tool's coverage to other sequencing technologies. In addition, work is under way to publish corrected assemblies of several third-party datasets, taking advantage of the properly cleaned raw data.