Norwich, United Kingdom
February 10, 2015
The high-performance computing workflow system, ‘RAMPART’, enables researchers to design and execute their own assembly workflows using a set of third-party open-source bioinformatics tools to provide improved results for their particular genome assembly projects.
De novo genome assembly is the process of reconstructing full-length chromosomes from the shorter genomic fragments produced by sequencing devices. The process is often likened to putting together a multi-million-piece jigsaw puzzle. Assembly algorithms use the overlaps between sequenced reads (jigsaw pieces) to reconstruct longer genomic sequences (larger chunks of the final picture). This is a common task in the study of many non-model organisms, because longer sequences allow scientists to carry out many downstream analyses, such as identifying genes or making comparisons with other individuals or organisms.
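To make the jigsaw analogy concrete, the following is a minimal illustrative sketch in Python of a greedy overlap merge on a handful of toy reads. It is not how RAMPART or any of the tools it coordinates is implemented; the function names, the `min_overlap` parameter and the example reads are all assumptions made for illustration, and real assemblers must also cope with sequencing errors, repeats and reverse-complement strands.

```python
def overlap_length(left, right, min_overlap=3):
    """Return the length of the longest suffix of `left` that is also a
    prefix of `right`, or 0 if it is shorter than `min_overlap`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the largest overlap
    until no pair overlaps by at least `min_overlap` bases."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical toy reads sampled from the sequence ATGGCGTGCAATC.
toy_reads = ["ATGGCGT", "GCGTGCA", "TGCAATC"]
print(greedy_assemble(toy_reads))  # ['ATGGCGTGCAATC']
```

Each merge joins two pieces that share an edge, so the three short reads collapse into a single longer sequence. Reads with no sufficient overlap are simply left as separate pieces.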
The de novo genome assembly process is a complex task and typically involves testing multiple tools, parameters and approaches to produce the best possible assembly of the available data. This is because it is not always known beforehand which tools and settings will work best on the available sequence data given the organism’s specific genomic properties, such as genome size, ploidy (number of sets of chromosomes in the nucleus of a cell) and the composition of repetitive genomic content. Despite advances in computing hardware, algorithms and sequencing technologies, de novo assembly, particularly for more complex eukaryotic genomes, remains a challenging task.
Recently, several tools, such as ‘iMetAMOS’ and ‘A5’, have approached this problem by exhaustively testing many tools in parallel and then identifying and selecting the best assembly. However, these pipelines focus on bacterial genomes, which are smaller and generally easier to assemble, so the computational demands are more manageable. Larger, more complex genomes require more computational power, which makes exhaustive testing of all tools and parameters impractical on current computing hardware. For these projects, the user must rely on the literature and their own experience to decide which possibilities are worth considering.
The new workflow system, RAMPART, whose development was led by Daniel Mapleson at TGAC, allows the user to design and execute their own assembly workflows using a set of third-party open-source bioinformatics tools. This reduces human error and relieves the burden of organising data files and executing tools manually, frequently helping to produce better assemblies more efficiently.
RAMPART gives the user the freedom to compare tools and parameters and to identify the effect these have on the given data sets. The flexibility to build a custom workflow enables the user to tackle even complex assembly projects, tailoring the amount of work to be done to the availability of computing resources, the quantity of sequence data and the complexity of the genome. In addition, RAMPART produces logs, metrics and reports throughout the workflow, which allow users to identify, and subsequently rectify, problems that may occur.
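As an illustration of how assemblies produced with different tools and settings might be compared by a metric, the sketch below ranks hypothetical assemblies by N50, a widely used contiguity statistic that needs no reference genome. The text above does not say which metrics RAMPART reports, so the choice of N50, the function names and the example contig lengths are assumptions made purely for illustration. Contiguity alone says nothing about correctness.

```python
def n50(contig_lengths):
    """Return the N50: the length of the contig at which the running total
    of lengths, taken from longest to shortest, first reaches half of the
    total assembly size."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


def rank_assemblies(assemblies):
    """Return assembly names ordered from highest to lowest N50."""
    return sorted(assemblies, key=lambda name: n50(assemblies[name]), reverse=True)


# Hypothetical contig lengths for two tool/parameter combinations.
candidate_assemblies = {
    "tool_a_k31": [100, 50, 50],
    "tool_b_k51": [150, 30, 20],
}
print(rank_assemblies(candidate_assemblies))  # ['tool_b_k51', 'tool_a_k31']
```

Both hypothetical assemblies have the same total size here, but the second concentrates more of it in one long contig and so ranks higher on this particular measure.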
Daniel Mapleson, Analysis Pipelines Project Leader in the Regulatory & Environmental Genomics group at TGAC, said: “RAMPART helps us to speed up de novo genome assembly projects by helping to coordinate available sequence data and pass it through a multi-stage pipeline, comprising of existing tried and trusted assembly tools. It provides a mechanism to systematically analyse the quality of multiple assemblies produced using various tools and settings, without requiring a reference genome sequence, making it particularly suitable for genome assembly projects of non-model organisms. This software should help to reduce the costs of producing high-quality draft genome assemblies in the future.”
The scientific paper, titled “RAMPART: a workflow management system for de novo genome assembly”, is published in Oxford Journals Bioinformatics.