Next Article in Journal
Parallel Variants of Broyden’s Method
Next Article in Special Issue
Finding Supported Paths in Heterogeneous Networks
Previous Article in Journal
A CS Recovery Algorithm for Model and Time Delay Identification of MISO-FIR Systems
Previous Article in Special Issue
On String Matching with Mismatches
Article

Modified Classical Graph Algorithms for the DNA Fragment Assembly Problem

1
Engineering Department, Universidad Iberoamericana Ciudad de México, Prolongación Paseo de la Reforma 880, Lomas de Santa Fe, Distrito Federal 01219, Mexico
2
Universidad Iberoamericana Ciudad de México, DCI, Prolongación Paseo de la Reforma 880, Lomas de Santa Fe, Distrito Federal 01219, Mexico
3
Physics and Mathematics Department, Universidad Iberoamericana Ciudad de México, Prolongación Paseo de la Reforma 880, Lomas de Santa Fe, Distrito Federal 01219, Mexico
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Academic Editors: Giuseppe Lancia and Alberto Policriti
Algorithms 2015, 8(3), 754-773; https://0-doi-org.brum.beds.ac.uk/10.3390/a8030754
Received: 24 June 2015 / Revised: 6 August 2015 / Accepted: 26 August 2015 / Published: 10 September 2015
(This article belongs to the Special Issue Algorithmic Themes in Bioinformatics)

Abstract

DNA fragment assembly represents an important challenge to the development of efficient and practical algorithms due to the large number of elements to be assembled. In this study, we present some graph theoretical linear time algorithms to solve the problem. To achieve linear time complexity, a heap with constant time operations was developed, for the special case where the edge weights are integers and do not depend on the problem size. The experiments presented show that modified classical graph theoretical algorithms can solve the DNA fragment assembly problem efficiently.
Keywords: DNA fragment assembly; minimum spanning tree; heap; linear complexity DNA fragment assembly; minimum spanning tree; heap; linear complexity

1. Introduction

Since its discovery by Watson and Crick [1], the importance of DNA to biology, medicine and human kind has been evident. However, we had to wait some time until it was possible to sequence the DNA bases, which was accomplished by Sanger in the mid-1970s [2], by detecting small dark bands on a thin gel layer. This method severely constrains the length of the DNA to be sequenced due to the limited resolution of the dark bands. In order to solve this problem, the DNA is first cut in known places by using restriction enzymes, which will divide the DNA at the point where a specific base sequence is found. The resulting sub-sequences, are divided again, this time using a different restriction enzyme, and this process is repeated until the resulting fragments can be divided in random places yielding sub-fragments that are small enough as to provide the complete sequence of bases using Sanger’s method. This process was slow and expensive because every time that a sequence was divided, each resulting subsequence had to be cloned in order to have enough material for the next division. This problem prompted scientists to start dividing sequences in random positions from the beginning, instead of doing it after several divisions, saving time and money [3]. This method, in turn, created a severe computational problem since there is no information about the order of the fragments. If one starts with many copies of the original problem and the derived fragments are sequenced, then each base of the original problem could appear in several fragments. The sum of the lengths of the fragments divided by the length of the original problem is known as its coverage. We hope that if we have a large enough coverage, the fragments will overlap in such a way that the sequence at the end of a fragment overlaps with the sequence at the beginning of another one. Hence, in theory, we should be able to use this information to sort the fragments and reconstruct the original DNA sequence.
What really happens is not very simple since, by chance, there are a huge number of false overlaps amongst the fragments. There is also the problem that fragment sequencing is not perfect. Two percent error is common, as well as having fragments that do not belong to the original problem, some due to contamination with DNA foreign to the problem, and others because of the appearance of chimeras from the process of DNA cloning, in which bacteria DNA is used, so that we could get fragments with portions of bacteria DNA. On top of what was said, we have repeats, which are DNA sections that appear tens or hundreds of times one after another (tandem repeats) or in far away parts of the original DNA (interspersed repeats) [4].
As far as we know, at this point there is no good probabilistic model of DNA. Simple, or relatively simple ones do not explain the repeats. For instance, in the real life problem that we will present later in this paper, there are sequences of A bases that are extremely long. Figure 1 show some examples of fragments that come from the sample problem that we will use in Section 4 of this paper and Table 1 shows the probability that it happens considering that among the total number of fragments, the A bases occurs with probability 0.3249. As can be seen from the figure, the number of times that the A chain occurs in the real data is much larger than the expected value according to a simple probabilistic model. Despite these anomalies and some others, some of the parameters that are used in the assembly of fragments, such as the minimum number of bases that an overlap must have to be considered important, have an empirical basis [5].
There is also the problem that in real situations, sections of the original DNA problem have no coverage since fragments are obtained by a random process and it is not possible guarantee that there are fragments from every part of the genome under study. This is the reason why the resulting assembly of all fragments is a set of contigs or subsequences and almost never the entire sequence of the problem. It is the job of specialized biologists to fill in the gaps and correct assembly errors made during the automated process.
Figure 1. Some fragments from the S. aureus strain MW2 problem.
Figure 1. Some fragments from the S. aureus strain MW2 problem.
Algorithms 08 00754 g001
Table 1. Observed frequency on n bases A vs. expected value for p = 0.3249 and 5,324,340 fragments.
Table 1. Observed frequency on n bases A vs. expected value for p = 0.3249 and 5,324,340 fragments.
nObservedProbabilityExpected Value
151010.0000004752.528889873
16740.0000001470.784225487
17650.0000000460.242640081
18510.0000000140.074884674
19404.32861E−090.023046972
20391.32807E−090.007071095
21184.06052E−100.002161959
22281.23662E−100.000658416
23143.74924E−110.000199622
24151.13089E−110.000060212
25223.3908E−120.000018054
26151.00958E−120.000005375
27222.9809E−130.000001587
28118.71281E−140.000000464
29152.51495E−140.000000134
30107.14487E−153.80417E−08
3191.98796E−151.05846E−08
3265.37563E−162.86217E−09
3341.3946E−167.42531E−10
3443.38757E−171.80366E−10
3517.29107E−183.88201E−11
DNA fragment sequencing technology has continued to advance while costs continue to go down. The original shotgun technology has been called “long reads” because the fragments would normally have more than 500 bases. Using new technologies, commonly known as Next Generation Sequencing, the cost to sequence a million bases is about $1 [6]. Unfortunately, the length of the fragments is shorter using this newer technology. Today, 150 bases reads are common with the Illumina sequencers. To compensate for the problems derived from the short size of each fragment, coverage of about 40 is needed instead of a coverage of about 10 in the case of long reads. If we consider the number of fragments to be sequenced with newer technologies, we need to assemble three to four million fragments to sequence a bacterium, while using long reads, only 50,000 fragments were enough. These are really bad news due to the combinatorial nature of the solutions to the problem.
Since the very beginning of DNA fragment sequencing, greedy algorithms have been used; for instance, we pick a random fragment and connect to it the one that has the longest overlap. This process is repeated until no overlapping fragments are available. Nowadays, this technique would be difficult to use, since fragments are too short, which often produces several different fragments with the same overlap size. Another popular technique is the use of De Bruijn graphs [7]. In order to build these graphs, all possible k-mers (subsequence of k consecutive bases) of each fragment are used to find all possible overlaps among them. Using these graphs, several problems can be solved by using heuristics; for instance, given a k-mer, there might be two parallel paths to other k-mer, creating what is called a bubble. The bubble is then popped by removing the shortest side or projecting one side onto the other when they have the same length. Short sequences that originate from long ones are eliminated. It is common to use a value of k close to 20 in order to obtain the k-mers.
An approach based on genetic algorithm optimization was suggested by Parsons in 1993 [8]. Later, other scientists applied similar metaheuristics [9,10,11,12], but their results were based on very small benchmarks, the largest had about 1000 fragments, and even though results were improving slowly, they were still far from real problems, including the long reading ones, with tens of thousands of fragments. In 2013, a reduction from fragment DNA sequencing to the Traveling Salesman Problem (TSP) was used [13]. Taking advantage of formal heuristics and algorithms for the TSP, optimal values were found for the most commonly used benchmarks, and, for the first time, a real life problem was solved using optimization.
The remainder of this paper is organized as follows: Section 2 provides the basic ideas on the use of graph theory for the solution of DNA sequencing problems; Section 3 explains those algorithms that are necessary to tackle the problem; Section 4 illustrates the use of our algorithms on real life problem benchmarks; and, in Section 5, we give our conclusions and future work.

2. Use of Graph Theory

2.1. Generalities

In reference [13], a graph theory approach was used by means of the TSP, where the solution obtained from the assembly of fragments is a series of contigs rather than a Hamiltonian Path. In TSP, all possible edges of the graph are considered, but when there is no possible connection between two nodes, it is usually represented using a very large weight to prevent it from showing up in the final solution. Applying TSP to DNA fragment assembly might be considered excessive, even though the expected results are obtained, since the fraction of real connections is very small. A graph theoretical approach using other algorithms might give better results.
The size of real life DNA fragment assembly problems is huge, with 35 base fragments and a coverage of 40, we get about 700 million fragments for human chromosome number 1, which has 245 million base pairs. It is known that the DNA fragment assembly problem is NP-hard (Non-deterministic Polynomial time hard), since it can be reduced from the shortest common superstring problem [14]; in practice, we must only use linear time algorithms, even if by doing so we sacrifice correctness and obtain only an approximate solution.

2.2. DNA Fragment Assembly as a Graph

The DNA fragment assembly problem can be transformed into a directed graph: we need to find a sequence of fragments where each one is always the prefix of the next one. We know that DNA comes in pairs of strands (double helix) that run in opposite directions, where in front of each base of one strand, there is a base of the reverse-complement on the other; for instance, an A in front of a T and a C in front of a G, which clearly means that the information contained in each one of the strands is the same. When DNA fragment overlaps are obtained, it is not known what strand they come from, i.e., every time that an overlap is computed, the strand and its reverse-complement must be considered. This means that a prefix of a normal overlap becomes a suffix of the reverse-complement. At first, it might seem that this case could be represented by a simple non-directed edge between the two fragments. However, this is not so if we consider that each node or fragment that is a reverse-complement must also be a reverse-complement for every edge that connects to it.
As a small example, we present a set of 13 fragments, Table 2, the associated overlap matrix, Table 3 and the resulting graph, Figure 2.
Table 2. Graph example: fragments.
Table 2. Graph example: fragments.
Fragment
number
Sequence
1GTGTACCACGTACTGATGTACTATTTGAAGCTTAT
2CCCAATTCCTAATGTACTATTTGAAGCTTATTCGG
3CATAAGCTTCATGATGAAGCTTATTCGGCCAATCG
4TTTGATTCCTGCTGATGTACTATTTGATGAAGCTT
5ATGTACTATTTGAAGCTTATTCGGCCAATCGTACT
6GAAGCTTATTCGGCCAATCGTACTGATGTACTATT
7CTTATTCGGCCAATCGTACTATTTACTGATGTACA
8TGATGAAGCTTATTCGGCCAATCGTACTGATGTAC
9GGCCAATCGTACTGATGTACTATTTGATGAAGCTT
10CTGATGTACTATTTGATGAAGCTTATTCGGCCAAT
11TGTACTATTTGATGAAGCTTATCAGTACGTGGAAC
12AATCGTACTGATGTACTATTTACTGATGTACAATA
13CTATTTACTGATGTACAATAGTACATCAGTAAAAA
Table 3. Graph example: overlaps matrix.
Table 3. Graph example: overlaps matrix.
12345678910111213
1 20
2 24
3 20 24
4 24
5 2020
6 24 20
7 17
8 20
9 2420
10 22
11
12 20
13
Figure 2. Graph example: graph.
Figure 2. Graph example: graph.
Algorithms 08 00754 g002

2.3. Objective Function

In the literature, using nature inspired algorithms to solve the DNA fragment assembly problem published after Parsons [8], the associated graph has been considered to be an undirected graph. Other literature considered bi-directed graphs, especially when De Bruijn graphs were proposed. It is common to use a Hamiltonian path, even though sometimes Eulerian paths have been used too [7]. In any case, it is not entirely clear how many times each one of the fragments should be used, since, from the physical point of view, it is feasible for a sequence to repeat in different sections of the DNA, and for identical fragments to come from different places. Thinking that a fragment and it reverse-complement are the same, creates too much complexity from the algorithmic point of view as well as from the programming one, and would only be justified as long as it was strictly required to consider Hamiltonian paths. Hence, we consider each reverse-complement fragment as an independent fragment. Another possibility might be to trust the huge cover that is customary in Next Generation Sequencing (NGS) and not use reverse-complements.
With respect to the objective function for the optimization, in many papers, the sum of the overlaps of each fragment with the next one is used, which has probably been inherited from the use of some greedy algorithms. In addition to this objective function, we propose to maximize the sum of the lengths of the contigs. This can be achieved by considering that each time that a new fragment is added at the end of a contig, its length is incremented by an amount equal to the length of the fragment minus the overlap length. So we would have the following objective functions:
max F 1 = i = 1 n 1 w ( i , i + 1 )
max F 2 = i = 1 n 1 ( L w ( i , i + 1 ) )
where w(i, i + 1) is the number of overlapped bases between fragment i and the next one in the sequence, n is the number of fragments, and L is the length of the fragment. Notice that Equation (2) can easily be transformed into a minimization:
min F 2 ' = i = 1 n 1 w ( i , i + 1 )
Because
F 2 = ( n 1 ) L F 2 '
The boundary between contigs appears because there are values of w(i, i + 1) that are equal to 0. Since there is usually some contamination with foreign DNA and there might be chimeras, some of the obtained contigs do not belong to the original DNA, and, at some point, which is not within the scope of this paper, useful contigs must be separated from the useless ones.
In order to obtain each contig, an agreement must be reached (computed) for each base, i.e., since we know that there might be redundancy for each base (due to the original cover) and that there might be sequencing errors in the fragments, it is possible for each base in each contig to have several versions, from which the most frequent value must be accepted, even if it is the minimum acceptable value. Since most of the errors in the contigs are at the ends, due to chemical problems in the sequencing equipment or because of the lack of enough redundancy, it might be necessary to cut them off.

3. Algorithms

3.1. Basic Algorithm for F1 and F2

For the objective function F1, we propose an algorithm similar to topological sorting. Let G = (V, E) be a directed acyclic graph (DAG) where V is the set of vertices, E is the set of edges and w(u, v) is the weight of edge u→v. In this graph, the longest path will go from a node that has no edge going in (initial vertex) and those nodes that are connected to it. Notice that there might be several “initial vertices”, and that there is no single privileged node. Let S V be the set of vertices with in-degree din(v) = 0 and let Q be a stack. The algorithm to determine the longest distance from a given vertex to each of the other vertices is the following (Algorithm 1):
Algorithm 1. Longest distance from initial vertices
1.  For each vertex v S with din(v) = 0
  1.1.  push(Q,v)
  1.2.  d(v):=0
  1.3.  origin(v):=v
2.  While Q is not empty
  2.1.  u:=pop(Q)
  2.2.  for each vertex v ( u , v ) E
   2.2.1.  din(v):=din(v)−1
   2.2.2.  if din(v)=0
    2.2.2.1.  push(Q,v)
    2.2.2.2.   d ( v ) : = max ( x , v ) E ( d ( x ) + w ( x , v )
    2.2.2.3.  father(v):=x
This algorithm only works if the graph is acyclic. In real life DNA fragment assembly problems, there are cycles. When this algorithm is applied to a graph with cycles, neither the cycles nor the nodes that are further from the cycles are detected. Later we will discuss how to take care of cycles.

3.2. Constant Time Heap

Aside from a modified version of the algorithm of Section 3.1, there are other ways to tackle the cycle problem. One way, not necessarily the best way, would be to build a minimum spanning tree, even if it ignores the direction of the edges. It is clear that if there are no undirected cycles in the equivalent undirected graph, there should be no directed cycles in its directed version. So we could use one of several algorithms like Prim [15] and Kruskal [16], both with time complexity in O(|E| log|V|) when using a Fibonacci Heap.
In order to achieve the linear time complexity that we require, we notice that in the case of the DNA fragment sequencing problem, we should use a heap that can insert and extract-min in constant time, thus improving the time complexity of other data structures where the extract-min operation is in O(log n).
Our heap requires that the ordering values are integers independent from the number of nodes in the graph. In the DNA fragment sequencing problem, the edges of the graph are the fragment overlaps, hence it is an integer value representing the number of bases, and its value is also bounded by 0 < w(u, v) < L, where L does not depend of the number of fragments, and has a value of at most a few hundred bases. The heap that we propose is based on a Stack Pi for each possible value to be introduced. The heap operations are:
Algorithm 2. Stack heap insert and extract min
insert x
  1. push(Px,x)
  2. if x<xmin
    2.1. xmin:=x
  3. if x > xmax
   3.1. xmax:=x
 
extract-min
  1. x:=pop(Pxmin)
  2. While the stack Pxmin is empty and xmin ≤ xmax
   2.1.xmin: = xmin + 1
In the extract-min operations shown, the value that is in the stack corresponding to xmin is extracted. If the stack is empty, the next non-empty stack smaller than the maximum value is used. When the number of stacks is large enough, it is useful to use some other kind of heap, such as a binary or a Fibonacci heap instead of sequential search. In the example that we provide in Section 4, we use no more than 34 stacks (0 ≤ x ≤ 34), hence sequential search turns out to be faster. The time complexity of the insert operation is constant because Steps 1 through 3.1 always require the same amount of time. In the extract-min operation, Step 1 is of constant time and Step 2, which is only executed when the stack with the minimum values is empty, in the worst case depends on the value xmax, and could be linear, in the case of sequential search, or logarithmic if a heap is used. In the case of DNA fragment assembly, millions of edges or nodes will be inserted into the heap, hence the value of xmax should be close to a few hundred, which makes the probability of Step 2 executing more than once negligible. In any case, since the time complexity of our heap is independent from the number of nodes or vertices inserted into it, then Prim’s algorithm [15] as well as Kruskal’s [16] algorithm become linear. Even though our technique is designed to work in the particular case of DNA fragment assembly, it is possible that it could also be used in other problems.

3.3. MST in Linear Time

As mentioned before, when using our heap, Algorithm 2, Prim’s algorithm as well as Kruskal’s algorithm become linear, and even though it is possible to implement operations such as delete-key and decrease-key in our heap, it is more convenient to allow the edge values to create a cycle and discard them when leaving the heap, since this is more efficient and is also of constant time. Let M(F, V) be the MST of graph G(E, V) and let S be the set of vertices that have been added to the MST at some point during the execution of Prim’s algorithm. Here is our version of Prim’s algorithm:
Algorithm 3. Prim’s algorithm
1. For some vertex u
 1.1. Put u in S
 1.2. For each ( u , v ) E
   1.2.1. insert w(u,v) in the heap
 
2. While the heap is not empty
 2.1. Extract the edge (u, v) from the heap
   2.1.1. if u S
    2.1.1.1. Put u in S and (u,v) in F
     2.1.1.1.1. For each ( u , v ) E insert w(u,x) in the heap
   2.1.2. if v S
    2.1.2.1. Put v in S and (v,u) in F
     2.1.2.1.1. For each ( v , x ) E insert w(v,x) in the heap
Since at most |E| edges will be inserted and extracted in constant time, Algorithm 3 will have a time complexity in O(|E|).
Now, our version of Kruskal’s algorithm (Algorithm 4):
Algorithm 4. Kruskal’s algorithm
1. Insert E in a heap in increasing order of w(u,v)
2. Create a forest F where each vertex is an independent tree
3. While the heap is not empty
  3.1. Extract (u,v) from heap
   3.1.1. If u and v belong to different trees, merge both trees
In order to merge the trees, we use union by rank with path compression [17]. The time complexity of Step 1 is O(|E|), Step 2 is O(|V|). Step 3 is repeated |E| times and every time the following operations are performed: extraction of the minimum value from the heap, a find of two nodes, and in the worst case, the union of two trees. Find is in O(1) and the union is O(log*|E|) [17], which for problems with 65,535 edges up to the maximum physical size of DNA fragment assembly problems is 5; therefore, in practice the algorithm is in O(1). The total running time of Kruskal’s algorithm using our heap, Algorithm 2, with union by rank and path compression, applied to the DNA fragment assembly problem is in O(|E| + |V|). Once the MST has been computed, its maximum distances are computed using Algorithm 1. In this case, there are no cycles (directed or undirected), so Algorithm 1 works correctly.

3.4. Modification of the Basic Algorithm

There are two kinds of cycles that cause trouble: cycles that are connected to a start node, and disconnected cycles. In both cases, there might be sets of nodes that are created like attachments to nodes of the cycle, but do not belong to them, and are not detected by Algorithm 1. In the case of disconnected cycles, we would need to take any node as a start node, which can require some trial and error since a randomly selected node could be attached to the cycle but not necessarily in the cycle itself. However, disconnected cycles in real life DNA fragment sequencing problems are rare, and when they appear, they are usually very small, two or three nodes, that lead to contigs that are too small and that would have been discarded anyway.
A cycle connected to a start node of the graph appears when there are nodes for which, even when paths that go from a start node to another node u in the cycle, the value of din(u) never goes to zero because at least one edge comes from the cycle. When Algorithm 1 is done, we can take one of the nodes that went through Step 2.2.1, but that remained with a final value of din(u) > 0, and insert it into the stack to continue with Step 2 of the algorithm. It is necessary, however, to mark those nodes so that they will not be considered again and to avoid the cycle being traversed more than once. The total distance from a start node to the last processed node in the cycle is the sum of the distance from a start point to the cycle entry point plus the length of the cycle, minus the edge that would close the cycle on the start node in the cycle. So we have that if the selected node is s, the previous node in the cycle is t, and the total length of the cycle is C then:
d ( t ) = d ( s ) + C w ( t , s )
Among the distinct entry nodes to the cycle, we could select one with the longest distance to a start node. However, it does not guarantee that we will always find the longest distance to node t since the value of w(t, s) changes with the selection of s and there could be another cycle entry node with a smaller value of d(s) but with a small enough value of w(t, s) to give a longest distance. In the case of the DNA fragment assembly problem the values of w( ) are in a very narrow range, most of the distances to the start node are big and it is uncommon for them to be similar, so that if our heuristic consists of selecting an s with the longest distance to the start point, we will have a high probability of finding the maximum distance. The modified algorithm (Algorithm 5) is as follows:
Algorithm 5. Modified maximum distance algorithm
1. For each v S vertex with din(v) = 0
  1.1. push(Q,v)
  1.2. d(v):=0
  1.3. origin(v):=v
2. While there are vertices v with din(v) > 0
  2.1. While Q is not empty
   2.1.1. u:=pop(Q)
   2.1.2. For every vertex v where ( u , v ) E and v is not marked
    2.1.2.1. din(v):=din(v)-1
    2.1.2.2. If din(v)=0
     2.1.2.2.1. push(Q,v)
     2.1.2.2.2.  d ( v ) : = max ( x , v ) E ( d ( x ) + w ( x , v )
     2.1.2.2.3. insert d(v) in a heap
     2.1.2.2.4. father(v):=x
     2.1.2.2.5. origin(v):=origin(x)
  2.2. If the heap is not empty
   2.2.1. extract the node v from heap
    2.2.1.1. If din(v)>0
     2.2.1.1.1. Make din(v)=0
     2.2.1.1.2. Insert v in the stack
     2.2.1.1.3. Mark the node v
     2.2.1.1.4. Go to step 2.1
Notice that Step 2.2.1.1 is required since node v could have gone through Step 2.1.1 and already be in a path. In this case, we can use the heap to keep the time complexity of the algorithm linear.

3.5. Assembly Algorithm

The output from Algorithms 1 and 4 is a sequence of directed edges and nodes that allow us to directly build the contigs. Considering reverse-complement fragments as independent of the original ones makes the job easier. The reconstruction of the sequence of DNA in the contig consists of placing the first fragment, and then adding the overlap-free portion of the next fragment. In our method, we do not need to reach a computed agreement as in other methods [18] since we do not differentiate among overlapped bases from one or other fragment. However, the ends of the contig might have sequencing errors and there is nothing to verify that there are no such errors. The solution is to cut off the ends in such a way that a minimum agreement is reached, i.e., that the bases must appear in at least a given number of fragments.

4. Experiments

In order to test our algorithms, we used several benchmarks that are available for use by researchers [19]. We selected the benchmarks of S. aureus MW2 with short reads. The benchmarks that we used have the following number of fragments: 2,278,504, 730,201, 298,194, 89,718 and 22,447. From now on, we will refer to each one of these problems by its number of fragments. In order to have something to compare our results to, we ran the Edena assembler using the same data, Table 4. This comparison has some limitations due to the fact that Edena uses a series of heuristics in order to refine its solution, by eliminating contigs that have no real meaning, while our results are those directly generated by our algorithms with no refinement. Refining the raw results from our algorithms might be part of our future research.
Table 4. Edena benchmark results.
Table 4. Edena benchmark results.
Problem
Fragments22,44889,718298,194730,2012,278,504
Contigs Obtained1862585758372046
Total Number of Bases26,284102,684314,120788,1982,526,278
Contigs FoundContigs Found1842565658231920
%98.999.298.398.393.8
Bases Found26,024102,249310,965777,5832,463,563
%9999.69998.797.5
N501881247291141325250
Average Length141.4399.4550.4944.81283.10
Minimum Length5252525252
Maximum Length880729510,98520,52120,579
Contigs not FoundContigs not Found221014126
%1.10.81.71.76.2
Bases not Found260435315510,61562,715
%10.411.32.5
N50NANANANA5599
Average Length130217.5315.5758.2497.7
Minimum Length9999565652
Maximum Length1613362521594118,158
From each benchmark, we took only the fragment file in FASTA format, and from it we computed the overlaps among all fragments considering that reverse-complement fragments are also independent fragments (In the original benchmarks, there is a file of overlaps where reverse-complements are not considered as independent fragments). To compute overlaps, we used a trie (prefix tree). The computed overlaps are all larger than 20 bases. From these overlaps other sets of test data were obtained: one discarding redundant overlaps because of transitivity, and the other one, computing the MST. In all of the cases, the minimum agreement in Algorithm 5 was 4. The contigs obtained in each experiment were searched in the complete genome [20].
In Experiment 1, we used all of the overlaps larger than 20 bases and executed Algorithm 1 for the two objective functions. The results are presented in Table 5, Table 6 and Table 7.
Table 5. Experiments summary.
Table 5. Experiments summary.
Problem
Algorithm Fragments22,44889,718298,194730,2012,278,504
EdenaEdenaContigs Found1842565658231920
%98.999.298.398.393.8
Bases Found26,024102,249310,965777,5832,463,563
%9999.69998.797.5
Algorithm 1F1 Objective Function, all Overlaps.Contigs Found74101149204480
%90.288.691.491.194.1
Bases Found15,82270,250206,826463,5821,102,152
%81.88386.183.983.4
Algorithm 1F2 Objective Function, all OverlapsContigs found74102146189450
%90.289.589.684.488.2
Bases found15,51570,843189,927369,001931,343
%81.583.87966.869.9
Algorithm 1F1 Objective Function, no Transitive overlapsContigs found74102149204481
%90.289.591.491.193.9
Bases found15,80671,652206,826462,7741,103,587
%83.784.686.183.982.7
Algorithm 1F2 Objective Function, no Transitive OverlapsContigs found75103142181417
%91.590.487.180.881.4
Bases found15,59972,245174,244320,502791,376
%8485.472.558.158.9
Algorithm 1F1 Objective Function, MSTContigs found3041827598015,44346,187
%919798.498.995.7
Bases found35,980256,641815,9942,081,3066,208,258
%89.896.698.198.793.9
Algorithm 5F1 Objective function, no Transitive EdgesContigs found74102150205480
%90.288.791.591.194.1
Bases found15,82270,318206,927463,6841,102,152
%81.88386.183.983.4
Table 6. Algorithm 1, F1 objective function, all overlaps.
Table 6. Algorithm 1, F1 objective function, all overlaps.
Problem
Fragments22,44889,718298,194730,2012,278,504
Contigs Obtained82114163224510
Total Number of Bases19,34384,659240,314552,4081,321,969
Contigs FoundContigs Found74101149204480
%90.288.691.491.194.1
Bases Found15,82270,250206,826463,5821,102,152
%81.88386.183.983.4
N503601614332649374499
Average Length213.8695.51388.102272.502296.20
Minimum Length5458546253
Maximum Length882726110,94322,74524,396
Contigs not FoundContigs not Found813142030
%9.811.48.68.95.9
Bases not Found352114,40933,48888,826219,817
%18.21713.916.116.6
N5075823083954725811,798
Average Length440.11108.402392.004441.307327.20
Minimum Length959595181255
Maximum Length82637036809954223,614
Table 7. Algorithm 1, F2 objective function, all overlaps.
Table 7. Algorithm 1, F2 objective function, all overlaps.
Problem
Fragments22,44889,718298,194730,2012,278,504
Contigs Obtained82114163224510
Total Number of Bases19,02884,584240,280552,5091,331,757
Contigs FoundContigs Found74102146189450
%90.289.589.684.488.2
Bases Found15,51570,843189,927369,001931,343
%81.583.87966.869.9
N503561527318442984201
Average Length209.7694.51300.901952.402069.70
Minimum Length5458545953
Maximum Length882726110,94321,18516,135
Contigs not FoundContigs not Found812173560
%9.810.510.415.611.8
Bases not Found351313,74150,353183,508400,414
%18.516.22133.230.1
N507582308633676209472
Average Length439.11145.102961.905243.106673.60
Minimum Length9596103181251
Maximum Length8263703762622,74524,396
From the DNA fragment assembly point of view, we are interested in large contigs with no errors. The most important elements to be computed are the number of bases and the contigs that were found in the complete genome, as well as the mean length of each contig. When we compare the results obtained by Algorithm 1 for F1 and F2 using all overlaps, we find that F1 is a little bit better than F2, except in the number of bases found for problem 89,718. In terms of multi objective function optimization, the solution obtained for F1 is not enough to dominate the solution obtained by F2, but since it is better in most of the test, we can say that it is almost dominant. In the other experiments, we find a similar situation, where F1 is almost dominant with respect to F2.
For Experiment 2, we excluded transitive overlaps and used Algorithm 1 for F1 and F2. The results are given in Table 5, Table 8 and Table 9.
Table 8. Algorithm 1, F1 objective function, no transitive overlaps.
Table 8. Algorithm 1, F1 objective function, no transitive overlaps.
Problem
Fragments22,44889,718298,194730,2012,278,504
Contigs Obtained82114163224512
Total Number of Bases18,87984,646240,314551,6001,333,881
Contigs FoundContigs Found74102149204481
%90.289.591.491.193.9
Bases Found15,80671,652206,826462,7741,103,587
%83.784.686.183.982.7
N503601527332649374499
Average Length213.6702.51388.102268.502294.40
Minimum Length5458546253
Maximum Length882726110,94322,74524,396
Contigs not FoundContigs not Found812142031
%9.810.58.68.96.1
Bases not Found307312,99433,48888,826230,294
%16.315.413.916.117.3
N5053923083954725811,657
Average Length384.11082.802392.004441.307428.80
Minimum Length959595181255
Maximum Length82637036809954223,614
In general, we found that F1 is better than F2 without dominating it. Comparing the results where transitive overlaps are excluded (Table 8 and Table 9 or Table 5) to those where they are not (Table 6 and Table 7 or Table 5), we can see that the results are similar for F1, hence it is not clear whether it is better to exclude or not to exclude transitive overlaps. There is, however, an obvious advantage of excluding transitive overlaps, which is that the problem size is reduced and requires less computational resources. So, our recommendation is to exclude transitive overlaps.
Comparing our results so far to Edena, we find out that we obtained fewer correct bases and more contigs that are not in the genome, but the length of our contigs is better.
Table 9. Algorithm 1, F2 objective function, no transitive overlaps.
Table 9. Algorithm 1, F2 objective function, no transitive overlaps.
Problem
Fragments22,44889,718298,194730,2012,278,504
Contigs Obtained82114163224512
Total Number of Bases18,57984,571240,265551,7011,343,637
Contigs FoundContigs Found75103142181417
%91.590.487.180.881.4
Bases Found15,59972,245174,244320,502791,376
%8485.472.558.158.9
N503561527298439104135
Average Length208701.41227.101770.701897.80
Minimum Length5458545953
Maximum Length8827,26110,94310,64816,135
Contigs not FoundContigs not Found711214395
%8.59.612.919.218.6
Bases not Found298012,32666,021231,199552,261
%1614.627.541.941.1
N505392308633681357554
Average Length425.71120.503143.905376.705813.30
Minimum Length9596103181251
Maximum Length826370310,64822,74524,396
Table 10. Algorithm 1, F1 objective function, Minimum Spanning Tree (MST).
Table 10. Algorithm 1, F1 objective function, Minimum Spanning Tree (MST).
Problem
Fragments22,44889,718298,194730,2012,278,504
Contigs Obtained3341883608015,61848,263
Total Number of Bases40,076265,584831,4492,109,0706,613,747
Contigs FoundContigs Found3041827598015,44346,187
%919798.498.995.7
Bases Found35,980256,641815,9942,081,3066,208,258
%89.896.698.198.793.9
N50135165154153153
Average Length118.4140.5136.5134.8134.4
Minimum Length6262615757
Maximum Length4466607478941152
Contigs not FundContigs not Found30561001752076
%931.61.14.3
Bases not Found4096894315,45527,764405,489
%10.23.41.91.36.1
N50154199198194245
Average Length136.5159.7154.6158.7195.3
Minimum Length6465656463
Maximum Length3624814814931130
In Experiment 3, we obtained the Minimum Spanning Tree (MST) from all overlaps using our version of Kruskal’s algorithm. We only show the results for F1 (Table 5 and Table 10). In this case, we were able to find much more bases in the genome than in previous experiments, including Edena, but the size of our contigs is small. There appears to be a tradeoff between the number of bases that can be found and contig length.
In other experiments, we used Algorithm 4 without transitive overlaps, obtaining similar results to those of Algorithm 1 for F1 (F1 was again better than F2). Analyzing the contigs obtained by the algorithm, we found that if the contig is split at the edge that joins the cycle with the path, we have a better chance of finding both contigs. Therefore, we modified Algorithm 4 making d(v) = 0 in Step 2.2.1.1 and repeated Experiment 4 without transitive overlaps. In this case, the objective function F1 was better than F2 without being dominant. These results are given in Table 5 and Table 11.
Table 11. Algorithm 5, F1 objective function, no transitive edges.
Table 11. Algorithm 5, F1 objective function, no transitive edges.
Problem
Fragments22,44889,718298,194730,2012,278,504
Contigs Obtained82115164225510
Total Number of Bases19,34384,727240,415552,5101,321,969
Contigs FoundContigs Found74102150205480
%90.288.791.591.194.1
Bases Found15,82270,318206,927463,6841,102,152
%81.88386.183.983.4
N503601614332649374499
Average Length213.8689.41379.502261.902296.20
Minimum Length5458546253
Maximum Length882726110,94322,74524,369
Contigs not FoundContigs not Fund813142030
%9.811.38.58.95.9
Bases not Found352114,40933,48888,826219,817
%18.21713.916.1
N5075823083954725811,798
Average Length440.11108.402392.004441.307327.20
Minimum Length959595181255
Maximum Length82637036809954223,614
In an analysis of those contigs that were not found in the genome, we considered the possibility of splitting them in two or more pieces hoping that all of them could be found. There are few cases in which contigs should be divided; in problem 730,201, using the modified Algorithm 4 with F1, there were 513 contigs from which 414 were found directly, 76 were split in two to be found, 14 in three, 10 in four or more pieces, and two produced pieces that were too small to be considered. With appropriate rules, it should be possible to find the correct split points in most of the cases, as other assemblers do. We also analyzed residual paths, after removing paths of maximum length, from which we considered the possibility of recovering several long contings, which we will do in the future.

5. Conclusions and Future Work

In order to assemble DNA, it is possible to use traditional graph theory concepts being careful to use only linear algorithms due to the large amount of data that is handled. With this idea, we developed a constant time heap, appropriate for this particular case (and maybe to other cases as well), which allowed us to reduce the time complexity of other algorithms, such as Prim’s and Kruskal’s, making them linear.
If we compare the results of the objective functions F1 and F2, even though one does not dominate over the other, better results are obtained using F1, hence we recommend its use. We also recommend the removal of transitive overlaps because of resource reduction considerations, even though the results of the assembly are almost the same for function F1. The use of the MST considerably improves the number of correct bases, but the contigs obtained in this way are too small. It can be seen from the experiments that there is a tradeoff between contig length and the number of bases detected. Comparing our results to those of Edena is not completely objective since we do not apply heuristics after the algorithms are executed in order to refine the solutions, but Edena (as well as the Velvet assembler) do so intensively. In any case, compared to Edena, our method produced, in some cases, less bases but longer contigs, while, in other cases, we found more bases, but shorter contigs, especially when using MST. Therefore, the problem must be considered as a multi objective optimization, with an objective function that is able to maximize the number of mean contig length, giving the user a Pareto Front from which he can obtain the most convenient solution according to his particular criteria.
Our future work includes:
  • Developing rules to split contigs in such a way that most of the pieces can be found.
  • Attempting to recover other contigs after extracting the longest ones.
  • Implementing a solution to DNA fragment assembly as a multi objective optimization problem.

Acknowledgments

This work has been developed with the support of Universidad Iberoamericana.

Author Contributions

Guillermo Mallén-Fullerton started the idea and developed the algorithms. Emilio Quiroz-Ibarra developed test programs and provided helpful observations about the implementation problems associated with some of the algorithms. Antonio Miranda and Guillermo Fernández-Anaya provided algorithms complexity analysis and observations to improve it. Guillermo Fernández-Anaya collaborated in the experiments design and interpretation and Antonio Miranda wrote the English version of the article. All authors approved the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Watson, J.D.; Crick, F.H. Molecular structure of nucleic acids. Nature 1953, 171, 737–738. [Google Scholar] [CrossRef] [PubMed]
  2. Sanger, F.; Nicklen, S.; Coulson, A.R. DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. 1977, 74, 5463–5467. [Google Scholar] [CrossRef] [PubMed]
  3. Staden, R. A strategy of DNA sequencing employing computer programs. Nucl. Acid. Res. 1979, 6, 2601–2610. [Google Scholar] [CrossRef]
  4. Van Belkum, A.; Scherer, S.; van Alphen, L.; Verbrugh, H. Short-sequence DNA repeats in prokaryotic genomes. Microbiol. Mol. Biol. Rev. 1998, 62, 275–293. [Google Scholar]
  5. Salzberg, S.L.; Phillippy, A.M.; Zimin, A.; Puiu, D.; Magoc, T.; Koren, S.; Yorke, J.A. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Gen. Res. 2012, 22, 557–567. [Google Scholar] [CrossRef] [PubMed]
  6. Shendure, J.; Ji, H. Next-generation DNA sequencing. Nat. Biotechnol. 2008, 26, 1135–1145. [Google Scholar] [CrossRef] [PubMed]
  7. Pevzner, P.A.; Tang, H.; Waterman, M.S. An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. 2001, 98, 9748–9753. [Google Scholar] [CrossRef] [PubMed]
  8. Parsons, R.J.; Forrest, S.; Burks, C. Genetic algorithms for DNA sequence assembly. In Proceedings of the First International Conference on Intelligent Systems for Molecular Biology (ISMB), Bethesda, MD, USA, 6–9 July 1993; pp. 310–318.
  9. Krause, J.; Cordeiro, J.; Parpinelli, R.S.; Lopes, H.S. A Survey of Swarm Algorithms Applied to Discrete Optimization Problems. In Swarm Intelligence and Bio-inspired Computation: Theory and Applications; Elsevier Science Publishers: Amsterdam, The Netherlands, 2013. [Google Scholar]
  10. Alba, E.; Luque, G. A new local search algorithm for the DNA fragment assembly problem. In Evolutionary Computation in Combinatorial Optimization; Springer: Berlin/Heidelberg, Germany, 2007; pp. 1–12. [Google Scholar]
  11. Luque, G.; Alba, E. Metaheuristics for the DNA fragment assembly problem. Int. J. Comput. Intel. Res. 2005, 1, 98–108. [Google Scholar] [CrossRef]
  12. Firoz, J.S.; Rahman, M.S.; Saha, T.K. Bee algorithms for solving DNA fragment assembly problem with noisy and noiseless data. In Proceedings of the 14th ACM Annual Conference on Genetic and Evolutionary Computation, Philadelphia, PA, USA, 7–11 July 2012; pp. 201–208.
  13. Mallen-Fullerton, G.M.; Fernandez-Anaya, G. DNA fragment assembly using optimization. In Proceedings of the 2013 IEEE Congress on Evolutionary Computation (CEC), Cancun, Mexico, 20–23 June 2013; pp. 1570–1577.
  14. Gallant, J.; Maier, D.; Astorer, J. On finding minimal length superstrings. J. Comput. Syst. Sci. 1980, 20, 50–58. [Google Scholar] [CrossRef]
  15. Prim, R.C. Shortest connection networks and some generalizations. Bell Syst. Tech. J. 1957, 36, 1389–1401. [Google Scholar] [CrossRef]
  16. Kruskal, J.B. On the shortest spanning subtree of a graph and the traveling salesman problem. Proc. Am. Math. Soc. 1956, 7, 48–50. [Google Scholar] [CrossRef]
  17. Hopcroft, J.E.; Ullman, J.D. Set merging algorithms. SIAM J. Comput. 1973, 2, 294–303. [Google Scholar] [CrossRef]
  18. Bonfield, J.K.; Smith, K.; Staden, R. A new DNA sequence assembly program. Nucl. Acid. Res. 1995, 23, 4992–4999. [Google Scholar] [CrossRef]
  19. Mallén-Fullerton, G.M.; Hughes, J.A.; Houghten, S.; Fernández-Anaya, G. Benchmark datasets for the DNA fragment assembly problem. Int. J. Bio-Inspir. Comput. 2013, 5, 384–394. [Google Scholar] [CrossRef]
  20. Staphylococcus aureus subsp. aureus MW2 DNA, complete genome, GenBank: BA000033.2. Available online: http://0-www.ncbi.nlm.nih.gov.brum.beds.ac.uk/nuccore/47118312?report=fasta (accessed on 3 June 2015).
Back to TopTop