Efficient Combinatorial Algorithms for Problems in Sequence and Self Assembly

Date of Completion

January 2011


Applied Mathematics|Biology, Bioinformatics|Computer Science




In this thesis we present efficient algorithms for several combinatorial problems arising in sequence assembly and self assembly. Sequence assembly is a major phase in uncovering the genomic sequence of an organism, and it gives rise to several combinatorial problems on bi-directed de Bruijn graphs. Existing algorithms to build and operate on these graphs do not scale with the ever-increasing volume of sequence assembly data. In this thesis we close this gap by providing efficient algorithms to build and operate on bi-directed de Bruijn graphs. We first show how a bi-directed de Bruijn graph can be constructed optimally in Θ(n) time, in contrast to the existing Θ(n log(G)) algorithm [ZB08]; here n is the input size and G is the size of the genome. This algorithm is also I/O optimal and requires Θ(log(n/M) / log(M/B)) I/Os to build the graph; here M is the main memory size and B is the block size. Secondly, we show that the Chinese Postman walk Problem (CPP) on a bi-directed graph can be solved without reducing it to a bi-directed flow problem. The existing bi-directed flow based algorithm [MGMB07] for solving the CPP on a bi-directed graph G(V, E) takes O(|E|² log²(|V|)) time. We improve this to Θ(p(|V| + |E|) log(|V|) + (d_max p)³), where p = max{|{ν : d_in(ν) − d_out(ν) > 0}|, |{ν : d_in(ν) − d_out(ν) < 0}|} and d_max = max{|d_in(ν) − d_out(ν)|}. This algorithm performs asymptotically better than the bi-directed flow algorithm when the number of imbalanced nodes p is much smaller than the number of nodes in the bi-directed graph.

On the other hand, self assembly systems have numerous critical applications in medicine and circuit design. Theoretical modeling of self assembly is very useful before performing self assembly experiments. Algorithmic self assembly studies the efficiency of self assembly systems on an abstract two-dimensional (2D) tile assembly model (TAM). The theory behind TAM is based on Wang's tiling technique, and TAM has the power to simulate a Turing machine.
Algorithms with an optimal tile complexity of Θ(log N / log log N) were proposed earlier to uniquely self assemble an N × N square (with a temperature of α = 2) on TAM. However, efficient algorithms (tile set constructions) to assemble arbitrary shapes on TAM are not known, and the problem has remained open. In this thesis we try to bridge this gap by presenting algorithms that can self assemble some regular polygons with a tile complexity of Θ(log(N)); here N is the area of the underlying polygon. In a deterministic self assembly model such as TAM, it has been proven that the lower bound on the tile complexity of self assembling any shape is Ω(log N / log log N) (inferred from the Kolmogorov complexity); here N is the area of the underlying shape. However, designing even Θ(log N / log log N) unique tiles specific to a shape that needs to be self assembled is still an intensive task. Creating a copy of a tile is much simpler than creating a unique tile. With this constraint in mind, probabilistic tile assembly models (PTAM) were introduced; these models are also referred to as concentration programming models or randomized self assembly models. These systems have O(1) tile complexity, and the concentration of each tile can be varied to produce the desired shape. Existing algorithms on PTAM [KS08] [Dot09] suffer from large underlying constants because they all adopt sub-tiles that perform binary arithmetic. In contrast, in this thesis we show that it is possible to self assemble rectilinear shapes on PTAM without using any sub-tiles that perform binary arithmetic. We introduce a technique called staircase sampling, which can self assemble squares, rectangles, and rectangles with constant aspect ratio with high probability (i.e. Ω(1 − 1/n^α), for any fixed α > 0); here n is the dimension of the shape to be self assembled.
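
The running time of the Chinese Postman algorithm described earlier depends on two imbalance quantities, p and d_max. As a minimal sketch of how they could be computed, the Python fragment below represents each bi-directed edge as a tuple (u, end_u, v, end_v), where each end is marked "in" or "out" depending on whether the edge is oriented into or out of that endpoint. This edge encoding and the function name imbalance_stats are assumptions made for illustration only; they are not taken from the thesis, and the fragment computes only the two parameters, not the postman walk itself.

```python
from collections import defaultdict


def imbalance_stats(edges):
    """Return (p, d_max) for a bi-directed graph.

    edges: iterable of (u, end_u, v, end_v); each end is "in" or "out"
    (hypothetical encoding of the edge orientation at each endpoint).
    """
    d_in = defaultdict(int)
    d_out = defaultdict(int)
    for u, end_u, v, end_v in edges:
        for node, end in ((u, end_u), (v, end_v)):
            if end == "in":
                d_in[node] += 1
            elif end == "out":
                d_out[node] += 1
            else:
                raise ValueError(f"unknown orientation: {end!r}")

    nodes = set(d_in) | set(d_out)
    # Signed imbalance d_in(v) - d_out(v) for every node that has an edge.
    diff = {x: d_in[x] - d_out[x] for x in nodes}

    surplus = sum(1 for d in diff.values() if d > 0)  # d_in > d_out
    deficit = sum(1 for d in diff.values() if d < 0)  # d_in < d_out
    p = max(surplus, deficit)
    d_max = max((abs(d) for d in diff.values()), default=0)
    return p, d_max


if __name__ == "__main__":
    example = [
        ("a", "out", "b", "in"),
        ("a", "out", "c", "in"),
        ("b", "out", "c", "in"),
    ]
    print(imbalance_stats(example))
```

In the example, node a has imbalance −2, node b is balanced, and node c has imbalance +2, so p = 1 and d_max = 2. When p stays small relative to |V|, the first term of the stated bound is what dominates the cost.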