## Date of Completion

4-1-2016

## Embargo Period

3-31-2016

## Keywords

planted motif search, suffix arrays, pattern matching with k mismatches

## Major Advisor

Prof. Sanguthevar Rajasekaran

## Associate Advisor

Prof. Ion Mandoiu

## Associate Advisor

Prof. Yufeng Wu

## Field of Study

Computer Science and Engineering

## Degree

Doctor of Philosophy

## Open Access

Open Access

## Abstract

This thesis studies the following problems:

1. Planted Motif Search. Discovering patterns in biological sequences is a crucial process that has resulted in the determination of open reading frames, gene promoter elements, intron/exon splicing sites, SH RNAs, etc. We study the $(l, d)$ motif search problem, also known as Planted Motif Search (PMS). PMS receives as input $n$ strings and two integers $l$ and $d$. It returns all sequences $M$ of length $l$ that occur in each input string, where each occurrence differs from $M$ in at most $d$ positions. Another formulation is quorum PMS (qPMS), where $M$ is required to appear in at least $q\%$ of the strings. We developed qPMS9, an efficient parallel exact qPMS algorithm for DNA and protein datasets.

2. Suffix Array Construction. The suffix array is a data structure that finds numerous applications in string processing problems for both linguistic texts and biological data. The suffix array consists of the sorted suffixes of a string. There are several linear-time suffix array construction algorithms known in the literature. However, one of the fastest algorithms in practice has a worst-case run time of $O(n^2)$. We developed an efficient algorithm called RadixSA that has a worst-case run time of $O(n \log n)$ and is one of the fastest algorithms to date. RadixSA introduces an idea that may find independent applications as a speedup technique for other algorithms.

3. Pattern Matching with Mismatches. We consider several variants of the pattern matching with mismatches problem. Given a text $T = t_1 t_2 \cdots t_n$ and a pattern $P = p_1 p_2 \cdots p_m$, we investigate the following problems: 1) Pattern matching with mismatches: for every alignment $i$, $1 \le i \le n - m + 1$, output the distance between $P$ and $t_i t_{i+1} \cdots t_{i+m-1}$, and 2) Pattern matching with $k$ mismatches: output those alignments $i$ where the distance is at most $k$. The distance metric used is the Hamming distance. Variants of these problems allow for wild cards in the text or the pattern. For these problems we offer novel deterministic, randomized and approximation algorithms.
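To make the three problem statements above concrete, the following is a minimal brute-force sketch in Python. It is not qPMS9, RadixSA, or any algorithm from the thesis; it only restates the definitions as naive reference code. All function names are hypothetical, alignments are 0-based (the abstract uses 1-based indices), and the default DNA alphabet is an assumption made for illustration.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def mismatch_distances(text, pattern):
    """Pattern matching with mismatches: Hamming distance between the
    pattern and the text window at every alignment (0-based). O(n*m)."""
    m = len(pattern)
    return [hamming(text[i:i + m], pattern)
            for i in range(len(text) - m + 1)]


def k_mismatch_alignments(text, pattern, k):
    """Pattern matching with k mismatches: alignments with distance <= k."""
    return [i for i, dist in enumerate(mismatch_distances(text, pattern))
            if dist <= k]


def naive_suffix_array(s):
    """Suffix array by direct comparison sorting of all suffixes.
    Simple but slow; shown only to illustrate what the array contains."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def occurs_within(motif, s, d):
    """True if some length-l window of s is within Hamming distance d."""
    l = len(motif)
    return any(hamming(motif, s[i:i + l]) <= d
               for i in range(len(s) - l + 1))


def brute_force_pms(strings, l, d, alphabet="ACGT"):
    """(l, d) motif search by enumerating all |alphabet|^l candidates.
    Exponential in l, so only usable on toy inputs."""
    motifs = []
    for candidate in product(alphabet, repeat=l):
        motif = "".join(candidate)
        if all(occurs_within(motif, s, d) for s in strings):
            motifs.append(motif)
    return motifs
```

For example, `naive_suffix_array("banana")` yields the starting positions of the suffixes in sorted order, and `k_mismatch_alignments("abcabd", "abd", 1)` yields the alignments where the pattern matches with at most one mismatch. The quorum variant (qPMS) would replace the `all(...)` test with a count over the input strings compared against the quorum threshold.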

Source code relevant to these results is available at https://github.com/mariusmni/.

## Recommended Citation

Nicolae, Marius, "Data Structures and Algorithms for the Identification of Biological Patterns" (2016). *Doctoral Dissertations*. 1044.

https://digitalcommons.lib.uconn.edu/dissertations/1044