Date of Completion
7-19-2013
Embargo Period
7-19-2013
Major Advisor
Sanguthevar Rajasekaran
Associate Advisor
Ion Mandoiu
Associate Advisor
Yufeng Wu
Associate Advisor
Reda Ammar
Field of Study
Computer Science and Engineering
Degree
Doctor of Philosophy
Open Access
Campus Access
Abstract
The rapid growth of data in bioinformatics and biomedical informatics brings new challenges to these areas. In this thesis, we present efficient computational algorithms for big data processing in data integration and motif search.
Data integration, or record linkage, is the problem of identifying records that pertain to the same entity but reside in different data sources, in the absence of a global identifier. For instance, different healthcare providers may each hold records for the same individual. Several algorithms in the literature are adept at integrating records from two datasets, but their limitations show up when there are more than two data sources, which in practice is the common case. We propose efficient algorithms based on hierarchical clustering to handle massive data from multiple sources.
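As an illustration of linking records by hierarchical clustering, the following is a minimal sketch and not the thesis's algorithm. It assumes each record is a single string, uses edit distance as the dissimilarity, and merges clusters by single linkage until the closest pair is farther apart than a threshold. The names `edit_distance`, `single_link`, `link_records`, and `threshold` are hypothetical, and the naive merge loop is far too slow for massive data.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def single_link(c1, c2):
    """Single-linkage distance: the closest pair across two clusters."""
    return min(edit_distance(a, b) for a in c1 for b in c2)


def link_records(records, threshold):
    """Agglomerative clustering; each final cluster is one presumed entity."""
    clusters = [[r] for r in records]
    while len(clusters) > 1:
        # Find the closest pair of clusters (ties broken by index).
        d, i, j = min((single_link(clusters[i], clusters[j]), i, j)
                      for i, j in combinations(range(len(clusters)), 2))
        if d > threshold:
            break  # remaining clusters are too dissimilar to merge
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters
```

Because records from all sources are pooled before clustering, the same procedure applies to two sources or to many. The choice of distance and threshold determines which records end up linked.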
In motif prediction, minimotifs (also called Short Linear Motifs) are short contiguous peptide pieces of proteins that have a known biological function. Minimotif Miner (MnM) (http://mnm.engr.uconn.edu) is a computational minimotif prediction tool that analyzes protein queries for the presence of minimotifs. The basic algorithm employs sequence matching to check whether any of the experimentally validated motifs can be located in the query. It then uses a series of methods (known as filters) to eliminate possible false-positive predictions. Since the initial version of MnM, the MnM database has grown rapidly, and the number of minimotifs has increased from 462 to 294,933. This growth has also resulted in more false positives in our predictions. In our work, we have developed novel filters that address this problem using knowledge of cellular function and molecular function. We have also developed a computational combination of these filters with other filters based on protein-protein interaction, frequency score, and surface prediction score, which significantly increases the accuracy of minimotif prediction.
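The match-then-filter pipeline described above can be sketched as follows. This is an illustrative sketch only, not MnM's implementation: the pattern syntax (`x` as a single-residue wildcard), the weighted-sum combination, and the names `find_motifs`, `combined_score`, `filter_hits`, `weights`, and `cutoff` are all assumptions. The abstract does not say how the filters are combined.

```python
import re


def find_motifs(query, motifs):
    """Return (name, start, matched_text) for every motif occurrence.

    `motifs` maps a motif name to a pattern in which 'x' stands for any
    single residue (an assumed, simplified syntax).
    """
    hits = []
    for name, pattern in motifs.items():
        # A lookahead lets overlapping occurrences be reported too.
        regex = re.compile("(?=(" + pattern.replace("x", ".") + "))")
        for m in regex.finditer(query):
            hits.append((name, m.start(), m.group(1)))
    return sorted(hits, key=lambda h: (h[1], h[0]))


def combined_score(scores, weights):
    """Weighted sum of per-filter scores (one possible combination)."""
    return sum(weights[name] * scores[name] for name in weights)


def filter_hits(hits, scores_for_hit, weights, cutoff):
    """Keep only hits whose combined filter score reaches the cutoff."""
    return [h for h in hits
            if combined_score(scores_for_hit[h], weights) >= cutoff]
```

Here `scores_for_hit` would map each hit to a dictionary of per-filter scores, for example keyed by protein-protein interaction, frequency, and surface prediction. Hits that match the sequence but score poorly across the filters are discarded as likely false positives.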
In addition, we studied a fundamental operation in bioinformatics and biomedical informatics: the external, or out-of-core, selection problem. The selection problem is to find the i-th smallest element among a given set of input elements. ‘Out-of-core’ refers to the case in which the number of input elements is far larger than the core memory can hold. Applications include noise reduction (e.g., median filters) in signal or image processing, high-breakdown regression in robust statistics, clustering, neural networks, and data mining, all of which play an important role in computational biology. We propose a novel algorithm that takes no more than (2+epsilon) passes over the data (epsilon being a very small fraction) and compare it with some of the best existing algorithms.
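To make the idea of selection in a small number of passes concrete, here is a generic sampling-based sketch. It is not the thesis's algorithm. It assumes the data can be re-read through a hypothetical callable `read_pass` that returns a fresh iterator for each pass, and the parameters `sample_size` and `slack` are illustrative. Pass 1 draws a random sample to bracket the target, and pass 2 keeps only the elements inside the bracket. If the bracket misses, the sketch falls back to an in-memory sort, where a real out-of-core method would need further passes.

```python
import random


def select_two_pass(read_pass, n, i, sample_size=1000, slack=None, seed=0):
    """Return the i-th smallest (1-indexed) of the n values from read_pass()."""
    rng = random.Random(seed)

    # Pass 1: reservoir-sample up to sample_size elements.
    sample = []
    for k, x in enumerate(read_pass()):
        if k < sample_size:
            sample.append(x)
        else:
            j = rng.randint(0, k)
            if j < sample_size:
                sample[j] = x
    sample.sort()
    s = len(sample)

    # Bracket the target's expected position within the sorted sample.
    pos = (i - 1) * s // n
    if slack is None:
        slack = max(1, int(2 * s ** 0.5))
    lo = sample[pos - slack] if pos - slack >= 0 else None  # None: unbounded
    hi = sample[pos + slack] if pos + slack < s else None

    # Pass 2: count elements below the bracket, keep those inside it.
    below, middle = 0, []
    for x in read_pass():
        if lo is not None and x < lo:
            below += 1
        elif hi is None or x <= hi:
            middle.append(x)

    if below < i <= below + len(middle):
        middle.sort()
        return middle[i - below - 1]
    # Bracket missed the target: fall back (not out-of-core friendly).
    return sorted(read_pass())[i - 1]
```

With a suitable sample size, the bracket usually contains the target and only a small fraction of the input is held in memory during the second pass.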
Recommended Citation
Mi, Tian, "Efficient Techniques for Big Data Processing in Data Integration and Motif Search" (2013). Doctoral Dissertations. 168.
https://digitalcommons.lib.uconn.edu/dissertations/168