Date of Completion
5-5-2017
Embargo Period
10-27-2017
Keywords
Closest Pair Problem (CPP), Error Correction, Feature Selection, Genome-wide Association Study (GWAS), Hierarchical Clustering, Metagenomics, Scaffolding, Sequence Compression, Spliced Junctions, Time Series Motifs
Major Advisor
Sanguthevar Rajasekaran
Associate Advisor
Chun-Hsi (Vincent) Huang
Associate Advisor
Ion Mandoiu
Associate Advisor
Mohammad Maifi Hasan Khan
Field of Study
Computer Science and Engineering
Degree
Doctor of Philosophy
Open Access
Open Access
Abstract
In this dissertation we offer novel algorithms for big data analytics. We live in a period when voluminous datasets get generated in every walk of life. It is essential to develop novel algorithms to analyze these and extract useful information. In this thesis we present generic data analytics algorithms and demonstrate their applications in various domains.
A number of fundamental problems, such as clustering, data reduction, classification, feature selection, closest pair detection, data compression, sequence assembly, error correction, and metagenomic phylogenetic clustering, arise in big data analytics. We have worked on some of these fundamental problems and developed algorithms that outperform the best prior algorithms. For example, we have come up with a series of data compression algorithms for biological data that offer better compression ratios while reducing the compression and decompression times drastically. As another example, we have invented an efficient algorithm for the closest pair problem, which has numerous applications. When applied to the two-locus problem in Genome-wide Association Studies, our algorithm runs two orders of magnitude faster than the best-known prior algorithm for that problem. As a third example, we have proposed a novel deterministic sampling technique that can be used to speed up any clustering algorithm. Empirical results show that this technique yields a speedup of more than an order of magnitude over exact hierarchical clustering algorithms. Also, the accuracy obtained is excellent. In fact, on many datasets, we get an accuracy that is better than that of exact hierarchical clustering algorithms!
Recommended Citation
Saha, Subrata, "Novel Algorithms for Big Data Analytics" (2017). Doctoral Dissertations. 1481.
https://digitalcommons.lib.uconn.edu/dissertations/1481