Date of Completion
5-5-2017
Embargo Period
5-4-2017
Keywords
Data Mining, Formal Methods, Grammars, CFG, Big Data
Major Advisor
Reda Ammar
Co-Major Advisor
Sanguthevar Rajasekaran
Associate Advisor
Swapna Gokhale
Associate Advisor
Yufeng Wu
Field of Study
Computer Science and Engineering
Degree
Doctor of Philosophy
Open Access
Open Access
Abstract
As data collection technologies are advancing and memory storage costs are declining, volumes of data collected have soared. Scientists and investigators are collecting all possible data in fear of missing out on important information. With the merge of the data collection trend, researchers were studying data mining and analysis to find the most efficient way to data mine. There are various valuable data mining techniques that can be found in literature such as Support Vector Machine (SVM), Neural Networks (ANN), and Formal Methods (Grammars). Grammars are a very valuable in analyzing structured data and describing them in a condense matter. However, not many have used it for data mining even though it has many benefits. In this research we present an approach to data mine big data. First, a grammar is inferred to build a structural model that describes the data. Then, on the next phase, a probabilistic context-free grammar is inferred and a model for a more complex structures. Given an input sequence, the model parses and generates the probability of that data sequence being part of the class based on its structural characteristics. Grammatical concatenation is utilized in case of existing sub-structures within the class’s structural description. The model then accepts, or rejects, the input as part of the data’s class by comparing the probability to a pre-set threshold. Finally, this is applied on a heterogeneous large data set by inferring multiple grammars. After building grammatical model for each class, the algorithm parse multiple points in the large set. It then classifies these data into smaller sets where they share similar structural characteristics using probabilistic grammar. If more than one class accepts the data point, it is associated to the highest ranking class. Biological data, DNAs and Proteins, were used for experimentation in this research.
Recommended Citation
Algwaiz, Aljoharah A., "Formal Modeling and Mining of Big Data" (2017). Doctoral Dissertations. 1373.
https://digitalcommons.lib.uconn.edu/dissertations/1373