Date of Completion
4-18-2019
Embargo Period
4-17-2020
Keywords
variational inference, Hidden Markov model, autoregressive, methylation, EM algorithm, variable selection, missing data
Major Advisor
Haim Bar
Associate Advisor
Nalini Ravishanker
Associate Advisor
Dipak Dey
Field of Study
Statistics
Degree
Doctor of Philosophy
Open Access
Campus Access
Abstract
Current popular methods of methylation data analysis rely on multiple testing, which requires the assumption that loci are independent. The effects of nearby sites in sequencing are usually ignored. Some methods use a hidden Markov model (HMM) to model the influence of neighbors. However, the standard HMM assumes locally homogeneous segments with constant variances (homoscedasticity) or constant autocorrelations, which is restrictive. When heterogeneity of variances or autocorrelations is introduced and missing values occur, the well-known Baum-Welch algorithm for HMMs cannot be applied to estimate the model parameters. In this dissertation, we develop a generalized HMM in which AutoRegression and Missing values are handled simultaneously (ARM-HMM). To provide fast and accurate inference, a modified expectation-maximization (EM) algorithm and variational inference are introduced as two fitting procedures. Feature extraction and variable selection techniques are further developed and compared for adequacy and efficiency in detecting important biomarkers. Experiments with both simulated and real methylation data show that the proposed ARM-HMM yields precise parameter estimates and detects meaningful segments. With carefully chosen variable selection methods, biologically meaningful methylation regions can also be detected.
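For orientation only, the sketch below shows the plain baseline that the abstract generalizes: a scaled forward pass for a Gaussian HMM with state-specific variances and independent emissions, where a missing observation simply contributes no emission term. It is not the ARM-HMM of the dissertation. The function names, the use of `None` to mark missing values, and the two-state parameters in the usage note are all assumptions made for illustration. Once emissions depend on the previous observation (autoregression), a missing neighbor means the emission density cannot be evaluated this way, which is presumably part of what motivates the modified fitting procedures.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def forward_loglik(obs, init, trans, means, sds):
    """Scaled forward pass for a Gaussian HMM with independent emissions.

    obs   : list of floats, with None marking a missing observation (assumed convention)
    init  : initial state probabilities
    trans : row-stochastic transition matrix, trans[i][j] = P(state j | state i)
    means, sds : per-state emission mean and standard deviation

    Returns the log-likelihood of the observed values and the filtered
    state probabilities at the last position.
    """
    n_states = len(init)
    alpha = list(init)
    loglik = 0.0
    for t, y in enumerate(obs):
        if t > 0:
            # Propagate the filtered probabilities through the transition matrix.
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n_states))
                for j in range(n_states)
            ]
        if y is not None:
            # Observed value: weight each state by its emission density.
            alpha = [
                alpha[j] * gaussian_pdf(y, means[j], sds[j])
                for j in range(n_states)
            ]
        # Missing value: no emission term, so only the transition step applies.
        scale = sum(alpha)
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return loglik, alpha
```

A call might look like `forward_loglik([0.1, None, 0.9], [0.5, 0.5], [[0.9, 0.1], [0.2, 0.8]], [0.0, 1.0], [0.1, 0.3])`, with hypothetical parameters for a low state and a high state whose variances differ. The second position is skipped in the emission step but still advances the state distribution by one transition.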
Recommended Citation
Liu, Kangyan, "Segmentation, Feature Extraction and Selection in Sequential Data with Missing-Data Imputation" (2019). Doctoral Dissertations. 2134.
https://digitalcommons.lib.uconn.edu/dissertations/2134