Clustering, classification and function estimation for high dimensional data arising from bioinformatics and related domains

Date of Completion

January 2006


Statistics|Biology, Bioinformatics




With recent advances in computer technology, a new paradigm has begun in which complex biological systems can be analyzed more usefully. This merging of computational and biological science, popularly known as the "omics" era of biomedical research, often produces "modern data": data that are high-dimensional, noisy, and full of irrelevant predictors. As the new technology shows its promise, it also poses exciting challenges. As often happens, in the excitement over new technology the basic principles of reproducibility, scalability, and other design issues are neglected. Moreover, because biological and technological factors are heavily confounded and the signal-to-noise ratio is poor, traditional statistical analysis fails to capture the true "signal" in the data. There is therefore a pressing need for new statistical and algorithmic developments to address these issues and enable successful statistical modeling of "modern data". Robustness, regularization, scalability, and adaptive learning are some of the key concepts that may help achieve this. They are also common threads among the different problems considered in this thesis. Three main problems are covered in this context:
(1) Function Estimation: We have tackled the problem of function estimation for high-throughput mass spectroscopy. Rather than traditional data modeling, we have proposed process modeling through a semiparametric function estimation approach. We have proposed benchmark profiling for the diagnosis, prognosis, and monitoring of disease status. The proposed methodology also suggests a natural way to select irregular patterns, which can be investigated further for biomarker discovery. The selection of statistically significant concomitant variables is also integrated into the proposed semiparametric framework.
(2) Clustering: Clustering is a very common data-analytic problem that finds application not only in bioinformatics but in almost every field of data analysis. We have proposed a novel, scalable solution for model-based clustering of high-dimensional data. The number of clusters is assumed to be unknown at the outset. However, instead of using the traditional reversible jump algorithm, we have developed a scalable algorithm that estimates cluster membership and the number of clusters in a unified framework.
(3) Classification: We have developed an innovative solution for classification in high-dimensional domains. The support vector machine (SVM), based on the Reproducing Kernel Hilbert Space (RKHS), and its variations form an extremely successful methodology for classification because of their robustness and generalization capability. However, the SVM does not perform dimension filtering; it uses all available dimensions to construct a nonlinear classifier. We have developed a new statistical algorithm, the Dimension Augmenting Vector Machine (DAVM), that constructs an approximate submodel for classification from a minimal set of selected dimensions in the original feature space.
Although special emphasis is placed on bioinformatics, the proposed methodology can be applied to other fields that face similar problems.