Date of Completion


Embargo Period



phenotype refinement, heritable component, multi-view clustering

Major Advisor

Jinbo Bi

Associate Advisor

Henry R. Kranzler

Associate Advisor

Sanguthevar Rajasekaran

Associate Advisor

Dong-Guk Shin

Associate Advisor

Yufeng Wu

Field of Study

Computer Science and Engineering


Doctor of Philosophy

Open Access


Unlike univariate phenotypes such as human height, multivariate phenotypes such as substance use disorders are characterized by multiple low-level phenotypic features. Because these features vary substantially across individuals, such phenotypes are heterogeneous. This phenotypic heterogeneity substantially limits success in uncovering the genetic factors underlying the phenotype, so identifying homogeneous disease subtypes can be both necessary and beneficial. Despite great progress in molecular genetics that allows the genomewide identification of common and rare variants, there has been considerably less progress in the refinement of phenotypes.

The most recent and sophisticated phenotype refinement approaches perform unsupervised cluster analysis to partition a sample population into subgroups based only on the phenotypic features. Because genotypic data are not used to guide the derivation of subtypes, the resultant subtypes may differ only in phenotypic features and thus have limited utility in genetic association analyses. In this thesis, we propose to refine a multivariate phenotype by simultaneously modeling both phenotypic features and genotypic markers. Two integrative approaches are investigated.

In the first approach, we propose a multi-view cluster analysis to identify clusters of subjects that agree across two views: the phenotypic view and the genotypic view. Two different algorithms have been developed along this line. Based on multi-objective programming, the first algorithm integrates a cluster analysis on phenotypic data and a classification on genotypic data by simultaneously optimizing two objectives: (1) the resultant clusters should differ significantly in phenotypic features; (2) the clusters should be well separated by classifiers built on genetic variants. Based on sparse matrix decomposition methods, the second algorithm simultaneously decomposes the two data matrices of phenotypic features and genotypic markers into factorized components that share a common structure. This algorithm jointly groups the rows (forming subject clusters) and the columns (the features that determine the subject clusters) of each matrix, and the resultant row groups are consistent across the two matrices.
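
The two objectives of the first multi-view algorithm can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: it only scores a given candidate clustering, and the function names, the nearest-centroid classifier, the equal weighting of the two objectives, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def phenotypic_separation(pheno, labels):
    """Objective 1 (illustrative): share of phenotypic variance explained by the clusters."""
    grand_mean = pheno.mean(axis=0)
    total = ((pheno - grand_mean) ** 2).sum()
    between = 0.0
    for k in np.unique(labels):
        members = pheno[labels == k]
        between += len(members) * ((members.mean(axis=0) - grand_mean) ** 2).sum()
    return between / total

def genotypic_separability(geno, labels):
    """Objective 2 (illustrative): training accuracy of a nearest-centroid
    classifier that predicts the cluster label from genotypes."""
    classes = np.unique(labels)
    centroids = np.stack([geno[labels == k].mean(axis=0) for k in classes])
    dists = ((geno[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    predicted = classes[dists.argmin(axis=1)]
    return (predicted == labels).mean()

def combined_score(pheno, geno, labels, weight=0.5):
    """Hypothetical scalarization of the two objectives (weight is an assumption)."""
    return (weight * phenotypic_separation(pheno, labels)
            + (1.0 - weight) * genotypic_separability(geno, labels))

# Simulated data (assumption): two subtypes that differ in both views.
rng = np.random.default_rng(0)
true_labels = np.repeat([0, 1], 50)
pheno = rng.normal(size=(100, 4)) + 3.0 * true_labels[:, None]
allele_freq = np.repeat(0.2 + 0.6 * true_labels[:, None], 6, axis=1)
geno = rng.binomial(2, allele_freq).astype(float)  # minor-allele counts 0/1/2

# A clustering that agrees across both views should outscore a shuffled one.
random_labels = rng.permutation(true_labels)
score_true = combined_score(pheno, geno, true_labels)
score_random = combined_score(pheno, geno, random_labels)
print(score_true > score_random)
```

A clustering that is consistent with both the phenotypic and the genotypic view scores well on both terms, whereas a partition unrelated to the data scores poorly on both; an actual multi-objective method would search over clusterings rather than score a fixed one.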

In the second approach, we propose to use heritability to guide the subtype derivation. Heritability measures the genetic contribution to the variation of a trait and is commonly estimated from related individuals in pedigrees. The availability of dense genomewide markers also allows heritability to be estimated directly from unrelated individuals and their genomewide single nucleotide polymorphisms (SNPs). We have hence developed two algorithms that identify disease subtypes with high heritability. The first algorithm takes family pedigrees as genetic inputs, whereas the second takes genomewide SNPs. Both algorithms derive subtypes as a linear combination of phenotypic features, and this combination is obtained by maximizing the likelihood of observing a high pedigree-based or SNP-based heritability.
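
For reference, the quantities in this paragraph can be written out using the standard variance-ratio definition of heritability. The notation is an assumption introduced here, not taken from the thesis: \sigma_g^2 and \sigma_e^2 denote the genetic and residual variance components, X the subject-by-feature matrix of phenotypic features, and w the vector of combination weights.

```latex
h^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2},
\qquad
y = X w,
\qquad
h^2(w) = \frac{\sigma_g^2(w)}{\sigma_g^2(w) + \sigma_e^2(w)}
```

Here y is the derived trait, and \sigma_g^2(w) and \sigma_e^2(w) are the variance components of y, estimated from pedigrees in one algorithm and from genomewide SNPs in the other. The weights w are chosen so that the derived trait is highly heritable; the exact likelihood being maximized is not specified in this abstract.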

All proposed algorithms were first validated in simulation studies. The validated algorithms were then used in case studies to analyze real-life datasets that were aggregated from genetic studies of drug dependence including opioid and cocaine dependence. These empirical studies demonstrate the superior performance of the proposed approaches over the state of the art.