New development of Bayesian mixture models for survival and survey data

Date of Completion

January 2008


Biology, Biostatistics|Statistics




Mixture models are becoming increasingly popular in a variety of disciplines, such as biometrics, econometrics, and the social sciences. Flexible model building and a comprehensive understanding of the data structure play an important role in modern statistical data analysis.

This thesis focuses on new developments in Bayesian mixture models with applications to survey data and survival data. Using real data applications, the dissertation addresses several aspects of Bayesian modeling and computation for analyzing data with a complex structure.

In this dissertation, we propose models for three applications. First, we consider a new mixture model for misidentification with application to survey data. An inherent problem in survey data is the potential misclassification of group membership. In this study, we develop a new mixture model that allows researchers to address this problem by supplying additional information to the data being analyzed. As anticipated, the more information we supply to adjust for group membership, the better the model fits.

Second, we extend this mixture structure to model survival data with a cure fraction via latent cure rate markers. In the proposed model, the latent cure rate markers are modeled via a multinomial logistic regression. The proposed model assumes that patients may be classified into several risk groups based on their cure fractions. Based on the nature of the proposed model, a posterior predictive algorithm is also developed to classify patients into different risk groups. The proposed model not only bears more biological meaning but also, according to the popular LPML (logarithm of the pseudo-marginal likelihood) measure, fits the data much better than several existing competing cure rate models.

Third, we develop a new mixture proportional hazards model for misidentification with application to prostate cancer data, where the misclassified variable is the biopsy Gleason score and the "true Gleason score" is modeled via a multinomial logistic regression. The proposed model has the potential to rectify misclassification in prostate cancer diagnosis.

Due to the nature of the problems and the complexity of the mixture models, we employ the Bayesian approach to carry out all inferences. The theoretical properties of the proposed mixture models are carefully examined, and different criteria for model assessment are developed for these three types of mixture models.
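
For readers unfamiliar with the LPML measure mentioned in the abstract, the following is a minimal sketch of how it is commonly estimated from posterior draws. It is not the thesis's own code. It assumes a hypothetical matrix `loglik` whose entry `[m, i]` holds the log-likelihood of observation `i` under the `m`-th posterior draw, and it uses the standard harmonic-mean estimate of each conditional predictive ordinate (CPO). The function name `lpml` and its input layout are illustrative assumptions.

```python
import numpy as np


def lpml(loglik):
    """Estimate LPML = sum_i log(CPO_i) from posterior draws.

    loglik: array of shape (M, n), assumed to hold log f(y_i | theta_m)
            for M posterior draws and n observations (hypothetical layout).
    Returns (lpml_value, log_cpo), where log_cpo has shape (n,).
    """
    loglik = np.asarray(loglik, dtype=float)
    n_draws = loglik.shape[0]

    # CPO_i is estimated by the harmonic mean of the likelihoods:
    #   CPO_i ~= ( (1/M) * sum_m 1 / f(y_i | theta_m) )^(-1)
    # Work on the log scale with a max shift to avoid overflow.
    neg = -loglik
    shift = neg.max(axis=0)
    log_mean_inverse = shift + np.log(np.exp(neg - shift).sum(axis=0)) - np.log(n_draws)

    log_cpo = -log_mean_inverse
    return float(log_cpo.sum()), log_cpo
```

A larger LPML indicates better predictive fit, so competing models can be compared by computing `lpml` on each model's log-likelihood matrix for the same data.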