Date of Completion


Embargo Period



tree-based models, predictive model of insurance claims, variable annuity, hybrid tree-based models

Major Advisor

Emiliano A. Valdez

Associate Advisor

Guojun Gan

Associate Advisor

Yuwen Gu

Field of Study



Doctor of Philosophy

Open Access

Open Access


Tree-based models are supervised learning algorithms that repeatedly partition the space of the explanatory variables into homogeneous groups. Each partition is chosen to minimize a loss function related to the response variable. The result is a tree structure for predicting the response, which also makes the model easier to interpret. Because of these advantages, tree-based models have become popular alternative predictive tools for building classification and regression models in disciplines such as engineering, biostatistics, and ecology. Because a single decision tree may not produce accurate predictions, we also examine the benefits of ensemble methods (e.g., random forests, boosting), which combine several trees to improve accuracy. We also describe procedures for tuning model parameters to further improve predictive accuracy. In this thesis, we explore the many potential uses of tree-based models in actuarial science and insurance. First, in valuing large portfolios of variable annuities, we examine the performance of tree-based methods as alternative metamodels for calculating the guarantees embedded in these products. Simulation procedures have been the norm, but tree-based models produce accurate results efficiently and drastically reduce the time needed for valuation. Second, for claims prediction in general insurance, we develop an innovative approach based on hybrid tree-based models, which can be described as a two-step procedure. The first step develops a classification tree-based model for the frequency component, and the second step builds an elastic net regression model for the severity component at each terminal node produced by the classification tree. The resulting hybrid tree structure captures the many benefits of tree-based models and is proposed as an improvement on the Tweedie generalized linear model (GLM), which is widely used in practice. Finally, we apply multivariate tree models to multi-line insurance claims data with correlated responses. The literature on the theory and uses of trees with multivariate responses is relatively sparse. However, in building trees as predictive models with multivariate responses, we find potential benefits: a better understanding of the inherent relationships among the several responses and even improved marginal predictive accuracy. In the future, to better accommodate the peculiar characteristics of multivariate claim responses, we will further investigate tree-based models using alternative multivariate loss functions.
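As an illustration of the partitioning step described in the abstract, the following is a minimal sketch, not the thesis's implementation. It assumes a single explanatory variable and squared-error loss, and searches for the one threshold that minimizes the combined loss of the two resulting groups. The function names `sse` and `best_split` and the toy data are hypothetical and chosen only for this example.

```python
from typing import List, Optional, Tuple


def sse(values: List[float]) -> float:
    """Sum of squared errors around the mean (squared-error loss of one group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x: List[float], y: List[float]) -> Optional[Tuple[float, float]]:
    """Return (threshold, loss) for the split x <= threshold that minimizes
    the total squared-error loss of the two groups, or None if no split exists."""
    pairs = sorted(zip(x, y))  # order observations by the explanatory variable
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal x values
        # Candidate threshold: midpoint between two adjacent distinct x values.
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        loss = sse(left) + sse(right)
        if best is None or loss < best[1]:
            best = (threshold, loss)
    return best


# Toy data (assumed for illustration): two clearly separated response levels.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [5.0, 6.0, 5.0, 20.0, 21.0, 19.0]
threshold, loss = best_split(x, y)
print(threshold, loss)
```

A full tree would apply this search recursively within each resulting group and across all explanatory variables, stopping according to a tuning rule such as a minimum group size.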