Date of Completion

8-10-2017

Embargo Period

2-6-2018

Advisors

Kevin Brown, Ph.D., Ion Mandiou, Ph.D., Yong-Jun Shin, M.D., Ph.D.

Field of Study

Biomedical Engineering

Degree

Master of Engineering

Open Access

Open Access

Abstract

The ability to collect and store large amounts of data is transforming data-driven discovery; recent technological advances in biology allow systematic data production and storage at a previously unattainable scale. It is common for biological Big Data to have an order of magnitude or more features than samples. Feature scoring followed by selection is therefore an essential pre-processing step for finding meaningful clusters in these data. Many feature scoring algorithms have been proposed; they are based on dramatically different ideas about what constitutes a “good” or “important” feature. Motivated by studies in data classification, we use a rank aggregation (RANKAGG) method to combine estimates of feature importance from multiple sources and use a subset of the highest scoring features for subsequent clustering. We demonstrate the performance of RANKAGG on five real-world biological data-sets, and compare the clustering performance of RANKAGG to that of the thirteen individual feature scoring methods comprising RANKAGG. The rank aggregated features have a mean performance across the five data-sets equal to that of the best individual feature scoring method but with lower variance, indicating robust performance across a variety of data. We also examine whether there is any systematic way to remove rankers from RANKAGG to improve clustering performance. We demonstrate that rank aggregated feature selection yields excellent performance in clustering problems and, possibly more importantly, greatly limits the risk of choosing a method that is sub-optimal for a given data-set.
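
The general idea of rank-aggregated feature selection can be illustrated with a minimal Python sketch. The abstract does not state which aggregation rule or which thirteen scorers the thesis uses, so everything here is an assumption made for illustration: the mean-rank (Borda-style) rule, the two placeholder scorers `variance_score` and `mad_score`, and the helper names `ranks_descending`, `aggregate_ranks`, and `select_top_features`.

```python
import numpy as np

def ranks_descending(scores):
    """Rank features so the highest score gets rank 1 (ties broken by index)."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks

def aggregate_ranks(score_lists):
    """Assumed aggregation rule: average each feature's rank over all scorers."""
    rank_matrix = np.vstack([ranks_descending(s) for s in score_lists])
    return rank_matrix.mean(axis=0)

def select_top_features(X, scorers, k):
    """Return indices of the k features with the best (lowest) mean rank."""
    score_lists = [scorer(X) for scorer in scorers]
    mean_rank = aggregate_ranks(score_lists)
    return np.argsort(mean_rank, kind="stable")[:k]

# Two placeholder unsupervised scorers (hypothetical, not from the thesis).
def variance_score(X):
    return X.var(axis=0)

def mad_score(X):
    # Mean absolute deviation of each feature from its median.
    return np.abs(X - np.median(X, axis=0)).mean(axis=0)

# Synthetic data with far more features than samples; the first five
# features are given a much larger spread than the rest.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 200))
X[:, :5] *= 10.0

selected = select_top_features(X, [variance_score, mad_score], k=5)
X_reduced = X[:, selected]  # reduced matrix passed on to a clustering step
```

The reduced matrix `X_reduced` would then be handed to whichever clustering algorithm is being evaluated. Averaging ranks rather than raw scores keeps scorers with different scales from dominating one another.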

Major Advisor

Kevin Brown, Ph.D.
