Date of Completion


Embargo Period



Machine Learning, Matrix Completion, Generative Adversarial Nets, Provable Methods

Major Advisor

Jinbo Bi

Associate Advisor

Alexander Russell

Associate Advisor

Sanguthevar Rajasekaran

Field of Study

Computer Science and Engineering


Doctor of Philosophy

Open Access

Open Access


Over the past decade, machine learning has advanced substantially and has demonstrated great success in domains such as web search, computer vision, and natural language processing. Despite this practical success, many applications involve solving NP-hard problems with heuristics, and it is challenging to determine whether a heuristic scheme has any theoretical guarantee. In this dissertation, we show that if a certain structure occurs in the sample data, the related problem can be solved with provable guarantees. We propose to exploit granular data structure, e.g., sample clusters or features describing an aspect of the sample, to design new statistical models and algorithms for two learning problems. The first learning problem addresses the commonly encountered missing-data issue by formulating it as a matrix completion problem. When side features describing the data entities are available, we propose a new convex formulation that constructs a bilinear model to infer the missing values from the side features. We prove that, with a much lower sampling rate than classic matrix completion methods require, this approach can exactly recover or epsilon-recover the missing values, depending on whether the side features are corrupted. A novel linearized alternating direction method of multipliers is developed to solve the proposed convex formulation efficiently. For the second learning problem, we build a new generative adversarial network (GAN) that, when the data contain underlying clusters, generates data following a distribution much closer to the true data distribution than that produced by the standard GAN. The proposed model consists of multiple smaller GANs as components, each corresponding to a data cluster identified automatically during the construction of the GAN. This GAN approach can recover the true distribution for every cluster if an appropriate Kolmogorov regularization is used. If the GAN complexity is regularized by smoothness with a parameter epsilon, we prove that the GAN model can approximate the true data distribution within an epsilon tolerance. We use the Adaptive Moment Estimation (Adam) algorithm to optimize this model scalably. Together, the two proposed approaches bring new insights and suggest new methods for provable and scalable machine learning.
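
The bilinear side-feature model described in the abstract can be illustrated with a minimal sketch. It assumes noiseless side features, models the matrix as X Z Y^T, and penalizes the nuclear norm of Z. The solver here is plain proximal gradient descent with singular value thresholding, swapped in for brevity; it is not the linearized alternating direction method of multipliers developed in the dissertation. The function names, the objective 0.5 * ||P_Omega(X Z Y^T - M)||_F^2 + lam * ||Z||_*, and all parameter values are assumptions made for illustration only.

```python
import numpy as np

def svt(Z, tau):
    """Singular value thresholding: proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def complete_with_side_features(M, mask, X, Y, lam=1e-3, n_iter=500):
    """Hypothetical helper: fit Z so that X @ Z @ Y.T matches M on observed entries.

    Minimizes 0.5 * ||mask * (X Z Y^T - M)||_F^2 + lam * ||Z||_* by proximal
    gradient descent (an assumed stand-in, not the dissertation's solver).
    """
    # Upper bound on the Lipschitz constant of the smooth term's gradient.
    lipschitz = np.linalg.norm(X, 2) ** 2 * np.linalg.norm(Y, 2) ** 2
    step = 1.0 / lipschitz
    Z = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        # Residual restricted to the observed entries.
        residual = np.where(mask, X @ Z @ Y.T - M, 0.0)
        grad = X.T @ residual @ Y
        Z = svt(Z - step * grad, step * lam)
    return Z

# Synthetic example: a low-rank Z_true and orthonormal side features.
rng = np.random.default_rng(0)
n1, n2, d1, d2, rank = 50, 40, 5, 5, 2
X, _ = np.linalg.qr(rng.standard_normal((n1, d1)))
Y, _ = np.linalg.qr(rng.standard_normal((n2, d2)))
Z_true = rng.standard_normal((d1, rank)) @ rng.standard_normal((rank, d2))
M_true = X @ Z_true @ Y.T

mask = rng.random((n1, n2)) < 0.3          # observe roughly 30% of entries
M_obs = np.where(mask, M_true, 0.0)        # unobserved entries are ignored

Z_hat = complete_with_side_features(M_obs, mask, X, Y)
rel_err = np.linalg.norm(X @ Z_hat @ Y.T - M_true) / np.linalg.norm(M_true)
print(f"relative recovery error: {rel_err:.4f}")
```

Because only the small d1-by-d2 matrix Z is estimated rather than the full n1-by-n2 matrix, far fewer observed entries are needed in this toy setting, which is the intuition behind the lower sampling rate claimed in the abstract.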