Date of Completion


Embargo Period



Apache Spark, Performance Modeling, Performance Prediction, Performance Interference, Job Scheduling, Straggler, Performance Optimization, Resource Allocation

Major Advisor

Mohammad Maifi Hasan Khan

Associate Advisor

Swapna Gokhale

Associate Advisor

Song Han

Field of Study

Computer Science and Engineering


Doctor of Philosophy

Open Access

Open Access


Software service providers are increasingly adopting cloud-based solutions to maximize resource utilization while minimizing operating cost. Performance predictability is becoming paramount as such systems are increasingly used in safety-critical settings (e.g., IoT applications, infrastructure monitoring). However, large scale, a high degree of concurrency, and dynamic allocation of resources make traditional performance modeling and tuning frameworks, which are not readily extensible, ill-suited to these platforms. To address this challenge, this thesis focuses on developing a data-driven performance modeling framework. Towards this objective, first, hierarchical performance models are developed that can capture and predict the execution time of a given job with high accuracy based on limited-scale execution data. Subsequently, the models are extended to account for the underlying interactions among multiple jobs and to predict the execution time of a job when it experiences interference from other jobs. The extended models are then leveraged to design and implement a dynamic job scheduler that can automatically predict potential interference and reschedule the affected jobs to significantly reduce interference and job execution time. Second, analytical models are developed to predict the likelihood of suboptimal performance caused by inefficient partitioning of input data and/or skewed task distribution across worker nodes, and to recommend ways to address the identified problems by repartitioning the input data (in the case of the task straggler problem) and/or changing the locality configuration setting (in the case of the skewed task distribution problem).
Finally, the thesis focuses on dynamic allocation of computing resources for cloud platforms, with an approach that leverages kernel-level, application-specific resource usage metrics to allocate resources dynamically, improving application performance while significantly reducing resource requirements compared to static resource allocation strategies. The effectiveness of our approach is evaluated on a real cluster using Apache Spark jobs, and the results are presented in the thesis. We believe that the presented approach will guide future research and help improve resource utilization while significantly reducing operating costs in cloud settings.