Performance modelling and experimental evaluation of systems that perform N tasks using P fault-prone processors in parallel

Date of Completion

January 2002


Computer Science




This thesis presents a family of Markov models for analyzing the performance of parallel/distributed systems that execute a job consisting of N independent and idempotent tasks using P fault-prone processors in parallel. A prototype, implemented using an extended version of ACMPI, is used for experiments based on simulated task times and processor failures. The model is a Markov chain whose states represent the service and failure rates with k (0 < k ≤ P) active processors. Task times and processor failure times are both exponentially distributed. A number of formulas and algorithms are derived for determining the probability of system failure, the average number of processor failures, the failure distribution and mean time to failure, the mean execution time, the work, and other measurable quantities. Since the set of tasks to be processed is fixed (i.e. there is no arrival process) and there is no repair process, there is a nonzero probability that the job will never finish. Therefore, the performance parameters must be conditioned on the job finishing successfully. Results compare the analytic model with the prototype over a range of processor failure rates.
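As an illustration of why the job may never finish, the following is a minimal sketch of one possible chain of this kind, not the model derived in the thesis. It assumes (hypothetically) that a state is a pair (n tasks remaining, k active processors), that min(n, k) processors are busy and each completes a task at rate `mu`, that every active processor fails at rate `lam` whether busy or idle, and that a task interrupted by a failure simply remains to be done. The function name and parameters are introduced here for illustration only.

```python
def success_probability(n_tasks, n_procs, mu, lam):
    """Probability that all tasks finish before every processor has failed.

    Assumed model (illustrative, not taken from the thesis):
      - state (n, k): n tasks remaining, k active processors
      - task completions occur at rate min(n, k) * mu
      - processor failures occur at rate k * lam, with no repair
      - requires mu > 0 and lam >= 0
    """
    # S[n][k] = probability of finishing the job starting from state (n, k)
    S = [[0.0] * (n_procs + 1) for _ in range(n_tasks + 1)]

    # No tasks left: the job has already finished.
    for k in range(n_procs + 1):
        S[0][k] = 1.0

    # S[n][0] stays 0.0 for n > 0: tasks remain but no processor is left.
    for n in range(1, n_tasks + 1):
        for k in range(1, n_procs + 1):
            done_rate = min(n, k) * mu   # next event is a task completion
            fail_rate = k * lam          # next event is a processor failure
            total = done_rate + fail_rate
            # Condition on which exponential event happens first.
            S[n][k] = (done_rate * S[n - 1][k] + fail_rate * S[n][k - 1]) / total
    return S[n_tasks][n_procs]


if __name__ == "__main__":
    # Example: 20 tasks on 4 processors with a failure rate of 5% of the service rate.
    print(success_probability(20, 4, mu=1.0, lam=0.05))
```

Under these assumptions a single task on a single processor with `mu == lam` succeeds with probability 0.5, and setting `lam` to 0 gives probability 1. Quantities such as mean execution time would then be computed conditional on success, as the abstract notes.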