Date of Completion

8-18-2017

Embargo Period

8-23-2017

Major Advisor

Omer Khan

Associate Advisor

John Chandy

Associate Advisor

Marten Van Dijk

Field of Study

Electrical Engineering

Degree

Doctor of Philosophy

Open Access

Open Access

Abstract

The ever-increasing miniaturization of semiconductors has led to important advances in mobile, cloud and network computing. However, it has also made electronic devices less reliable and microprocessors more susceptible to transient faults induced by radiation. These transient faults do not cause permanent damage, but they may result in incorrect program execution by altering signal transfers or stored values. They are also called soft errors. As technology scales, researchers and industry pundits project that soft-error problems will become increasingly important. Today’s processors are implemented as multicores, featuring a diverse set of compute cores and on-board memory sub-systems connected via networks-on-chip and communication protocols. Such multicores are widely deployed in numerous environments for their computational capabilities.

To protect multicores from soft-error perturbations, resiliency schemes have been developed that offer high coverage but incur high power and performance overheads. However, not all soft errors affect program correctness; some affect only program accuracy, i.e., the program completes with certain acceptable deviations from the error-free outcome. Thus, it is practical to improve processor efficiency by trading off resiliency overheads against program accuracy. This thesis explains the idea of declarative resilience, which selectively applies resiliency schemes to both crucial and non-crucial code. At the application level, crucial and non-crucial code is identified based on its impact on the program outcome. A cross-layer architecture is developed, through which hardware collaborates with software support to enable efficient resilience with holistic soft-error coverage. Only program accuracy is compromised in the worst-case scenario of a soft-error strike during non-crucial code execution.
