Date of Completion


Embargo Period



Cache design, Energy-efficiency, Large-scale computing systems

Major Advisor

Chun-Hsi Huang

Associate Advisor

Reda Ammar

Associate Advisor

Sanguthevar Rajasekaran

Field of Study

Computer Science and Engineering


Doctor of Philosophy

Open Access

Campus Access


As we approach the era of exascale computing systems, in which 1,000 cores can be integrated on a single die, energy efficiency is the most significant impediment. Future exascale systems, which are capable of executing a thousand times as many operations per second as current petascale systems, are constrained by a power budget of 20 MW, whereas a representative current supercomputer consumes 17.8 MW. Achieving exaflop performance with approximately the same power usage as today’s supercomputers is a major research challenge and will require radical design changes at all levels of the computing stack, including circuits, hardware architectures, software, and applications. A key contributor to processor energy consumption is the cache, a small, high-speed memory within the processor that plays an important role in narrowing the speed gap between the CPU and main memory. In this work, we investigate an energy-efficient cache architecture suitable for extreme-scale computing systems. Specifically, we investigate an L1 data cache design that saves energy without sacrificing performance.

A single large cache consumes more energy per access than a smaller cache. Reducing cache capacity should therefore reduce energy dissipation. However, a smaller cache incurs more misses, which degrades performance. Thus, instead of reducing the cache capacity, we propose splitting the L1 data cache into two subcaches, each of which stores a specific class of data. Only one small subcache is then referenced on each data cache access, so each access consumes less energy. Because caches rely on the principle of locality, we advocate classifying data according to whether they come from a program’s stack. This classification is useful because the stack memory region exhibits very high locality: its data are referenced frequently, as they relate to function calls, which are among the most common operations in a typical program.
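The routing idea described above can be illustrated with a minimal simulation sketch. It is not the dissertation's implementation: the class names, the 32-byte line size, the fully associative LRU organization, and the address-range test used to recognize stack references are all assumptions made for illustration.

```python
from collections import OrderedDict

LINE_SIZE = 32  # bytes per cache line (assumed value)


class LRUCache:
    """Fully associative cache model with LRU replacement (assumed organization)."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()  # line tag -> True, ordered oldest to newest use
        self.hits = 0
        self.misses = 0

    def access(self, address):
        tag = address // LINE_SIZE
        if tag in self.lines:
            self.lines.move_to_end(tag)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)  # evict the least recently used line
        self.lines[tag] = True
        return False


class SplitDataCache:
    """Hypothetical non-unified L1 data cache: one subcache per memory region."""

    def __init__(self, stack_lines, non_stack_lines, stack_limit, stack_base):
        self.stack_cache = LRUCache(stack_lines)
        self.non_stack_cache = LRUCache(non_stack_lines)
        self.stack_limit = stack_limit  # lowest stack address (inclusive)
        self.stack_base = stack_base    # end of stack region (exclusive)

    def is_stack(self, address):
        # Assumed classification rule: a simple address-range check.
        return self.stack_limit <= address < self.stack_base

    def access(self, address):
        # Only one subcache is probed per access.
        sub = self.stack_cache if self.is_stack(address) else self.non_stack_cache
        return sub.access(address)


cache = SplitDataCache(stack_lines=2, non_stack_lines=4,
                       stack_limit=0x7000, stack_base=0x8000)
for addr in (0x7FF0, 0x7FF4, 0x1000, 0x7FF0):
    cache.access(addr)
print(cache.stack_cache.hits, cache.stack_cache.misses)
print(cache.non_stack_cache.hits, cache.non_stack_cache.misses)
```

Because each access probes only one subcache, the per-access energy in such a model would depend on the size of that subcache alone rather than on the combined capacity.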

We start by studying the data locality of two memory regions, stack and non-stack. Based on this study, we propose a high-performance non-unified data cache architecture and evaluate its performance against that of a conventional unified data cache. We then examine the effect of various replacement policies on each individual subcache in the non-unified design to observe how replacement effectiveness influences the overall performance of the proposed architecture. Finally, we investigate the energy savings of the non-unified cache architecture.
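To make the replacement-policy comparison concrete, the following sketch counts misses for a single subcache under two policies. The abstract does not name the policies that were evaluated, so LRU and FIFO are chosen here purely as familiar examples; the function name `count_misses`, the fully associative organization, and the toy trace are likewise assumptions.

```python
from collections import OrderedDict


def count_misses(line_trace, capacity, policy):
    """Count misses of a fully associative cache under 'lru' or 'fifo'.

    line_trace is a sequence of cache-line identifiers (hypothetical input).
    """
    if policy not in ("lru", "fifo"):
        raise ValueError("policy must be 'lru' or 'fifo'")
    resident = OrderedDict()  # the front of the dict is the next eviction victim
    misses = 0
    for line in line_trace:
        if line in resident:
            if policy == "lru":
                resident.move_to_end(line)  # a hit refreshes recency under LRU only
            continue
        misses += 1
        if len(resident) >= capacity:
            resident.popitem(last=False)  # evict the line at the front
        resident[line] = True
    return misses


trace = [1, 2, 1, 3, 1, 4]  # toy trace in which line 1 is reused often
print(count_misses(trace, capacity=2, policy="lru"))   # prints 4
print(count_misses(trace, capacity=2, policy="fifo"))  # prints 5
```

In a non-unified design, a function like this could be run separately on the stack and non-stack portions of a trace, so that a different policy can be selected for each subcache.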