Zero-Cycle Loads: Microarchitecture Support for Reducing Load Latency

Todd M. Austin, Gurindar S. Sohi

Abstract

For many programs, untolerated load instruction latencies have a significant impact on overall program performance. As one means of mitigating this effect, we present an aggressive hardware-based mechanism that provides support for reducing the latency of load instructions. By speculatively issuing loads early in the pipeline, load results become available earlier and load latency is reduced. Through the application of {\em fast address calculation}, a pipeline optimization that permits effective address calculation and speculative data cache access to proceed in parallel, it is possible to further reduce the latency of load operations. The combination of early issue and fast address calculation permits many load instructions to complete up to two cycles earlier than in traditional pipeline designs. On a pipeline with one-cycle data cache access, this results in what we term a {\em zero-cycle load}. A zero-cycle load produces a result prior to reaching the execute stage of the pipeline, allowing subsequent dependent instructions to issue unfettered by load dependencies. Programs executing on processors with support for zero-cycle loads experience significantly fewer pipeline stalls due to load instructions and increased overall performance. We present two pipeline designs supporting zero-cycle loads: one for pipelines with a single stage for instruction decode, and another for pipelines with multiple decode stages. We evaluate these designs in a number of contexts: with and without software support, in-order vs. out-of-order issue, and on architectures with many and few registers. We find that our approach is quite effective at reducing the impact of load latency, even more so on architectures with in-order issue, software support, and few registers.
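The cycle accounting behind the term "zero-cycle load" can be illustrated with a small sketch. It is not taken from the paper: the function `load_latency` and its parameters are hypothetical, and it assumes a traditional pipeline that spends one cycle on address calculation followed by the data cache access, that early issue saves exactly one cycle, and that all speculation succeeds.

```python
def load_latency(cache_access_cycles=1, addr_calc_cycles=1,
                 fast_address_calc=False, early_issue=False):
    """Cycles a dependent instruction waits on a load, counted from the
    point the load would reach the execute stage (illustrative model only)."""
    # Assumed baseline: address calculation, then data cache access.
    latency = addr_calc_cycles + cache_access_cycles
    if fast_address_calc:
        # Address calculation proceeds in parallel with the speculative
        # cache access, so its cycles no longer add to the latency.
        latency -= addr_calc_cycles
    if early_issue:
        # Assumption: issuing the load one stage early hides one cycle.
        latency -= 1
    return max(latency, 0)


baseline = load_latency()
with_fast_calc = load_latency(fast_address_calc=True)
with_both = load_latency(fast_address_calc=True, early_issue=True)
print(baseline, with_fast_calc, with_both)
```

Under these assumptions, the two optimizations together remove the two cycles of the baseline, which matches the "up to two cycles earlier" claim for a one-cycle data cache. The model ignores misspeculation, which in a real pipeline would require recovery and would lengthen the affected loads.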

Keywords

memory system optimization, fast address calculation, reducing load latency, tolerating load latency, memory system design
