Zero-Cycle Loads: Microarchitecture Support for Reducing Load Latency
Todd M. Austin, Gurindar S. Sohi 
Abstract
For many programs, untolerated load instruction latencies have a
significant impact on overall program performance. As one means of
mitigating this effect, we present an aggressive hardware-based
mechanism that provides support for reducing the latency of load
instructions.
By speculatively issuing loads early in the pipeline, load results
become available earlier and load latency is reduced. Through the
application of {\em fast address calculation}, a pipeline optimization
that permits effective address calculation and speculative data cache
access to proceed in parallel, it is possible to further reduce the
latency of load operations. The combination of early issue and fast
address calculation permits many load instructions to complete up to
two cycles earlier than traditional pipeline designs. On a pipeline
with one-cycle data cache access, this results in what we term a {\em
zero-cycle load}. A zero-cycle load produces a result prior to reaching
the execute stage of the pipeline, allowing subsequent dependent
instructions to issue unfettered by load dependencies. Programs
executing on processors with support for zero-cycle loads experience
significantly fewer pipeline stalls due to load instructions and
increased overall performance.
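The cycle accounting behind the term can be sketched as follows. This is a minimal illustration, not the paper's model: it assumes a traditional pipeline in which a load spends one cycle on address calculation followed by the data cache access, and that early issue and fast address calculation each remove one cycle when speculation succeeds. The function name and parameters are hypothetical, and misspeculation penalties are ignored.

```python
def load_latency(cache_access_cycles=1, early_issue=False, fast_address_calc=False):
    """Cycles a dependent instruction waits on a load (hypothetical model).

    Assumes a baseline of one address-calculation cycle followed by the
    data cache access, and that every speculation succeeds.
    """
    latency = 1 + cache_access_cycles
    if fast_address_calc:
        # Address calculation proceeds in parallel with the speculative
        # cache access, so it no longer costs a separate cycle.
        latency -= 1
    if early_issue:
        # The load is issued one stage earlier in the pipeline
        # (assumed here to save exactly one cycle).
        latency -= 1
    return latency


baseline = load_latency()
fast_only = load_latency(fast_address_calc=True)
zero_cycle = load_latency(early_issue=True, fast_address_calc=True)
print(baseline, fast_only, zero_cycle)
```

Under these assumptions, the combined case on a one-cycle cache is the only configuration in which the latency seen by dependent instructions drops to zero.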
We present two pipeline designs supporting zero-cycle loads: one for
pipelines with a single stage for instruction decode, and another for
pipelines with multiple decode stages. We evaluate these designs in a
number of contexts:  with and without software support, in-order vs.
out-of-order issue, and on architectures with many and few registers.
We find that our approach is quite effective at reducing the impact of
load latency, even more so on architectures with in-order issue,
software support, and few registers.
Keywords
memory system optimization, fast address calculation,
reducing load latency, tolerating load latency, memory system design