To design and evaluate effective architectural mechanisms, we need an in-depth applications study involving a best-effort implementation of each application for each mechanism. Unfortunately, it is difficult to avoid programmer and system bias in such studies. My study copes with bias in two ways. First, my conclusion runs counter to my programmer bias: my results favor shared memory even though I was an original developer of the CMMD message-passing library on the CM-5, my background is in message passing, and this research started with a clear message-passing bias [CS94, CSBS94]. Second, I perform all my experiments on the same hardware. The MIT Alewife multiprocessor [A+95] provides a uniquely integrated environment for comparing cache-coherent distributed shared memory, message passing, and DMA.
Even when an application receives no benefit from caching, cache-coherent shared-memory hardware remains an extremely efficient communication mechanism. Given an architecture that couples a commodity microprocessor to a network interface over a local memory bus, sending a message inherently involves more overhead than performing a shared-memory reference. While a shared-memory reference transfers only an address and data across the bus, sending a message transfers a destination processor identifier, a data address, the data, and, in the case of Active Messages [vECGS92], a message handler address. Messaging overhead is traditionally amortized by increasing the amount of data in each message. In sparse, irregular applications, however, data must first be gathered into a contiguous buffer before it can be sent as a large message. This copying overhead, coupled with increased synchronization time, makes bulk transfer unattractive even when a separate DMA engine can transfer data in the background.
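To make the overhead comparison concrete, the C sketch below contrasts the two mechanisms. All names in it (am_send, dma_send, NodeId, and so on) are hypothetical stand-ins of my own, not the Alewife, CMMD, or Active Messages interfaces: a remote shared-memory store puts only an address and a datum on the bus, an active-message send must also carry a destination and a handler address, and a bulk transfer of sparse data must pay for a gather copy before DMA can even begin.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration only; these are not the
 * Alewife, CMMD, or Active Messages APIs discussed in the thesis. */
typedef int NodeId;
typedef void (*am_handler_t)(NodeId src, uint64_t arg);

extern void am_send(NodeId dest, am_handler_t handler,
                    uint64_t *data_addr, uint64_t datum);
extern void dma_send(NodeId dest, const void *buf, size_t nbytes);

/* Shared memory: the processor issues one store; the bus (and the
 * coherence protocol behind it) sees only an address and a datum. */
void put_shared(volatile uint64_t *remote, uint64_t datum)
{
    *remote = datum;
}

/* Active message: the destination node, the handler address, and the
 * data address all cross the local memory bus in addition to the datum. */
void put_message(NodeId dest, am_handler_t handler,
                 uint64_t *remote_addr, uint64_t datum)
{
    am_send(dest, handler, remote_addr, datum);
}

/* Bulk transfer amortizes per-message overhead, but for sparse,
 * irregular data the sender must first gather scattered elements into
 * a contiguous buffer -- the copy that makes bulk transfer
 * unattractive even with a background DMA engine. */
void put_bulk_sparse(NodeId dest, const uint64_t *table,
                     const int *indices, size_t n, uint64_t *scratch)
{
    for (size_t i = 0; i < n; i++)
        scratch[i] = table[indices[i]];   /* gather: pure copy overhead */
    dma_send(dest, scratch, n * sizeof(uint64_t));
}
```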
Efficient data transfer, however, is only half the story. To make effective use of shared memory, I had to design a new communication model that provides stronger synchronization. Shared memory has weaker synchronization semantics than Active Messages: an Active Message triggers an action on the receiving processor when it arrives, whereas data transferred via shared memory causes no active synchronization event. I overcame these synchronization difficulties by shifting from an owner-computes model of computation to a producer-computes model. This shift led me to help propose a general communication model, Remote Queues (RQ), which provides the semantics of polled Active Messages on a broad range of platforms. By exposing and naming message queues, RQ integrates polling into complex environments that may include interrupts, DMA, multiprogramming, and global memory. In particular, this integration allows both shared-memory and polling Active-Message communication models to be implemented on top of shared-memory hardware. We demonstrate that RQ is useful on Alewife, the CM-5, the T3D, and the Paragon.
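As a rough illustration of the producer-computes shift and the queue-based synchronization it relies on, the sketch below builds a named, polled queue directly on top of shared memory. The structure and names (rqueue_t, rq_enqueue, rq_poll, QSIZE) are my own simplification, not the RQ interface itself: data moves via ordinary shared-memory stores, and the synchronization event is the consumer noticing a new tail pointer when it polls, rather than an interrupting message handler.

```c
#include <stdint.h>

/* A minimal, hypothetical sketch of a polled queue on shared-memory
 * hardware, in the spirit of Remote Queues; it is a single-producer,
 * single-consumer simplification, not the published RQ interface.
 * A real implementation would also need memory fences on weakly
 * ordered machines; volatile alone only constrains the compiler. */
#define QSIZE 256u                    /* must be a power of two */

typedef struct {
    volatile uint64_t slot[QSIZE];    /* data lands via plain stores */
    volatile unsigned head;           /* advanced only by the consumer */
    volatile unsigned tail;           /* advanced only by the producer */
} rqueue_t;

/* Producer-computes: the producer pushes its results directly into the
 * consumer's named queue instead of waiting for the data's owner to
 * request them. Returns 0 if the queue is full. */
int rq_enqueue(rqueue_t *q, uint64_t datum)
{
    unsigned t = q->tail;
    if (t - q->head == QSIZE)
        return 0;                     /* full; caller may retry */
    q->slot[t % QSIZE] = datum;       /* plain shared-memory store */
    q->tail = t + 1;                  /* publish: the sync event */
    return 1;
}

/* The consumer polls at points of its own choosing, folding message
 * handling into its normal control flow instead of taking an
 * interrupt per message. Returns 0 if nothing has arrived. */
int rq_poll(rqueue_t *q, uint64_t *datum)
{
    unsigned h = q->head;
    if (h == q->tail)
        return 0;                     /* empty */
    *datum = q->slot[h % QSIZE];
    q->head = h + 1;
    return 1;
}
```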
The key conclusion is that, even for applications with no data reuse, shared memory is a more efficient hardware mechanism than message passing, without being harder to use. The RQ model extends this efficiency to a broad class of applications and systems, especially data-driven applications previously thought to favor fine-grain synchronization via message passing.
Advisor: Anant Agarwal
Readers: Tom Leighton, Joel Saltz (Maryland), and Rob Schreiber (NASA RIACS).