A Taxonomy of Parallel Computing
A first-tier decomposition of the space of parallel computing architectures may be codified in terms of coupling: the typical latencies involved in performing and exploiting parallel operations. These range from the most tightly coupled, fine-grained systems of the systolic class, where the parallel algorithm is hardwired into a special-purpose, ultra-fine-grained logic structure with latencies measured in nanoseconds, to the other extreme, often referred to as distributed computing, which engages widely separated computing resources, potentially across a continent or around the world, with latencies on the order of a hundred milliseconds.
- Systolic computers are usually special-purpose hardwired implementations of fine-grained parallel algorithms exploiting one-, two-, or three-dimensional pipelining. Often used for real-time postsensor processing, digital signal processing, image processing, and graphics generation, systolic computing is experiencing a revival through adaptive computing, exploiting the versatile FPGA (field programmable gate array) technology that allows different systolic algorithms to be programmed into the same FPGA medium at different times.
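
As a concrete illustration, the C sketch below simulates a one-dimensional systolic pipeline computing a short FIR filter: each cell holds a fixed weight and, on every global clock tick, multiplies the sample passing through it, adds the partial sum arriving from its left neighbor, and passes both onward. The four-tap filter, the sample values, and the cell encoding are illustrative assumptions; a real systolic system implements the cells as hardwired or FPGA logic rather than software.

#include <stdio.h>

#define K 4   /* filter taps = number of systolic cells (illustrative) */
#define N 8   /* number of input samples (illustrative) */

/* Each cell keeps its fixed weight w, two input-delay registers xa and xb
 * (the input stream moves at half the speed of the partial sums), and the
 * partial-sum register y that it passes to its right-hand neighbor. */
typedef struct { double w, xa, xb, y; } Cell;

int main(void) {
    double w[K]  = {0.5, 0.25, 0.15, 0.10};
    double in[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    Cell c[K] = {{0}};
    for (int i = 0; i < K; i++) c[i].w = w[i];

    /* One pass of this loop models one global clock tick; cells are updated
     * right-to-left so that each reads its neighbors' previous-tick values. */
    for (int t = 0; t < N + K; t++) {
        for (int i = K - 1; i >= 1; i--) {
            c[i].y  = c[i - 1].y + c[i].w * c[i].xa;
            c[i].xb = c[i].xa;
            c[i].xa = c[i - 1].xb;
        }
        c[0].y  = c[0].w * c[0].xa;
        c[0].xb = c[0].xa;
        c[0].xa = (t < N) ? in[t] : 0.0;

        if (t >= K)  /* output y[t-K] emerges from the last cell once the pipe fills */
            printf("y[%d] = %g\n", t - K, c[K - 1].y);
    }
    return 0;
}
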
- Vector computers exploit fine-grained vector operations through heavy pipelining of memory bank accesses and the arithmetic logic unit (ALU) structure, hardware support for gather-scatter operations, and amortizing the instruction fetch/execute cycle overhead over the many basic operations within each vector operation. The basis for the original supercomputers (e.g., Cray), vector processing remains a formidable strategy in certain Japanese high-end systems.
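
The loops below hint at what a vector instruction set buys: on a vector machine, each of the two inner loops, an indexed gather and an AXPY update, would issue as a single pipelined vector instruction, amortizing one fetch/decode over many element operations. This is a plain-C stand-in under assumed array names and an assumed vector length of 8, not code for any particular vector machine.

#include <stdio.h>

#define VLEN 8                      /* vector register length (illustrative) */

int main(void) {
    double x[VLEN] = {1, 2, 3, 4, 5, 6, 7, 8};
    double y[VLEN] = {0};
    double table[16];
    int    idx[VLEN] = {0, 2, 4, 6, 8, 10, 12, 14};
    double a = 2.0;

    for (int i = 0; i < 16; i++) table[i] = 0.5 * i;

    /* "Gather": y[i] = table[idx[i]] -- hardware support for indexed loads
     * lets this run at vector speed despite the irregular access pattern. */
    for (int i = 0; i < VLEN; i++) y[i] = table[idx[i]];

    /* AXPY: y = a*x + y -- one vector instruction covers VLEN multiply-adds
     * streaming through the pipelined ALU. */
    for (int i = 0; i < VLEN; i++) y[i] = a * x[i] + y[i];

    for (int i = 0; i < VLEN; i++) printf("%g ", y[i]);
    printf("\n");
    return 0;
}
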
- SIMD (single instruction, multiple data) architectures exploit fine-grained data parallelism by having many (potentially thousands of) simple processors perform the same operation in lock step but on different data. A single control processor issues the global commands to all slaved compute processors simultaneously through a broadcast mechanism. Such systems (e.g., MasPar-2, CM-2) incorporated large communications networks to facilitate massive data movement across the system in a few cycles. No longer an active commercial area, SIMD structures continue to find special-purpose application for postsensor processing.
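
The sketch below mimics the SIMD model in plain C: a control loop steps through a tiny three-instruction "program" and, for each instruction, every simulated processing element applies the same operation to its own local data in lock step. The instruction set, the eight-element ensemble, and the data values are illustrative assumptions, not any particular machine.

#include <stdio.h>

#define NPE 8                       /* number of (simulated) processing elements */

typedef enum { LOAD_CONST, ADD_LOCAL, MUL_CONST } Op;

int main(void) {
    double local[NPE] = {1, 2, 3, 4, 5, 6, 7, 8};   /* per-PE local memory   */
    double reg[NPE];                                /* per-PE register       */

    /* The "program" issued by the control processor, one broadcast per step. */
    struct { Op op; double operand; } program[] = {
        { LOAD_CONST, 10.0 },   /* every PE: reg = 10          */
        { ADD_LOCAL,   0.0 },   /* every PE: reg += local data */
        { MUL_CONST,   2.0 },   /* every PE: reg *= 2          */
    };
    int nprog = (int)(sizeof program / sizeof program[0]);

    for (int pc = 0; pc < nprog; pc++) {        /* control processor broadcasts */
        for (int pe = 0; pe < NPE; pe++) {      /* all PEs act in lock step     */
            switch (program[pc].op) {
            case LOAD_CONST: reg[pe]  = program[pc].operand; break;
            case ADD_LOCAL:  reg[pe] += local[pe];           break;
            case MUL_CONST:  reg[pe] *= program[pc].operand; break;
            }
        }
    }
    for (int pe = 0; pe < NPE; pe++) printf("PE %d: %g\n", pe, reg[pe]);
    return 0;
}
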
- Dataflow models employed fine-grained asynchronous flow control that depended only on data precedence constraints, thus exploiting a greater degree of parallelism and providing a dynamic adaptive scheduling mechanism in response to resource loading. Because they suffered from severe overhead degradation, however, dataflow computers were never competitive and failed to find market presence. Nonetheless, many of the concepts reflected by the dataflow paradigm have had a strong influence on modern compiler analysis and optimization, reservation stations in out-of-order instruction completion ALU designs, and multithreaded architectures.
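
The following sketch imitates the dataflow firing rule in software: three operation nodes computing (a+b)*(c-d) are listed deliberately out of program order, and each fires as soon as its operands become available, which is the data-precedence scheduling described above. The graph encoding and values are illustrative assumptions; real dataflow machines applied this rule in hardware at instruction granularity.

#include <stdio.h>

#define NVALS 7   /* slots for a, b, c, d, t1=a+b, t2=c-d, t3=t1*t2 */

typedef struct { char op; int in1, in2, out; } Node;

int main(void) {
    double val[NVALS] = {3, 4, 10, 2};             /* a, b, c, d          */
    int ready[NVALS]  = {1, 1, 1, 1, 0, 0, 0};     /* which slots hold data */
    Node node[3] = {                               /* listed out of order   */
        {'*', 4, 5, 6},                            /* t3 = t1 * t2 */
        {'+', 0, 1, 4},                            /* t1 = a + b   */
        {'-', 2, 3, 5},                            /* t2 = c - d   */
    };
    int fired[3] = {0};
    int done = 0;

    while (done < 3) {
        for (int i = 0; i < 3; i++) {   /* fire any node whose operands are ready */
            if (!fired[i] && ready[node[i].in1] && ready[node[i].in2]) {
                double x = val[node[i].in1], y = val[node[i].in2];
                val[node[i].out] = (node[i].op == '+') ? x + y :
                                   (node[i].op == '-') ? x - y : x * y;
                ready[node[i].out] = 1;
                fired[i] = 1;
                done++;
                printf("fired node %d (%c), result %g\n", i, node[i].op, val[node[i].out]);
            }
        }
    }
    printf("(a+b)*(c-d) = %g\n", val[6]);
    return 0;
}
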
- PIM (processor-in-memory) architectures are only just emerging as a possible force in high-end system structures, merging memory (DRAM or SRAM) with processing logic on the same integrated circuit to expose high on-chip memory bandwidth and low latency to memory for many data-oriented operations. Diverse structures are being pursued, including
- system on a chip, which places DRAM banks and a conventional processor core on the same chip;
- SMP on a chip, which places multiple conventional processor cores and a three-level coherent cache hierarchical structure on a single chip; and
- Smart Memory, which puts logic at the sense amps of the DRAM memory for in-place data manipulation.
PIMs can be used as standalone systems, in arrays of like devices, or as a smart layer of a larger conventional multiprocessor.
- MPPs (massively parallel processors) constitute a broad class of multiprocessor architectures that exploit off-the-shelf microprocessors and memory chips in custom designs of node boards, memory hierarchies, and global system area networks. Ironically, "MPP" was first used in the context of SIMD rather than MIMD (multiple instruction, multiple data) machines. MPPs range from distributed-memory machines such as the Intel Paragon, through shared memory without coherent caches such as the BBN Butterfly and CRI T3E, to true CC-NUMA (cache-coherent, nonuniform memory access) machines such as the HP Exemplar and the SGI Origin2000.
- Clusters are ensembles of off-the-shelf computers integrated by an interconnection network and operating within a single administrative domain, usually within a single machine room. (A minimal message-passing sketch for such a system follows the two variants below.)
- Commodity clusters employ commercially available networks (e.g., Ethernet, Myrinet) as opposed to custom networks (e.g., IBM SP-2).
- Beowulf-class clusters incorporate mass-market PC technology for their compute nodes to achieve the best price/performance.
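
As a minimal sketch of how work is typically expressed on commodity and Beowulf-class clusters, the MPI program below has every process report its rank and then combines one value per process with a global reduction; explicit message passing over the cluster network is the only way the nodes interact. It assumes an MPI installation (compiled with mpicc and launched with, e.g., mpirun -np 4), which this overview has not yet introduced.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's identity   */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    /* Each process owns one piece of the work: here, simply its own rank. */
    int local = rank;
    int total = 0;

    /* Partial results travel over the cluster network and are summed on
     * process 0; no memory is shared between nodes. */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    printf("process %d of %d\n", rank, size);
    if (rank == 0)
        printf("sum of ranks 0..%d = %d\n", size - 1, total);

    MPI_Finalize();
    return 0;
}
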
- Distributed computing, once referred to as "metacomputing", combines the processing capabilities of numerous, widely separated computer systems via the Internet. Whether accomplished by special arrangement among the participants, by means of disciplines referred to as Grid computing, or by agreements of myriad workstation and PC owners with some commercial (e.g., DSI, Entropia) or philanthropic (e.g., SETI@home) coordinating host organization, this class of parallel computing exploits available cycles on existing computers and PCs, thereby getting something for almost nothing.
Cem Ozdogan
2009-01-05