Performance Analysis of the Simultaneous Optical Multi-Processor Exchange Bus
Article code | Publication year | English paper length
---|---|---
27564 | 2001 | 37-page PDF
Publisher: Elsevier - Science Direct
Journal: Parallel Computing, Volume 27, Issue 8, July 2001, Pages 1079–1115
Abstract
The performance of a multi-computer system based on the simultaneous optical multi-processor exchange bus (SOME-Bus) interconnection network is examined using queuing network models under the message-passing and distributed-shared-memory (DSM) paradigms. The SOME-Bus is a low-latency, high-bandwidth, fiber-optic interconnection network which directly links arbitrary pairs of processor nodes without contention. It contains a dedicated channel for the data output of each node, eliminating the need for global arbitration and providing bandwidth that scales directly with the number of nodes in the system. Each of N nodes has an array of receivers, with one receiver dedicated to each node output channel. No node is ever blocked from transmitting by another transmitter or by contention for shared switching logic. The entire N-receiver array can be integrated on a single chip at a comparatively minor cost, resulting in O(N) complexity. By supporting multiple simultaneous broadcasts of messages, the SOME-Bus has much more functionality than a crossbar, allowing synchronization phases and cache-consistency protocols to complete much faster. Simulation results are presented which validate the theoretical results and compare processor utilization in the SOME-Bus, the crossbar and the torus, with and without synchronization. Compared to these two networks, the SOME-Bus performance is least affected by large message communication times. Even in the presence of frequent synchronization, processor utilization remains practically unaffected, while it drops in the other architectures. Although it has a larger number of channels than the crossbar and the mesh, the SOME-Bus is much simpler and less expensive because it is free of complex routing, congestion and blocking.
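The abstract's dedicated-channel property lends itself to a simple queuing sketch: since no node ever contends with another for its output channel, each channel can be approximated as an independent M/M/1 queue fed only by its own node. The following is a minimal illustration under assumed, hypothetical rates, not parameters taken from the paper:

```python
# Minimal M/M/1 sketch of one SOME-Bus output channel.
# Because each node owns a dedicated channel, messages queue only
# behind that node's own earlier messages -- never behind other nodes.
# All numeric values below are illustrative assumptions, not from the paper.

def mm1_wait(lam: float, mu: float) -> float:
    """Mean time in system (queueing + transmission) for an M/M/1 queue."""
    if lam >= mu:
        raise ValueError("unstable queue: arrival rate >= service rate")
    return 1.0 / (mu - lam)

# Example: 0.5 messages generated per time unit, channel serves 1 per unit.
lam, mu = 0.5, 1.0
rho = lam / mu                    # channel utilization
t_system = mm1_wait(lam, mu)     # mean message delay on the channel
print(f"utilization = {rho:.2f}, mean time in system = {t_system:.2f}")
```

Under this model the delay on one channel is independent of the number of nodes N, which is one way to see why the architecture's bandwidth scales with system size.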
Introduction
High performance computing is required for many applications, including simulation of physical phenomena, simulation of integrated circuits and neural networks, weather modeling, aerodynamics, and image processing. It increasingly relies on microprocessor-based computer nodes, which are rapidly becoming the technology of preference. Groups of these nodes are interconnected to form a distributed-memory multi-computer system. Such systems are scalable and capable of high computing power. Difficulties in the design and use of such systems arise because the required computers need interconnection networks with high bisection bandwidth and low latency to connect hundreds of nodes. Current massively parallel computers use small-degree networks with large diameters. Wormhole routing achieves lower latency when the network is not heavily loaded. Performance is often poor or moderate with many modern large-scale applications, mostly due to load imbalance, barrier synchronization, and communication patterns, such as all-to-all communication [26], which place an excessive load on the interconnection network. The major reason for this moderate success lies in the nature of currently available interconnection topologies (trees, hypercubes and mesh networks), regardless of their actual implementation medium, and in the mismatch between interconnection architecture and application structure. Most applications have irregular and dynamic communication patterns which change as the application progresses. A review of many researchers' experiences with large applications on modern multi-computers reveals that, even after extensive software tuning, only moderate success is encountered. Experiments have been performed on various multi-processor systems including the IBM SP-1 and SP-2, the Cray T3D, the Thinking Machines CM-5, the Intel Paragon and Delta, and the SGI Power Challenge XL.
Researchers have studied communication patterns and costs, and the effects of load imbalance on the performance of various applications, including atmospheric models (chemical tracer, general circulation), three-dimensional Navier–Stokes solvers, and N-body simulations. It has been observed that processing in this type of application is based on two-dimensional and three-dimensional FFTs and requires a series of complete exchange operations, as well as global reduction and gather operations. Individual steps are followed by global synchronizations. Researchers find moderate to severe performance degradation [19], increasing effects of load imbalance and communication costs as the number of processors is increased [16], processor utilization between 30% and 40% even when attempting to minimize interprocessor communication through the use of a substantial amount of (infrequently performed) local bookkeeping [34], or even with code that has been optimized for years [5]. There is a large amount of research on multi-cast communication in popular architectures with path-based broadcasting [27], and trees and (multi-destination) wormhole routing [13], [31] and [37]. Large efforts are focused on developing algorithms to mitigate the fact that intense multi-cast communication causes wormhole routing to resemble store-and-forward routing. Experiments on the Paragon, SP-2 and CS-2 using the multi-phase complete exchange are described in [10], where only poor or moderate agreement with theoretical predictions is observed and attributed to complex effects including link and switch contention and operating-system overhead. Barrier synchronization occurs frequently in programs for multi-processors, especially in those problems that can be solved using iterative methods. Efficient implementation of barrier synchronization in scalable multi-processors and multi-computers has received a lot of attention.
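The gap between barrier synchronization on a broadcast-capable network (where every node can announce its arrival on its own dedicated channel, all in parallel) and on a point-to-point network (where arrivals must be gathered and a release propagated, e.g. up and down a tree) can be sketched with a toy cost model. This is a rough illustration under an assumed constant message time, not the cost model used in the paper:

```python
# Toy barrier-synchronization cost model. t_msg is an assumed constant
# per-message time; both functions and their constants are illustrative
# assumptions, not figures from the paper.
import math

def barrier_broadcast(n: int, t_msg: float) -> float:
    # Each of the n nodes broadcasts its arrival on its own dedicated
    # channel; all broadcasts proceed simultaneously, so the barrier
    # completes in roughly one message time regardless of n.
    return t_msg

def barrier_tree(n: int, t_msg: float) -> float:
    # Point-to-point network: gather arrivals up a binary tree, then
    # propagate the release back down: about 2 * ceil(log2 n) steps.
    return 2 * math.ceil(math.log2(n)) * t_msg

print(barrier_broadcast(64, 1.0))  # 1.0
print(barrier_tree(64, 1.0))       # 12.0
```

The broadcast cost stays flat as n grows, while the tree cost grows logarithmically (and real networks add contention on top), which is consistent with the intuition behind the synchronization results discussed here.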
Synchronization based on multi-destination wormhole routing and additional hardware support at the network routers is presented in [43]. Additional hardware designs in support of efficient synchronization are presented in [47] and [55]. These designs increase the hardware complexity and cost of the router at each node. In distributed-shared-memory (DSM) systems, an important objective of current research is the development of approaches that minimize the access time to shared data while maintaining data consistency. Their performance depends on how restrictive the memory consistency model is. Models with strong restrictions result in increased access latency and network bandwidth requirements. Sophisticated models with weaker constraints have been proposed and implemented. They allow reordering, pipelining, and overlapping of memory accesses and consequently produce better performance, but also require that accesses to shared data be synchronized explicitly, resulting in higher programmer involvement and inconvenience. Although such models are useful, they require additional programmer effort, as in the message-passing paradigm. The success of DSM depends on its ability to free the programmer from any operations whose sole purpose is to support the memory model, and therefore it is critical that interconnection networks be developed, connecting hundreds of nodes with high bisection bandwidth and low latency, that have the least possible adverse impact on DSM performance. In [39], data indicate that the response time becomes unacceptably large when more than five threads are present in each processor. Large values of the remote memory request probability cause the interconnection network to saturate, resulting in processor utilization below 35%. Similar utilization is observed in [7] and [36], where the authors observe that utilization drops dramatically as the request probability increases, due to higher amounts of traffic and queuing delays.
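The sub-35% utilization figures cited above are consistent with a back-of-envelope model in which the processor computes for some mean time between remote requests and then stalls for the full remote latency. All numbers below are illustrative assumptions, not measurements from [7], [36] or [39]:

```python
# Toy model of DSM processor utilization: between remote requests the
# processor computes for t_compute time units; with probability p_remote
# a reference goes remote and stalls the processor for `latency` units.
# All parameter values are assumptions for illustration only.

def processor_utilization(t_compute: float, p_remote: float, latency: float) -> float:
    """Fraction of time spent computing rather than waiting on the network."""
    return t_compute / (t_compute + p_remote * latency)

# With 1 unit of compute per reference, 2% remote references, and a
# 100-unit remote latency, utilization already falls to about 33%.
print(processor_utilization(1.0, 0.02, 100.0))
```

The model makes the cited trend visible: utilization falls hyperbolically as either the remote-request probability or the network latency grows, which is why a low-latency, contention-free interconnect matters so much for DSM.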
Different degrees of speedup are observed in [50] depending on the application, with degradation attributed to irregular and dynamic data access patterns, global communication between processors, and load imbalance. Simulation is used in [29] to study the performance of DSM, and a strong negative effect of network latency on execution time is shown. In [2], the authors find much reduced performance in an application which performs matrix addition on distributed data and requires a significant amount of data-transfer time compared to node-intensive computation time. In [23], the authors report a study of four architectures with hardware support for shared memory. After making assumptions on the workload parameters used for comparison, they measure the impact of system size, temporal locality and spatial locality on processor utilization and shared-memory latency. Even though their assumptions appear optimistic, especially in characterizing the cache misses which result in traffic over the interconnection network (as evidenced by the very small network utilization in three architectures), they find significant latency compared to the CPU clock cycle. They also find that locality has a significant effect on processor utilization. In [15], the authors study two hardware-based prefetching schemes and evaluate their performance by simulation of a shared-memory multi-processor based on a 4×4 mesh network with wormhole routing. They use five popular applications and find cache misses in the range from 1% to 10%.
Conclusions
Excessive communication in many large-scale parallel applications results in severe performance degradation as system and application size increase. Extensive research on common interconnection networks indicates reasonable performance when the network is lightly loaded. Performance is drastically reduced when network traffic becomes heavier (due to more frequent cache misses, for example), leading some researchers to reject many applications as unsuitable for large-scale parallel systems. It is critical that interconnection networks with better properties be developed, especially since applications are becoming larger and can potentially be distributed over a larger number of processors, with the resulting heavier network traffic. Advances in optical technology have made the SOME-Bus interconnection network a realistic, highly competitive candidate which achieves the necessary high bandwidth, low latency and large fan-out, and promises to deliver the necessary performance. Its power is due to the fact that each processor is one unit distance away from any other processor, with communications requiring no routing and no arbitration. No node is ever blocked from sending a message to any other node. The bandwidth of the system scales with the number of processors, and as optical technology matures, the available bandwidth will increase. Fairly modest implementations provide bandwidths comparable to the capabilities of other state-of-the-art architectures. Its simplicity results in reduced cost; although it behaves as a fully connected network, the SOME-Bus grows by O(N) and minimizes both the number of expensive transmitters and the cost of the highly replicated receivers. The system operation at the receiving end is further enhanced by the design of the receiver logic to filter out messages that do not need to be serviced by the processor.
The receiver logic is designed to efficiently support the common programming models used on parallel computers and can be easily realized using current VLSI technology. Consequently, parallel programming is greatly simplified. Theoretical and simulation results are used to compare the SOME-Bus architecture to other popular architectures in order to examine the extent of support provided to modern large-scale applications which make use of task allocation and barrier synchronization. The simulation results clearly indicate that the SOME-Bus architecture is much less sensitive to heavy traffic and offers distinct advantages over other architectures. As the number of processors and the size of the transferred data increase, waiting times within the system are stable or increase slowly in SOME-Bus systems, while they increase rapidly in other systems. Even under very heavy message traffic, processor utilization in a SOME-Bus-based system stays above 70%, while latency is almost negligible compared to the latency observed in the torus and crossbar architectures. Under these conditions, the useful work produced by other systems deteriorates rapidly, but remains stable in the SOME-Bus system. In addition, in the DSM paradigm, invalidation messages generated by write accesses to local memory by the local processor create little interference with the other messages transferred on the SOME-Bus and therefore cause no reduction in performance. This property is not observed in other common interconnection networks.